Re: [patch] mm, memcg: add oom killer delay
On Fri 14-06-13 03:12:52, David Rientjes wrote:
> On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote:
> > Reading your discussion, I think I understand your requirements.
> > The problem is that I can't think you took all options into
> > account and found the best way is this new oom_delay. IOW, I can't
> > convince myself that oom-delay is the best way to handle your issue.
> >
> > Ok, let's talk about it.
> >
> > Your requirement is
> >  - Allowing a userland oom-handler within the local memcg.
>
> Another requirement:
>  - Allow a userland oom handler for global oom conditions.
>
> Hopefully that's hooked into memcg because the functionality is already
> there; we can simply duplicate all of the oom functionality that we'll
> be adding for the root memcg.

I understand this, and that's why I mentioned enabling oom_control for
the root memcg at LSF. I liked the idea at the time, but the more I think
about it the more I am scared of all the consequences. Maybe my fear is
not justified, as an in-kernel handler (scripting, or a module
registering its own handler) would be fragile in similar ways. I would
like to see this discussed before we enable the memcg way. I _do_ agree
that we really _want_ a way to provide a custom oom handler.

> > Considering it straightforwardly, the answer should be
> >  - Allowing the oom-handler daemon out of the memcg's control by its
> >    limit. (For example, a flag/capability for a task can achieve this.)
> >    Or attaching some *fixed* resource to the task rather than the cgroup.
> >
> >    Allow setting task->secret_saving=20M.
>
> Exactly!

I thought your users are untrusted, so you cannot give them any
additional reserves. Or is this only about the root cgroup?

> First of all, thanks very much for taking an interest in our usecase and
> discussing it with us.
>
> I didn't propose what I referred to earlier in the thread as "memcg
> reserves" because I thought it was going to be a more difficult battle.
> The fact that you brought it up first actually makes me think it's less
> insane :)

Reserves for the oom handler don't sound like a bad idea, but they assume
that the handler is trusted.

> We do indeed want memcg reserves and I have patches to add it if you'd
> like to see that first. It ensures that this userspace oom handler can
> actually do some work in determining which process to kill. The reserve
> is a fraction of true memory reserves (the space below the per-zone min
> watermarks) which is dependent on min_free_kbytes. This does indeed
> become more difficult with true and complete kmem charging. That "work"
> could be opening the tasks file (which allocates the pidlist within the
> kernel),

I am not familiar with the details of why it has to allocate memory, but
I guess we need to be able to read the tasks file without any in-kernel
allocations.

> checking /proc/pid/status for rss,

This doesn't require an in-kernel allocation AFAICS.

> checking for how long a process has been running, checking for tid,
> sending a signal to drop caches, etc.
>
> We'd also like to do this for global oom conditions, which makes it even
> more interesting. I was thinking of using a fraction of memory reserves
> as the oom killer currently does (that memory below the min watermark)
> for these purposes.
>
> Memory charging is simply bypassed for these oom handlers (we only grant
> access to those waiting on the memory.oom_control eventfd) up to
> memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't
> think this is entirely insane because these oom handlers should lead to
> future memory freeing, just like TIF_MEMDIE processes.
>
> > Going back to your patch, what's confusing is your approach.
> > Why should a problem caused by an amount of memory be solved by
> > some delay, i.e. an amount of time?
> >
> > This exchange sounds confusing to me.
>
> Even with all of the above (which is not actually that invasive of a
> patch), I still think we need memory.oom_delay_millisecs.
I am still not convinced. I have already mentioned that this can be
handled from userspace. A simple watchdog could sit on the oom_control
eventfd and trigger a timer (rescheduling it when a new event arrives
while one is already scheduled); when the timer fires, it reads
oom_control to check under_oom before it enables in-kernel oom handling
again. None of this needs any memory allocation, so there is nothing to
prevent it from resurrecting the system. So I really do not see a reason
for a new knob.

> I probably made a mistake in describing what that is addressing if it
> seems like it's trying to address any of the above.
>
> If a userspace oom handler fails to respond even with access to those
> "memcg reserves", the kernel needs to kill within that memcg. Do we do
> that above a set time period (this patch) or when the reserves are
> completely exhausted? That's debatable, but if we are to allow it for
> global oom conditions as well then my opinion was to make it as safe as
> possible; today, we can't disable the global oom killer from userspace
> and I don't think we should ever allow it to be disabled.
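[Editorial sketch, not part of the thread: the watchdog described in this message boils down to a small state machine. The eventfd/cgroupfs plumbing is elided and all names here (`read_under_oom`, `enable_kernel_oom`) are hypothetical, injected callbacks, so the rearm-on-new-event logic can be shown on its own:]

```python
class OomWatchdog:
    """Sketch of the watchdog described above: each oom_control eventfd
    notification (re)arms a deadline; when the deadline expires we check
    under_oom and, if the memcg is still stuck, re-enable the in-kernel
    OOM killer. The callbacks are hypothetical stand-ins for reading
    memory.oom_control and writing 0 back to it."""

    def __init__(self, delay, read_under_oom, enable_kernel_oom):
        self.delay = delay
        self.read_under_oom = read_under_oom
        self.enable_kernel_oom = enable_kernel_oom
        self.deadline = None  # no timer armed yet

    def on_oom_event(self, now):
        # New eventfd notification: arm the timer, or push an already
        # armed timer back so userspace gets a full delay again.
        self.deadline = now + self.delay

    def poll(self, now):
        # Called periodically; fires the timer once it expires.
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            if self.read_under_oom():
                # The handler never resolved the OOM: hand control back
                # to the kernel so it can kill something.
                self.enable_kernel_oom()
```

Note that nothing in this loop allocates after startup, which is the point being made: the watchdog itself cannot be wedged by the OOM condition it supervises.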
Re: [patch] mm, memcg: add oom killer delay
On Tue, 25 Jun 2013, Kamezawa Hiroyuki wrote:
> Considering only memcg, bypassing all charge-limit-check will work.
> But as you say, that will not work against global-oom.

I think it will, since we have per-zone memory reserves that can be
bypassed in the page allocator, not to the level of PF_MEMALLOC or
TIF_MEMDIE, but perhaps to the min watermark / 4, for example. A
userspace global oom handler will obviously be mlocked in memory, and
this reserve is used only for true kmem accounting so that reading
things like the memcg tasks file or /proc/pid/stat works, or so it can
dynamically allocate a buffer to store data to iterate over. This is why
memory.oom_delay_millisecs is crucial: we want the same functionality
that the "user root" has for global oom conditions at the memcg root
level, and in case reserves are exhausted, the kernel will kill
something (which should be rare, but possible) and use the rest of the
memory reserves to allow it to exit.

> > Even with all of the above (which is not actually that invasive of a
> > patch), I still think we need memory.oom_delay_millisecs. I probably
> > made a mistake in describing what that is addressing if it seems like
> > it's trying to address any of the above.
> >
> > If a userspace oom handler fails to respond even with access to those
> > "memcg reserves",
>
> How this happens ?

If the memcg reserves are exhausted, then the kernel needs to kill
something even in global oom conditions (consider a "user root" memcg
tree to be the same as a global oom condition for processes attached to
that tree) since otherwise the machine hangs. There's no guarantee that
some root process sitting in the root memcg would be able to enforce
this delay as Michal suggests, since reserves could be depleted. It's
important that we don't do something as extreme as PF_MEMALLOC, where
all per-zone reserves are depleted, so that the kernel can still
intervene and kill something when userspace is unresponsive.
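[Editorial sketch, not part of the thread: the reserve bypass discussed here amounts to a second charge threshold that only a notified oom handler may use. A minimal illustration; the min_free_kbytes/4 fraction comes from the proposal above, everything else (names, units in pages) is invented:]

```python
def may_charge(usage, limit, nr_pages, is_oom_handler, reserve_pages):
    """Hypothetical charge check: normal tasks stop at the memcg limit,
    while a task waiting on the memory.oom_control eventfd may dip into
    a small reserve (a quarter of min_free_kbytes, per the proposal) so
    it can still read the tasks file etc. while the group is at its
    limit. All quantities are in pages."""
    if usage + nr_pages <= limit:
        return True  # ordinary charge, under the limit
    if is_oom_handler:
        # Mirrors "memory.limit_in_bytes + (min_free_kbytes / 4)".
        return usage + nr_pages <= limit + reserve_pages // 4
    return False  # over limit and not a privileged handler: OOM path
```

The design point is that the reserve is bounded and handler-only, so a runaway handler cannot consume more than the extra quarter of min_free_kbytes before the kernel's own OOM path takes over.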
> Someone may be against that kind of control and say "Hey, I have a
> better idea". That was another reason that oom-scripting was discussed.
> No one can implement general-purpose victim-selection logic.

Completely agreed, and our experience shows that users who manipulate
their own "user root" memcgs have their own logic. This is why we're
trying to make userspace oom handling as powerful as possible without
risking making the machine unresponsive.

> IMHO, it will be difficult, but allowing a script/filter to be written
> for oom-killing will be worth trying. Like:
>
> ==
> for_each_process:
>     if comm == mem_manage_daemon:
>         continue
>     if user == root:
>         continue
>     score = default_calc_score()
>     if score > high_score:
>         selected = current
> ==

This is effectively what already happens with the oom delay as proposed
here: the userspace oom handler is given access to "memcg reserves" and
a period of time to respond; if that fails, then the kernel will kill
something the next time we try to charge to the memcg.

> BTW, if you love the logic in the userland oom daemon, why can't you
> implement it in the kernel? Does it do some pretty things other than
> sending SIGKILL?

Some do "pretty" things like collecting stats and dumping them before
killing something, but we also want oom handlers that don't send SIGKILL
at all. An example: we statically allocate hugepages at boot because we
need a large percentage of memory to be backed by hugepages for a
certain class of applications and it's only available at boot. We also
have a userspace that runs on these machines that is shared between
hugepage machines and non-hugepage machines. At times, this userspace
becomes oom because the remainder of available memory is allocated as
hugetlb pages when in reality they are unmapped and sitting in a free
pool.
In that case, our userspace oom handler wants to free those hugetlb
pages back to the kernel down to a certain watermark and then
opportunistically reallocate them to the pool when memory usage on the
system is lower due to spikes in the userspace.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
(2013/06/14 19:12), David Rientjes wrote:
> On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote:
> > Reading your discussion, I think I understand your requirements.
> > The problem is that I can't think you took all options into
> > account and found the best way is this new oom_delay. IOW, I can't
> > convince myself that oom-delay is the best way to handle your issue.
> >
> > Ok, let's talk about it.

I'm sorry that my round-trip time has been long these days.

> > Your requirement is
> >  - Allowing a userland oom-handler within the local memcg.
>
> Another requirement:
>  - Allow a userland oom handler for global oom conditions.
>
> Hopefully that's hooked into memcg because the functionality is already
> there; we can simply duplicate all of the oom functionality that we'll
> be adding for the root memcg.

At the mm-summit this was discussed, and people seemed to think a
userland oom handler is impossible. Hm, and in-kernel scripting was
discussed, as far as I remember.

> > Considering it straightforwardly, the answer should be
> >  - Allowing the oom-handler daemon out of the memcg's control by its
> >    limit. (For example, a flag/capability for a task can achieve this.)
> >    Or attaching some *fixed* resource to the task rather than the cgroup.
> >
> >    Allow setting task->secret_saving=20M.
>
> Exactly!
>
> First of all, thanks very much for taking an interest in our usecase and
> discussing it with us.
>
> I didn't propose what I referred to earlier in the thread as "memcg
> reserves" because I thought it was going to be a more difficult battle.
> The fact that you brought it up first actually makes me think it's less
> insane :)
>
> We do indeed want memcg reserves and I have patches to add it if you'd
> like to see that first. It ensures that this userspace oom handler can
> actually do some work in determining which process to kill. The reserve
> is a fraction of true memory reserves (the space below the per-zone min
> watermarks) which is dependent on min_free_kbytes. This does indeed
> become more difficult with true and complete kmem charging.
That "work" could be opening the tasks file (which allocates the pidlist within the kernel), checking /proc/pid/status for rss, checking for how long a process has been running, checking for tid, sending a signal to drop caches, etc. Considering only memcg, bypassing all charge-limit-check will work. But as you say, that will not work against global-oom. Then, in-kernel scripting was discussed. We'd also like to do this for global oom conditions, which makes it even more interesting. I was thinking of using a fraction of memory reserves as the oom killer currently does (that memory below the min watermark) for these purposes. Memory charging is simply bypassed for these oom handlers (we only grant access to those waiting on the memory.oom_control eventfd) up to memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think this is entirely insane because these oom handlers should lead to future memory freeing, just like TIF_MEMDIE processes. I think that kinds of bypassing is acceptable. Going back to your patch, what's confusing is your approach. Why the problem caused by the amount of memory should be solved by some dealy, i.e. the amount of time ? This exchanging sounds confusing to me. Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those "memcg reserves", How this happens ? the kernel needs to kill within that memcg. Do we do that above a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. 
> I think we should allow userspace a reasonable amount of time to
> respond and then kill if it is exceeded.
>
> For the global oom case, we want to have a priority-based memcg
> selection. Select the lowest priority top-level memcg and kill within
> it. If it has an oom notifier, send it a signal to kill something. If
> it fails to react, kill something after memory.oom_delay_millisecs has
> elapsed. If there isn't a userspace oom notifier, kill something within
> that lowest priority memcg.

Someone may be against that kind of control and say, "Hey, I have a
better idea". That was another reason that oom-scripting was discussed.
No one can implement general-purpose victim-selection logic.

> The bottomline with my approach is that I don't believe there is ever a
> reason for an oom memcg to remain oom indefinitely. That's why I hate
> memory.oom_control == 1 and I think for the global notification it
> would be deemed a nonstarter since you couldn't even login to the
> machine.

I'm not against what you finally want to do, but I don't like
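[Editorial sketch, not part of the thread: the priority-based selection quoted above (lowest-priority top-level memcg first, notify, then kill after the delay) can be written out as a small decision function. All field names here are hypothetical:]

```python
def pick_oom_action(memcgs, now):
    """Decide what a global OOM should do under the proposed policy.
    memcgs: dicts with 'priority', 'has_notifier', 'notified_at'
    (None until the handler's eventfd was signalled) and 'oom_delay'.
    Returns the chosen memcg and one of 'kernel_kill'/'notify'/'wait'."""
    victim = min(memcgs, key=lambda m: m["priority"])  # lowest priority first
    if not victim["has_notifier"]:
        return victim, "kernel_kill"       # no userspace handler: kill now
    if victim["notified_at"] is None:
        return victim, "notify"            # give the handler a chance first
    if now - victim["notified_at"] >= victim["oom_delay"]:
        return victim, "kernel_kill"       # handler failed to react in time
    return victim, "wait"                  # memory.oom_delay_millisecs running
```

The invariant this encodes is the one argued for above: every path eventually reaches `kernel_kill`, so no memcg can remain OOM indefinitely.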
Re: [patch] mm, memcg: add oom killer delay
On Fri, 14 Jun 2013, David Rientjes wrote:
> Even with all of the above (which is not actually that invasive of a
> patch), I still think we need memory.oom_delay_millisecs. I probably
> made a mistake in describing what that is addressing if it seems like
> it's trying to address any of the above.
>
> If a userspace oom handler fails to respond even with access to those
> "memcg reserves", the kernel needs to kill within that memcg. Do we do
> that above a set time period (this patch) or when the reserves are
> completely exhausted? That's debatable, but if we are to allow it for
> global oom conditions as well then my opinion was to make it as safe as
> possible; today, we can't disable the global oom killer from userspace
> and I don't think we should ever allow it to be disabled. I think we
> should allow userspace a reasonable amount of time to respond and then
> kill if it is exceeded.
>
> For the global oom case, we want to have a priority-based memcg
> selection. Select the lowest priority top-level memcg and kill within
> it. If it has an oom notifier, send it a signal to kill something. If
> it fails to react, kill something after memory.oom_delay_millisecs has
> elapsed. If there isn't a userspace oom notifier, kill something within
> that lowest priority memcg.
>
> The bottomline with my approach is that I don't believe there is ever a
> reason for an oom memcg to remain oom indefinitely. That's why I hate
> memory.oom_control == 1 and I think for the global notification it
> would be deemed a nonstarter since you couldn't even login to the
> machine.

What's the status of this patch? I'd love to be able to introduce memcg
reserves so that userspace oom notifications can actually work, but we
still require a failsafe in the kernel if those reserves are depleted or
userspace cannot respond.
Re: [patch] mm, memcg: add oom killer delay
On Fri, 14 Jun 2013, David Rientjes wrote: Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those memcg reserves, the kernel needs to kill within that memcg. Do we do that above a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. I think we should allow userspace a reasonable amount of time to respond and then kill if it is exceeded. For the global oom case, we want to have a priority-based memcg selection. Select the lowest priority top-level memcg and kill within it. If it has an oom notifier, send it a signal to kill something. If it fails to react, kill something after memory.oom_delay_millisecs has elapsed. If there isn't a userspace oom notifier, kill something within that lowest priority memcg. The bottomline with my approach is that I don't believe there is ever a reason for an oom memcg to remain oom indefinitely. That's why I hate memory.oom_control == 1 and I think for the global notification it would be deemed a nonstarter since you couldn't even login to the machine. What's the status of this patch? I'd love to be able to introduce memcg reserves so that userspace oom notifications can actually work, but we still require a failsafe in the kernel if those reserves are depleted or userspace cannot respond. 
Re: [patch] mm, memcg: add oom killer delay
On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote:

> Reading your discussion, I think I understand your requirements.
> The problem is that I can't think you took all options into account
> and found that the best way is this new oom_delay. IOW, I can't
> convince myself that oom-delay is the best way to handle your issue.
>
> Ok, let's talk about it.
>
> Your requirement is
>  - Allowing a userland oom-handler within a local memcg.

Another requirement:

 - Allow a userland oom handler for global oom conditions.

Hopefully that's hooked into memcg because the functionality is already there; we can simply duplicate all of the oom functionality that we'll be adding for the root memcg.

> Considering it straightforwardly, the answer should be
>  - Allowing the oom-handler daemon out of the memcg's control by its limit.
>    (For example, a flag/capability for a task can achieve this.)
>    Or attaching some *fixed* resource to the task rather than the cgroup.
>
>    Allow setting task->secret_saving=20M.

Exactly!

First of all, thanks very much for taking an interest in our usecase and discussing it with us.

I didn't propose what I referred to earlier in the thread as "memcg reserves" because I thought it was going to be a more difficult battle. The fact that you brought it up first actually makes me think it's less insane :)

We do indeed want memcg reserves, and I have patches to add them if you'd like to see that first. They ensure that this userspace oom handler can actually do some work in determining which process to kill. The reserve is a fraction of true memory reserves (the space below the per-zone min watermarks), which is dependent on min_free_kbytes. This does indeed become more difficult with true and complete kmem charging. That "work" could be opening the tasks file (which allocates the pidlist within the kernel), checking /proc/pid/status for rss, checking how long a process has been running, checking for tid, sending a signal to drop caches, etc.
We'd also like to do this for global oom conditions, which makes it even more interesting. I was thinking of using a fraction of memory reserves, as the oom killer currently does (that memory below the min watermark), for these purposes.

Memory charging is simply bypassed for these oom handlers (we only grant access to those waiting on the memory.oom_control eventfd) up to memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think this is entirely insane because these oom handlers should lead to future memory freeing, just like TIF_MEMDIE processes.

> Going back to your patch, what's confusing is your approach.
> Why should the problem caused by the amount of memory be solved by
> some delay, i.e. an amount of time?
>
> This exchange sounds confusing to me.

Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those "memcg reserves", the kernel needs to kill within that memcg. Do we do that after a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well, then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. I think we should allow userspace a reasonable amount of time to respond and then kill if it is exceeded.

For the global oom case, we want priority-based memcg selection: select the lowest-priority top-level memcg and kill within it. If it has an oom notifier, signal it to kill something. If it fails to react, kill something after memory.oom_delay_millisecs has elapsed. If there is no userspace oom notifier, kill something within that lowest-priority memcg.
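[Editorial note: the proposed charging bypass for registered oom handlers can be sketched as below. This is an illustrative model, not kernel code; the byte conversion of min_free_kbytes and the function names are my assumptions.]

```python
def charge_allowed(usage, charge, limit_in_bytes, min_free_kbytes, is_oom_handler):
    """Sketch of the proposed bypass: ordinary tasks charge against
    memory.limit_in_bytes, while a task waiting on the memory.oom_control
    eventfd may dip into a reserve of min_free_kbytes / 4 beyond the
    limit (min_free_kbytes is in kB, hence the * 1024)."""
    effective_limit = limit_in_bytes
    if is_oom_handler:
        effective_limit += (min_free_kbytes * 1024) // 4
    return usage + charge <= effective_limit
```

The idea mirrors TIF_MEMDIE: the overrun is tolerated because the handler's work is expected to free memory shortly.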
The bottom line with my approach is that I don't believe there is ever a reason for an oom memcg to remain oom indefinitely. That's why I hate memory.oom_control == 1, and I think for the global notification it would be deemed a nonstarter since you couldn't even log in to the machine.

> I'm not against what you finally want to do, but I don't like the fix.

I'm thrilled to hear that, and I hope we can work to make userspace oom handling more effective. What do you think about the above?
Re: [patch] mm, memcg: add oom killer delay
(2013/06/14 7:25), David Rientjes wrote:

> On Thu, 13 Jun 2013, Michal Hocko wrote:
> > > That's not at all the objective, the changelog quite explicitly
> > > states this is a deadlock as the result of userspace having disabled
> > > the oom killer so that its userspace oom handler can resolve the
> > > condition and it being unresponsive or unable to perform its job.
> >
> > Ohh, so another round. Sigh. You insist on having user space handlers
> > running in the context of the limited group. OK, I can understand your
> > use case, although I think it is pushing the limits of the interface
> > and it is dangerous.
>
> Ok, this is where our misunderstanding is, and I can see why you have
> reacted the way you have. It's my fault for not describing where we're
> going with this.

Reading your discussion, I think I understand your requirements. The problem is that I can't think you took all options into account and found that the best way is this new oom_delay. IOW, I can't convince myself that oom-delay is the best way to handle your issue.

Your requirement is
 - Allowing a userland oom-handler within a local memcg.

Considering it straightforwardly, the answer should be
 - Allowing the oom-handler daemon out of the memcg's control by its limit.
   (For example, a flag/capability for a task can achieve this.)
   Or attaching some *fixed* resource to the task rather than the cgroup.

   Allow setting task->secret_saving=20M.

Going back to your patch, what's confusing is your approach. Why should the problem caused by an amount of memory be solved by some delay, i.e. an amount of time?

This exchange sounds confusing to me. I'm not against what you finally want to do, but I don't like the fix.

Thanks,
-Kame
Re: [patch] mm, memcg: add oom killer delay
On Thu, 13 Jun 2013, Michal Hocko wrote:

> > That's not at all the objective, the changelog quite explicitly states
> > this is a deadlock as the result of userspace having disabled the oom
> > killer so that its userspace oom handler can resolve the condition and
> > it being unresponsive or unable to perform its job.
>
> Ohh, so another round. Sigh. You insist on having user space handlers
> running in the context of the limited group. OK, I can understand your
> use case, although I think it is pushing the limits of the interface and
> it is dangerous.

Ok, this is where our misunderstanding is, and I can see why you have reacted the way you have. It's my fault for not describing where we're going with this.

Userspace oom handling right now is very inadequate, not only for the reasons that you and Johannes have described. The requirement that userspace oom handlers must run with more memory than the memcgs they are monitoring is where the pain is.

We have to accept the fact that users can be given control of their own memcg trees by chowning that "user root" directory to the user. The only real restriction you must enforce is that the limit cannot be increased unless you have a special capability. Otherwise, users can be given complete control over their own tree and enforce their own memory limits (or just accounting). I haven't seen any objection to that usecase from anybody, so I'm assuming it's something that is understood and acknowledged. Nothing in the cgroups code prevents it, and I believe it gives the user a much more powerful interface to control their own destiny, much like setting up root-level memcgs does for the admin.

The problem is that we want to control global oom conditions as well. oom_score_adj simply doesn't allow for user-specific implementations, either within user subcontainers or at the root.
Users may very well want to kill the allocating task in their memcg to avoid expensive scans (SGI required oom_kill_allocating_task at a global level), kill the largest memory consumer, do priority-based memcg scoring, kill the newest process, etc. That's the power of user oom notifications beyond just statistics collection or creating a tombstone to examine later. We want to do that at the root level as well for global oom conditions.

We have a priority-based scoring method amongst top-level memcgs. They are jobs that are scheduled on machines and given a certain priority to run at. We also overcommit the machine, so the sum of all top-level memory.limit_in_bytes exceeds the amount of physical RAM, since jobs seldom use their entire reservation. When RAM is exhausted, we want to kill the lowest-priority process from the lowest-priority memcg. That's not possible to do currently because we lack global oom notification and userspace oom handling. We want to change that by notifying on the root memcg's memory.oom_control for global oom and waking up a process that will handle global oom conditions. Memcg is the perfect vehicle for this since it already does userspace oom notification and because the oom condition in a "user root" can be handled the same as a "global oom".

Waking up userspace and asking it to do anything useful in a global oom condition is problematic if it is unresponsive for whatever reason. Especially with true kmem accounting of all slab, we cannot be guaranteed that this global oom handler can actually do things like query the memory usage of a process or read a memcg's tasks file. For this reason, we want a kernel failsafe to ensure that if it isn't able to respond in a reasonable amount of time, the kernel frees memory itself. There's simply no alternative to something like memory.oom_delay_millisecs at the root level.
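[Editorial note: the priority-based victim selection described above reduces to picking the lowest-priority top-level memcg. A minimal sketch, where the priority encoding (lower number = lower priority) and the function name are assumptions of this illustration:]

```python
def pick_oom_victim_memcg(memcgs):
    """Among top-level memcgs, choose the lowest-priority one as the
    group to reclaim from on global oom.  `memcgs` maps a memcg name to
    its configured priority; a lower number means lower priority."""
    return min(memcgs, key=memcgs.get)
```

A global oom handler would then either signal that memcg's own oom notifier or kill the lowest-priority process inside it.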
For notification to actually work, we need to prevent the kernel, at least for a certain amount of time, from killing: the solution right now is to disable the oom killer entirely with memory.oom_control == 1. If the global oom handler fails to respond, the machine is essentially dead unless something starts freeing memory for some strange reason.

For our own priority-based memcg oom scoring implementation, this global oom handler would then signal the memcg oom handler in the lowest-priority memcg, if it exists, to handle the situation, or it would kill something from that memcg itself. The key is that without some sort of kernel-imposed delay, there's no way to sanely implement a global oom handler that doesn't risk killing the entire machine because the kernel oom killer is deferred. This patch makes that delay configurable.

I could certainly push the patch that does the global oom notification on the root-level memcg's memory.oom_control first if you'd like, but I didn't think there would be so much opposition to running a memcg oom handler in the memcg itself, which is what this essentially is. If that's done, then we can't simply rely on the global
Re: [patch] mm, memcg: add oom killer delay
On Wed 12-06-13 14:27:05, David Rientjes wrote:

> On Wed, 12 Jun 2013, Michal Hocko wrote:
> > But the objective is to handle oom deadlocks gracefully and you cannot
> > possibly miss those as they are, well, _deadlocks_.
>
> That's not at all the objective, the changelog quite explicitly states
> this is a deadlock as the result of userspace having disabled the oom
> killer so that its userspace oom handler can resolve the condition and
> it being unresponsive or unable to perform its job.

Ohh, so another round. Sigh. You insist on having user space handlers running in the context of the limited group. OK, I can understand your use case, although I think it is pushing the limits of the interface and it is dangerous. As the problems/deadlocks are unavoidable with this approach, you really need a backup plan to reduce the damage once they happen. You insist on an in-kernel solution while a user space alternative is possible IMO.

> When you allow users to create their own memcgs, which we do and is
> possible by chowning the user's root to be owned by it, and implement
> their own userspace oom notifier, you must then rely on their
> implementation to work 100% of the time, otherwise all those gigabytes
> of memory go unfreed forever. What you're insisting on is that this
> userspace is perfect

No, I am not saying that. You can let your untrusted users handle their OOMs as you like. The watchdog (your fallback solution) _has_ to be trusted, though, and if you want it to do the job then you had better implement it correctly. Is this requirement a problem?

> and there is never any memory allocated (otherwise it may oom its own
> user root memcg where the notifier is hosted)

If the watchdog runs under the root memcg then there is no limit, so the only OOM that might happen is the global one, and you can protect the watchdog with oom_adj as I have mentioned earlier.

> and it is always responsive and able to handle the situation.
If it is trusted code then it can run with a real-time priority.

> This is not reality.

It seems you just do not want to accept that there is another solution because the kernel solution sounds like the easier option for you.

> This is why the kernel has its own oom killer and doesn't wait for a
> user to go kill something. There's no option to disable the kernel oom
> killer. It's because we don't want to leave the system in a state where
> no progress can be made. The same intention is for memcgs to not be
> left in a state where no progress can be made even if userspace has the
> best intentions.
>
> Your solution of a global entity to prevent these situations doesn't
> work for the same reason we can't implement the kernel oom killer in
> userspace. It's the exact same reason.

No, it is not! The core difference is that there is _always_ some memory for the watchdog (because something else might be killed to free some memory) while there is none for the global OOM. So the watchdog doesn't even need to mlock everything in.

> We also want to push patches that allow global oom conditions to
> trigger an eventfd notification on the root memcg with the exact same
> semantics as a memcg oom:

As already mentioned, we discussed this at LSF. I am still not sure it is the right thing to do, though. The interface would be too tricky. There are other options for implementing a user-defined policy for the global OOM, and they should be considered before there is a decision to push it into memcg.

> allow it time to respond but step in and kill something if it fails to
> respond. Memcg happens to be the perfect place to implement such a
> userspace policy and we want to have a priority-based killing mechanism
> that is hierarchical and different from oom_score_adj.

It might sound perfect for your use cases, but we should be _really_ careful not to pull another tricky interface into memcg that fits only a certain scenario. Remember the use_hierarchy thingy?
It sounded like a good idea at the time and it turned into a nightmare over time. It also aimed at solving a restriction that existed at the time. The restriction is not here anymore AFAICT, but we have a crippled hierarchy semantic.

> For that to work properly, it cannot possibly allocate memory even on
> page fault so it must be mlocked in memory and have enough buffers to
> store the priorities of top-level memcgs. Asking a global watchdog to
> sit there mlocked in memory to store thousands of memcgs, their
> priorities, their last oom, their timeouts, etc, is a non-starter.

I have asked you for an estimation already. I do not think that the memory consumption would really matter here. We are talking about a few megs at most, and even that is exaggerated.

> I don't buy your argument that we're pushing any interface to an extreme.

OK, call it a matter of taste, but handling oom in the context of the oom itself, without any requirements on the handler implementation and without access to any memory reserves because
Re: [patch] mm, memcg: add oom killer delay
On Wed, 12 Jun 2013, Michal Hocko wrote:
> But the objective is to handle oom deadlocks gracefully and you cannot
> possibly miss those as they are, well, _deadlocks_.

That's not at all the objective; the changelog quite explicitly states this is a deadlock as the result of userspace having disabled the oom killer so that its userspace oom handler can resolve the condition, and it being unresponsive or unable to perform its job.

When you allow users to create their own memcgs, which we do and is possible by chowning the user's root to be owned by it, and to implement their own userspace oom notifier, you must then rely on their implementation to work 100% of the time, otherwise all those gigabytes of memory go unfreed forever. What you're insisting on is that this userspace is perfect and never has any memory allocated (otherwise it may oom its own user root memcg where the notifier is hosted) and is always responsive and able to handle the situation. This is not reality.

This is why the kernel has its own oom killer and doesn't wait for a user to go kill something. There's no option to disable the kernel oom killer. It's because we don't want to leave the system in a state where no progress can be made. The same intention applies to memcgs: they should not be left in a state where no progress can be made, even if userspace has the best intentions.

Your solution of a global entity to prevent these situations doesn't work for the same reason we can't implement the kernel oom killer in userspace. It's the exact same reason.

We also want to push patches that allow global oom conditions to trigger an eventfd notification on the root memcg with the exact same semantics as a memcg oom: allow it time to respond but step in and kill something if it fails to respond. Memcg happens to be the perfect place to implement such a userspace policy, and we want to have a priority-based killing mechanism that is hierarchical and different from oom_score_adj.
For that to work properly, it cannot possibly allocate memory even on page fault, so it must be mlocked in memory and have enough buffers to store the priorities of top-level memcgs. Asking a global watchdog to sit there mlocked in memory storing thousands of memcgs, their priorities, their last oom, their timeouts, etc., is a non-starter.

I don't buy your argument that we're pushing any interface to an extreme. Users having the ability to manipulate their own memcgs and subcontainers isn't extreme, it's explicitly allowed by cgroups! What we're asking is that this level of control for memcg be sane and that, if userspace is unresponsive, we don't lose gigabytes of memory forever. And since we've supported this type of functionality for cpusets even before memcg was created, and have used and supported it for six years, I have no problem supporting such a thing upstream.

I do understand that we're the largest user of memcg and use it unlike you or others on this thread do, but that doesn't mean our usecase is any less important or that we shouldn't aim for the most robust behavior possible.
Re: [patch] mm, memcg: add oom killer delay
On Tue 11-06-13 13:33:40, David Rientjes wrote:
> On Mon, 10 Jun 2013, Michal Hocko wrote:
[...]
> > Your only objection against userspace handling so far was that admin
> > has no control over sub-hierarchies. But that is hardly a problem. A
> > periodic tree scan would solve this easily (just to make sure we are on
> > the same page - this is still the global watchdog living in a trusted
> > root cgroup which is marked as unkillable for the global oom).
>
> There's no way a "periodic tree scan" would solve this easily, it's
> completely racy and leaves the possibility of missing oom events when they
> happen in a new memcg before the next tree scan.

But the objective is to handle oom deadlocks gracefully and you cannot possibly miss those as they are, well, _deadlocks_. So while the watchdog might really miss some oom events which have already been handled, that is absolutely not important here. Please note that the eventfd is notified also during registration if the group is under oom already. So you cannot miss a deadlocked group.

> Not only that, but what happens when the entire machine is oom?

Then a memcg deadlock is irrelevant, isn't it? Deadlocked tasks are sitting in the KILLABLE queue so they can be killed. And you can protect the watchdog from being killed.

> The global watchdog can't allocate any memory to handle the event, yet
> it is supposed to always be doing entire memcg tree scans, registering
> oom handlers, and handling wakeups when notified?

So what? The global OOM will find a victim and life goes on. I do not think that the watchdog's consumption would be anything but negligible (a simple queue for timers, eventfd stuff for each existing group, a few threads).

> How does this "global watchdog" know when to stop the timer? In the
> kernel that would happen if the user expands the memcg limit or there's an
> uncharge to the memcg.
Then the group won't be marked under_oom in memory.oom_control at the time the timeout triggers, so the watchdog knows that the oom has been handled already and it can be discarded.

> The global watchdog stores the limit at the time of oom and compares
> it before killing something?

Why would it need to check the limit? It cares only about the oom events.

> How many memcgs do we have to store this value for without allocating
> memory in a global oom condition? You expect us to run with all this
> memory mlocked in memory forever?

No, the only thing that the watchdog has to care about is memory.oom_control.

> How does the global watchdog know that there was an uncharge and then a
> charge in the memcg so it is still oom?

It will get a new event for every new oom.

> This would reset the timer in the kernel but not in userspace and may
> result in unnecessary killing.

A user space watchdog would do the same thing. When it receives an event it will enqueue or requeue the corresponding timer.

> If the timeout is 10s, the memcg is
> oom for 9s, uncharges, recharges, and the global watchdog checks at
> 10s that it is still oom, we don't want the kill because we do get
> uncharge events.
>
> This idea for a global watchdog just doesn't work.

I am not convinced about that. Maybe I am missing some aspect, but all the problems you have mentioned are _solvable_.

[...]

> > David, maybe I am still missing something but it seems to me that the
> > in-kernel timeout is not necessary because the user-space variant is not
> > that hard to implement and I really do not want to add new knobs for
> > something that can live outside.
>
> It's vitally necessary and unless you answer the questions I've asked
> about your proposed "global watchdog" that exists only in your theory then

Which questions are still not answered?

> we'll just continue to have this circular argument. You cannot implement
> a userspace variant that works this way and insisting that you can is
> ridiculous.
> These are problems that real users face and we insist on the
> problem being addressed.

Your use case pushes the interface to an extreme, to be honest. You are proposing a new knob that would have to be supported forever. I feel that we should be careful and prefer a user space solution if there is any.

To be clear, I am not nacking this patch and I will not ack it either. If Johannes, Kamezawa or Andrew are OK with it I will not block it.
--
Michal Hocko
SUSE Labs
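The userspace watchdog Michal describes above (an eventfd event arms or re-arms a per-memcg timer; on expiry the group is acted on only if memory.oom_control still shows under_oom) can be sketched as pure logic. The class name and the injected `read_under_oom`/`now` callbacks are illustrative, not an existing tool; a real daemon would poll eventfds and read cgroupfs:

```python
class OomWatchdog:
    """Sketch of the proposed trusted watchdog. Every oom event (re)arms a
    deadline for that memcg; when a deadline fires, the group is a kill
    candidate only if memory.oom_control still reports under_oom then.
    `read_under_oom` and `now` are injected so the logic is testable."""

    def __init__(self, timeout, read_under_oom, now):
        self.timeout = timeout
        self.read_under_oom = read_under_oom
        self.now = now
        self.deadline = {}  # memcg -> latest armed deadline

    def on_oom_event(self, memcg):
        # A new event requeues the timer, mirroring the in-kernel reset
        # that an uncharge + recharge would cause.
        self.deadline[memcg] = self.now() + self.timeout

    def expire(self):
        """Return the memcgs whose grace period has run out and which are
        still under oom, i.e. the candidates for intervention."""
        victims = []
        for memcg, dl in list(self.deadline.items()):
            if self.now() >= dl:
                del self.deadline[memcg]
                if self.read_under_oom(memcg):
                    victims.append(memcg)
        return victims
```

The under_oom check at expiry is what addresses David's "uncharges, recharges at 9s" scenario: a group that resolved its oom before the deadline simply isn't under_oom when the timer fires.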
Re: [patch] mm, memcg: add oom killer delay
On Mon, 10 Jun 2013, Michal Hocko wrote:
> > > > I don't think you yet understand the problem, which is probably my
> > > > fault. I'm not insisting this be implemented in the kernel, I'm
> > > > saying it's not possible to do it in userspace.
> > >
> > > Because you still insist on having this fallback mode running inside
> > > untrusted environment AFAIU.
> >
> > -ENOPARSE.
>
> Either the global watchdog is running as a trusted code and you _can_
> implement it without deadlocks (this is what I claim and I haven't heard
> strong arguments against that) or even the watchdog runs as an untrusted
> code and then I would ask why.

As I already said, no global entity could possibly know and monitor the entire set of memcgs when users are given the power to create their own child memcgs within their tree. Not only would it be unnecessarily expensive to scan the entire memcg tree all the time, but it would also be racy and miss oom events when a child memcg is created and hits its limit before the global entity can scan for it.

> Your only objection against userspace handling so far was that admin
> has no control over sub-hierarchies. But that is hardly a problem. A
> periodic tree scan would solve this easily (just to make sure we are on
> the same page - this is still the global watchdog living in a trusted
> root cgroup which is marked as unkillable for the global oom).

There's no way a "periodic tree scan" would solve this easily, it's completely racy and leaves the possibility of missing oom events when they happen in a new memcg before the next tree scan. Not only that, but what happens when the entire machine is oom? The global watchdog can't allocate any memory to handle the event, yet it is supposed to always be doing entire memcg tree scans, registering oom handlers, and handling wakeups when notified? How does this "global watchdog" know when to stop the timer?
In the kernel that would happen if the user expands the memcg limit or there's an uncharge to the memcg. The global watchdog stores the limit at the time of oom and compares it before killing something? How many memcgs do we have to store this value for without allocating memory in a global oom condition? You expect us to run with all this memory mlocked in memory forever? How does the global watchdog know that there was an uncharge and then a charge in the memcg so it is still oom? This would reset the timer in the kernel but not in userspace and may result in unnecessary killing. If the timeout is 10s, the memcg is oom for 9s, uncharges, recharges, and the global watchdog checks at 10s that it is still oom, we don't want the kill because we do get uncharge events. This idea for a global watchdog just doesn't work.

> > Meanwhile, the trusted resource has no knowledge whatsoever of these
> > user subcontainers and it can't infinitely scan the memcg tree to find
> > them because that requires memory that may not be available because of
> > global oom or because of slab accounting.
>
> Global oom can be handled by oom_adj for the global watchdog and
> slab accounting should be a non-issue as the limit cannot be set for the
> root cgroup - or the watchdog can live in an unlimited group.

I'm referring to a global oom condition where physical RAM is exhausted, not anything to do with memcg.

> Besides that. How much memory are we talking about here (per
> memcg)? Have you measured that? Is it possible that your untrusted
> users could cause a DoS by creating too many groups? I would be really
> surprised, and that would be an argument about the global watchdog
> quality then.

We limit the number of css_id in children but not the top level of memcgs that are created by the trusted resource.
> David, maybe I am still missing something but it seems to me that the
> in-kernel timeout is not necessary because the user-space variant is not
> that hard to implement and I really do not want to add new knobs for
> something that can live outside.

It's vitally necessary, and unless you answer the questions I've asked about your proposed "global watchdog" that exists only in your theory, we'll just continue to have this circular argument. You cannot implement a userspace variant that works this way and insisting that you can is ridiculous. These are problems that real users face and we insist on the problem being addressed.
Re: [patch] mm, memcg: add oom killer delay
On Wed 05-06-13 17:09:17, David Rientjes wrote: > On Wed, 5 Jun 2013, Michal Hocko wrote: > > > > For the aforementioned reason that we give users the ability to > > > manipulate > > > their own memcg trees and userspace is untrusted. Their oom notifiers > > > cannot be run as root, not only because of security but also because it > > > would not properly isolate their memory usage to their memcg tree. > > > > Yes, but nothing prevents an admin - I hope you trust at least this > > entity - to do the global watchdog for the fallback mode. So users can > > play as they like but if they are not able to cope with the oom > > situation for the defined timeout then the global (trusted and running > > in the root memcg) watchdog re-enables in-kernel oom handler. > > > > Users have the full ability to manipulate their own memcg hierarchy > under the root, the global entity that schedules these jobs is not aware > of any user subcontainers that are created beneath the user root. These > user subcontainers may be oom and our desire is for the user to be able to > have their own userspace handling implementation at a higher level (or > with memcg memory reserves). Userspace is untrusted, they can't be > expected to register an oom notifier for a child memcg with a global > resource, they may not care that they deadlock and leave behind gigabytes > of memory that can't be freed if they oom. And, if that userspace global > resource dies or becomes unresponsive itself for whatever reason, all oom > memcgs deadlock. > > > > I don't think you yet understand the problem, which is probably my fault. > > > > > > I'm not insisting this be implemented in the kernel, I'm saying it's not > > > possible to do it in userspace. > > > > Because you still insist on having this fallback mode running inside > > untrusted environment AFAIU. > > > > -ENOPARSE. 
Either the global watchdog is running as trusted code and you _can_ implement it without deadlocks (this is what I claim and I haven't heard strong arguments against that), or the watchdog runs as untrusted code and then I would ask why.

> The failsafe is the kernel, it ensures that memcgs don't sit
> completely deadlocked for days and weeks and take up resources that
> can never be freed.

Yes, I _understand_ what you are proposing. Your only objection against userspace handling so far was that admin has no control over sub-hierarchies. But that is hardly a problem. A periodic tree scan would solve this easily (just to make sure we are on the same page - this is still the global watchdog living in a trusted root cgroup which is marked as unkillable for the global oom).

> The entire purpose of userspace oom notification is so
> that users can implement their own policy, whatever is implemented in the
> kernel may not apply (they may want to kill the largest process, the
> newest, the youngest, one on a priority-based scale, etc).
>
> > > This is the result of memcg allowing users to disable the oom killer
> > > entirely for a memcg, which is still ridiculous, because if the user
> > > doesn't respond then you've wasted all that memory and cannot get it
> > > back without admin intervention or a reboot. There are no other
> > > "features" in the kernel that put such a responsibility on a userspace
> > > process such that if it doesn't work then the entire memcg deadlocks
> > > forever without admin intervention. We need a failsafe in the kernel.
> >
> > But the memcg would deadlock within constraints assigned by somebody
> > trusted. So while the memory is basically wasted, the limit assignment
> > already says that somebody (trusted) dedicated that much memory. So I
> > think disabling oom forever is not that ridiculous.
>
> I don't understand what you're trying to say.
> Yes, a trusted resource sets the user root's limits and that is their
> allotted use. To implement a sane userspace oom handler, we need to give
> it time to respond; my solution is memory.oom_delay_millisecs, your
> solution is disabling the oom killer for that memcg.

No! My solution is to let (trusted) userspace handle the oom_delay_millisecs fallback if user space is not able to handle the oom in time.

> Anything else results in an instant oom kill from the kernel. If the
> user has their own implementation, with today's kernel it is required
> to disable the oom killer entirely and nothing in that untrusted
> environment is ever guaranteed to re-enable the oom killer or even
> have the memory to do so if it wanted.
> Meanwhile, the trusted resource has no knowledge whatsoever of these
> user subcontainers and it can't infinitely scan the memcg tree to find
> them because that requires memory that may not be available because of
> global oom or because of slab accounting.

Global oom can be handled by oom_adj for the global watchdog and slab accounting should be a non-issue as the limit cannot be set for the root cgroup - or the watchdog can live in an unlimited group.
Re: [patch] mm, memcg: add oom killer delay
On Wed, 5 Jun 2013, Michal Hocko wrote:
> > For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree.
>
> Yes, but nothing prevents an admin - I hope you trust at least this entity - to do the global watchdog for the fallback mode. So users can play as they like but if they are not able to cope with the oom situation for the defined timeout then the global (trusted and running in the root memcg) watchdog re-enables the in-kernel oom handler.

Users have the full ability to manipulate their own memcg hierarchy under the root; the global entity that schedules these jobs is not aware of any user subcontainers that are created beneath the user root. These user subcontainers may be oom and our desire is for the user to be able to have their own userspace handling implementation at a higher level (or with memcg memory reserves). Userspace is untrusted; users can't be expected to register an oom notifier for a child memcg with a global resource, and they may not care that they deadlock and leave behind gigabytes of memory that can't be freed if they oom. And, if that userspace global resource dies or becomes unresponsive itself for whatever reason, all oom memcgs deadlock.

> > I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace.
>
> Because you still insist on having this fallback mode running inside an untrusted environment AFAIU.

-ENOPARSE. The failsafe is the kernel; it ensures that memcgs don't sit completely deadlocked for days and weeks and take up resources that can never be freed.

The entire purpose of userspace oom notification is so that users can implement their own policy; whatever is implemented in the kernel may not apply (they may want to kill the largest process, the newest, the youngest, one on a priority-based scale, etc).

> > This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other "features" in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel.
>
> But the memcg would deadlock within constraints assigned by somebody trusted. So while the memory is basically wasted, the limit assignment already says that somebody (trusted) dedicated that much memory. So I think disabling oom forever is not that ridiculous.

I don't understand what you're trying to say. Yes, a trusted resource sets the user root's limits and that is their allotted use. To implement a sane userspace oom handler, we need to give it time to respond; my solution is memory.oom_delay_millisecs, your solution is disabling the oom killer for that memcg. Anything else results in an instant oom kill from the kernel. If the user has their own implementation, with today's kernel it is required to disable the oom killer entirely and nothing in that untrusted environment is ever guaranteed to re-enable the oom killer or even have the memory to do so if it wanted. Meanwhile, the trusted resource has no knowledge whatsoever of these user subcontainers and it can't infinitely scan the memcg tree to find them because that requires memory that may not be available because of global oom or because of slab accounting.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 14:48:52, Johannes Weiner wrote:
> On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote:
[...]
> > Now that I am looking at this again I've realized that this is not correct. The task which triggers memcg OOM will not have memcg_oom.memcg set so it would trigger a global OOM in pagefault_out_of_memory. Either we should return CHARGE_RETRY (and propagate it via mem_cgroup_do_charge) for need_to_kill or set up current->memcg_oom also for need_to_kill.
>
> You are absolutely right! I fixed it with a separate bit, like so:
>
> +	struct memcg_oom_info {
> +		unsigned int in_userfault:1;
> +		unsigned int in_memcg_oom:1;
> +		int wakeups;
> +		struct mem_cgroup *wait_on_memcg;
> +	} memcg_oom;
>
> in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg OOM and that it should always return true to not invoke the global killer. But wait_on_memcg might or might not be set, depending on whether the task needs to sleep.
>
> So the task invoking the memcg OOM killer won't sleep but retry the fault instead (everybody else might be sleeping and the first OOM killer invocation might not have freed anything).

OK, I was about to say that you should just return CHARGE_RETRY, but that doesn't solve the locking issue: the task which called mem_cgroup_out_of_memory still holds a lock which might prevent the victim from terminating. Maybe worth a note in the changelog?

[...]
> > > @@ -3760,22 +3761,14 @@ unlock:
> > >  /*
> > >   * By the time we get here, we already hold the mm semaphore
> > >   */
> > > -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > -		unsigned long address, unsigned int flags)
> > > +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > +		unsigned long address, unsigned int flags)
> >
> > Is this reusable? Who would call this helper or is it just for code readability? I would probably prefer a smaller patch but I do not have a strong opinion on this.
>
> I just figured I'd move all the task-specific setup into the outer function and leave this inner one with the actual page table stuff.
>
> The function is not called from somewhere else, but I don't see a problem with splitting it into logical parts and having the individual parts more readable. The alternative would have been to make every return in handle_mm_fault() a { ret = -EFOO; goto out; }, and that would have been harder to follow.

OK

[...]
> From: Johannes Weiner
> Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context
>
> The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock.
>
> Do two things:
>
> 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but just return -ENOMEM, never sleep. Userspace should be able to handle this.
>
> 2. When OOMing in a page fault and somebody else is handling the situation, do not sleep directly in the charging code. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack first before going to sleep.
>
> While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups.
>
> Signed-off-by: Johannes Weiner

OK, this one looks good. The bitfields are not the nicest piece of art but this is just a minor thing. With the remaining archs moved to pagefault_out_of_memory you can add

Reviewed-by: Michal Hocko

Thanks!

> ---
>  include/linux/memcontrol.h |  22 +++
>  include/linux/mm.h         |   1 +
>  include/linux/sched.h      |   6 ++
>  mm/ksm.c                   |   2 +-
>  mm/memcontrol.c            | 146 -
>  mm/memory.c                |  40 +
>  mm/oom_kill.c              |   7 ++-
>  7 files changed, 154 insertions(+), 70 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..b8ec9d1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -121,6 +121,15 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
>  void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
>  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  				      struct task_struct *p);
> +static inline void mem_cgroup_set_userfault(struct task_struct *p)
> +{
> +	p->memcg_oom.in_userfault = 1;
> +}
> +static inline void mem_cgroup_clear_userfault(struct task_struct *p)
> +{
> +
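For readers following the changelog above, the two cases it describes can be summarized in pseudocode. This is my paraphrase of the description, not the actual patch:

```text
charge attempt fails, memcg is OOM:
    if in a system call context:
        invoke the OOM killer, return -ENOMEM      # never sleep; caller copes
    if in a page fault:
        if nobody else is handling this OOM:
            invoke the OOM killer, retry the fault  # first invocation
        else:
            current->memcg_oom.wait_on_memcg = memcg
            unwind the page fault stack             # drop mm/fs locks first
            sleep in pagefault_out_of_memory()
            retry the fault on wakeup
```

The key property is that no task ever sleeps on the OOM waitqueue while still holding the locks the OOM victim may need to exit.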
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 23:40:16, David Rientjes wrote:
> On Tue, 4 Jun 2013, Michal Hocko wrote:
> > > I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file.
> >
> > Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler.
>
> This isn't only about oom notifiers that reside in the same oom memcg; they can be at a higher level or a different subtree entirely. The point is that they can be unresponsive either because userspace is flaky, it's oom itself, or we do track all slab. The kernel is the only thing in the position to fix the issue after a sensible user-defined amount of time has elapsed, and that's what this patch is.
>
> > OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handler implementations you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have their own handlers - all of them in your case I guess) and trigger a (user defined/global) timeout when it gets a notification. If the group was under oom for the whole timeout then just disable oom_control until the oom is settled (under_oom is 0).
> >
> > Why wouldn't something like this work for your use case?
>
> For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree.

Yes, but nothing prevents an admin - I hope you trust at least this entity - to do the global watchdog for the fallback mode. So users can play as they like but if they are not able to cope with the oom situation for the defined timeout then the global (trusted and running in the root memcg) watchdog re-enables the in-kernel oom handler. Maybe I just do not see why this setup is a problem for you.

> > Hohmm, so you are insisting on something that can be implemented in the userspace and put it into the kernel because it is more convenient for you and your use case. This doesn't sound like a way for accepting a feature.
>
> I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace.

Because you still insist on having this fallback mode running inside an untrusted environment AFAIU.

> Your idea of a timeout implemented in userspace doesn't work in practice: userspace is both untrusted and cannot be guaranteed to be perfect and always wake up, get the information it needs according to its implementation, and issue a SIGKILL.
>
> This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other "features" in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel.

But the memcg would deadlock within constraints assigned by somebody trusted. So while the memory is basically wasted, the limit assignment already says that somebody (trusted) dedicated that much memory. So I think disabling oom forever is not that ridiculous.

> Real users like this cannot run as root, and we cannot run in the root memcg without charging that memory usage to the user's container for that share of a global resource. Real users do have to tolerate buggy and flaky userspace implementations that cannot be guaranteed to run or do what they are supposed to do. It's too costly of a problem to not address with a failsafe. I speak strictly from experience on this.
>
> > And yes we should make memcg oom handling less deadlock prone and Johannes' work in this thread is a good step forward.
>
> The long-term solution to that, which I already have patches for, is something you would cringe even more at: memcg memory reserves that are shared with per-zone memory reserves that get the global oom killer to kill off that process without notification in the case the memcg memory reserves cause a global oom.

I would have to see those patches.

-- 
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Tue, 4 Jun 2013, Michal Hocko wrote:
> > I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file.
>
> Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler.

This isn't only about oom notifiers that reside in the same oom memcg; they can be at a higher level or a different subtree entirely. The point is that they can be unresponsive either because userspace is flaky, it's oom itself, or we do track all slab. The kernel is the only thing in the position to fix the issue after a sensible user-defined amount of time has elapsed, and that's what this patch is.

> OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handler implementations you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have their own handlers - all of them in your case I guess) and trigger a (user defined/global) timeout when it gets a notification. If the group was under oom for the whole timeout then just disable oom_control until the oom is settled (under_oom is 0).
>
> Why wouldn't something like this work for your use case?

For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree.

> Hohmm, so you are insisting on something that can be implemented in the userspace and put it into the kernel because it is more convenient for you and your use case. This doesn't sound like a way for accepting a feature.

I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Your idea of a timeout implemented in userspace doesn't work in practice: userspace is both untrusted and cannot be guaranteed to be perfect and always wake up, get the information it needs according to its implementation, and issue a SIGKILL.

This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other "features" in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel.

Real users like this cannot run as root, and we cannot run in the root memcg without charging that memory usage to the user's container for that share of a global resource. Real users do have to tolerate buggy and flaky userspace implementations that cannot be guaranteed to run or do what they are supposed to do. It's too costly of a problem to not address with a failsafe. I speak strictly from experience on this.

> And yes we should make memcg oom handling less deadlock prone and Johannes' work in this thread is a good step forward.

The long-term solution to that, which I already have patches for, is something you would cringe even more at: memcg memory reserves that are shared with per-zone memory reserves that get the global oom killer to kill off that process without notification in the case the memcg memory reserves cause a global oom.
Re: [patch] mm, memcg: add oom killer delay
On Tue, 4 Jun 2013, Michal Hocko wrote: I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file. Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler. This isn't only about oom notifiers that reside in the same oom memcg, they can be at a higher level or a different subtree entirely. The point is that they can be unresponsive either because userspace is flaky, its oom itself, or we do track all slab. The kernel is the only thing in the position to fix the issue after a sensible user-defined amount of time has elapsed, and that's what this patch is. OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handlers implementation you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have own handlers - all of them in your case I guess) and trigger (user defined/global) timeout when it gets a notification. If the group was under oom always during the timeout then just disable oom_control until oom is settled (under_oom is 0). Why wouldn't something like this work for your use case? For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. Hohmm, so you are insisting on something that can be implemented in the userspace and put it into the kernel because it is more convenient for you and your use case. This doesn't sound like a way for accepting a feature. I don't think you yet understand the problem, which is probably my fault. 
I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Your idea of a timeout implemented in userspace doesn't work in practice: userspace is both untrusted and cannot be guaranteed to be perfect and always wakeup, get the information it does according to its implementation, and issue a SIGKILL. This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other features in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel. Real users like this cannot run as root, and we cannot run in the root memcg without charging that memory usage to the user's container for that share of a global resource. Real users do have to tollerate buggy and flaky userspace implementations that cannot be guaranteed to run or do what they are supposed to do. It's too costly of a problem to not address with a failsafe. I speak strictly from experience on this. And yes we should make memcg oom handling less deadlock prone and Johannes' work in this thread is a good step forward. The long-term solution to that, which I already have patches for, is something you would cringe even more at: memcg memory reserves that are shared with per-zone memory reserves that get the global oom killer to kill off that process without notification in the case the memcg memory reserves cause a global oom. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 23:40:16, David Rientjes wrote: On Tue, 4 Jun 2013, Michal Hocko wrote: I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file. Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler. This isn't only about oom notifiers that reside in the same oom memcg, they can be at a higher level or a different subtree entirely. The point is that they can be unresponsive either because userspace is flaky, its oom itself, or we do track all slab. The kernel is the only thing in the position to fix the issue after a sensible user-defined amount of time has elapsed, and that's what this patch is. OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handlers implementation you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have own handlers - all of them in your case I guess) and trigger (user defined/global) timeout when it gets a notification. If the group was under oom always during the timeout then just disable oom_control until oom is settled (under_oom is 0). Why wouldn't something like this work for your use case? For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. Yes, but nothing prevents an admin - I hope you trust at least this entity - to do the global watchdog for the fallback mode. 
So users can play as they like but if they are not able to cope with the oom situation for the defined timeout then the global (trusted and running in the root memcg) watchdog re-enables in-kernel oom handler. Maybe I just do not see why is this setup problem for your setup. Hohmm, so you are insisting on something that can be implemented in the userspace and put it into the kernel because it is more convenient for you and your use case. This doesn't sound like a way for accepting a feature. I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Because you still insist on having this fallback mode running inside untrusted environment AFAIU. Your idea of a timeout implemented in userspace doesn't work in practice: userspace is both untrusted and cannot be guaranteed to be perfect and always wakeup, get the information it does according to its implementation, and issue a SIGKILL. This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other features in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel. But the memcg would deadlock within constrains assigned from somebody trusted. So while the memory is basically wasted the limit assignment already says that somebody (trusted) dedicated that much of memory. So I think disabling oom for ever is not that ridiculous. Real users like this cannot run as root, and we cannot run in the root memcg without charging that memory usage to the user's container for that share of a global resource. 
Real users do have to tollerate buggy and flaky userspace implementations that cannot be guaranteed to run or do what they are supposed to do. It's too costly of a problem to not address with a failsafe. I speak strictly from experience on this. And yes we should make memcg oom handling less deadlock prone and Johannes' work in this thread is a good step forward. The long-term solution to that, which I already have patches for, is something you would cringe even more at: memcg memory reserves that are shared with per-zone memory reserves that get the global oom killer to kill off that process without notification in the case the memcg memory reserves cause a global oom. I would have to see those patches. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 14:48:52, Johannes Weiner wrote: On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: [...] Now that I am looking at this again I've realized that this is not correct. The task which triggers memcg OOM will not have memcg_oom.memcg set so it would trigger a global OOM in pagefault_out_of_memory. Either we should return CHARGE_RETRY (and propagate it via mem_cgroup_do_charge) for need_to_kill or set up current-memcg_oom also for need_to_kill. You are absolutely right! I fixed it with a separate bit, like so: + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg OOM and that it should always return true to not invoke the global killer. But wait_on_memcg might or might not be set, depending if the task needs to sleep. So the task invoking the memcg OOM killer won't sleep but retry the fault instead (everybody else might be sleeping and the first OOM killer invocation might not have freed anything). OK, I was about to say that you should just return CHARGE_RETRY but that doesn't solve the locking issue because although the task which called mem_cgroup_out_of_memory still holds lock which might prevent the victim from terminating. Maybe worth a note in the changelog? [...] @@ -3760,22 +3761,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) Is this reusable? Who would call this helper or is it just for the code readability? I would probably prefer a smaller patch but I do not have a strong opinion on this. 
I just figured I'd move all the task-specific setup into the outer function and leave this inner one with the actual page table stuff. The function is not called from somewhere else, but I don't see a problem with splitting it into logical parts and have the individual parts more readable. The alternative would have been to make every return in handle_mm_fault() a { ret = -EFOO; goto out }, and that would have been harder to follow. OK [...] From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but just return -ENOMEM, never sleep. Userspace should be able to handle this. 2. When OOMing in a page fault and somebody else is handling the situation, do not sleep directly in the charging code. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack first before going to sleep. While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. Signed-off-by: Johannes Weiner han...@cmpxchg.org OK this one looks good. Bitfields is not the nicests piece of art but this is just a minor thing. With the remaining archs moved to pagefault_out_of_memory you can add Reviewed-by: Michal Hocko mho...@suse.cz Thanks! 
--- include/linux/memcontrol.h | 22 +++ include/linux/mm.h | 1 + include/linux/sched.h | 6 ++ mm/ksm.c | 2 +- mm/memcontrol.c| 146 - mm/memory.c| 40 + mm/oom_kill.c | 7 ++- 7 files changed, 154 insertions(+), 70 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d6183f0..b8ec9d1 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -121,6 +121,15 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ + p-memcg_oom.in_userfault = 1; +} +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ +
Re: [patch] mm, memcg: add oom killer delay
On Wed, 5 Jun 2013, Michal Hocko wrote: For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. Yes, but nothing prevents an admin - I hope you trust at least this entity - to do the global watchdog for the fallback mode. So users can play as they like but if they are not able to cope with the oom situation for the defined timeout then the global (trusted and running in the root memcg) watchdog re-enables in-kernel oom handler. Users have the full ability to manipulate their own memcg hierarchy under the root, the global entity that schedules these jobs is not aware of any user subcontainers that are created beneath the user root. These user subcontainers may be oom and our desire is for the user to be able to have their own userspace handling implementation at a higher level (or with memcg memory reserves). Userspace is untrusted, they can't be expected to register an oom notifier for a child memcg with a global resource, they may not care that they deadlock and leave behind gigabytes of memory that can't be freed if they oom. And, if that userspace global resource dies or becomes unresponsive itself for whatever reason, all oom memcgs deadlock. I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Because you still insist on having this fallback mode running inside untrusted environment AFAIU. -ENOPARSE. The failsafe is the kernel, it ensures that memcgs don't sit completely deadlocked for days and weeks and take up resources that can never be freed. 
The entire purpose of userspace oom notification is so that users can implement their own policy; whatever is implemented in the kernel may not apply (they may want to kill the largest process, the newest, the youngest, one on a priority-based scale, etc). This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other features in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel.

But the memcg would deadlock within constraints assigned by somebody trusted. So while the memory is basically wasted, the limit assignment already says that somebody (trusted) dedicated that much memory. So I think disabling oom forever is not that ridiculous.

I don't understand what you're trying to say. Yes, a trusted resource sets the user root's limits and that is their allotted use. To implement a sane userspace oom handler, we need to give it time to respond; my solution is memory.oom_delay_millisecs, your solution is disabling the oom killer for that memcg. Anything else results in an instant oom kill from the kernel. If the user has their own implementation, with today's kernel it is required to disable the oom killer entirely, and nothing in that untrusted environment is ever guaranteed to re-enable the oom killer or even have the memory to do so if it wanted. Meanwhile, the trusted resource has no knowledge whatsoever of these user subcontainers, and it can't infinitely scan the memcg tree to find them because that requires memory that may not be available because of global oom or because of slab accounting.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 14:48:52, Johannes Weiner wrote: > On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: [...] > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 6dc1882..ff5e2d7 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, > > > struct mm_struct *mm, > > > while (!(page = follow_page_mask(vma, start, > > > foll_flags, &page_mask))) { > > > int ret; > > > - unsigned int fault_flags = 0; > > > + unsigned int fault_flags = FAULT_FLAG_KERNEL; > > > > > > /* For mlock, just skip the stack guard page. */ > > > if (foll_flags & FOLL_MLOCK) { > > > > This is also a bit tricky. Say there is an unlikely situation when a > > task fails to charge because of memcg OOM, it couldn't lock the oom > > so it ended up with current->memcg_oom set and __get_user_pages will > > turn VM_FAULT_OOM into ENOMEM but memcg_oom is still there. Then the > > following global OOM condition gets confused (well the oom will be > > triggered by somebody else so it shouldn't end up in the endless loop > > but still...), doesn't it? > > But current->memcg_oom is not set up unless current->in_userfault. > And get_user_pages does not set this flag.

And my selective blindness strikes again :/ For some reason I have read those places as if they enable the fault flag. Which would make some sense if there was a post handling... Anyway, I will get back to the updated patch tomorrow with a clean and fresh head. -- Michal Hocko SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: > On Mon 03-06-13 14:30:18, Johannes Weiner wrote: > > On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: > > > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: > [...] > > > > I am just afraid about all the other archs that do not support (from > > > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > > > > openrisc, score and tile). What would be an alternative for them? > > > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > > > > would be acceptable for me. > > > > > > blackfin is NOMMU but I guess the others should be converted to the > > > proper OOM protocol anyway and not just kill the faulting task. I can > > > update them in the next version of the patch (series). > > > > It's no longer necessary since I remove the arch-specific flag > > setting, but I converted them anyway while I was at it. Will send > > them as a separate patch. > > I am still not sure doing this unconditionally is the right way. Say > that a new arch will be added. How the poor implementer knows that memcg > oom handling requires an arch specific code to work properly? It doesn't. All that's required is that it follows the generic OOM killer protocol and calls pagefault_out_of_memory(). Killing the faulting task right there in the page fault handler is simply a bug and needs to be fixed so I don't think that any extra care from our side is necessary. > > @@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup > > *memcg) > > } > > > > /* > > - * try to call OOM killer. returns false if we should exit memory-reclaim > > loop. 
> > + * try to call OOM killer > > */ > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > > - int order) > > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > { > > - struct oom_wait_info owait; > > - bool locked, need_to_kill; > > - > > - owait.memcg = memcg; > > - owait.wait.flags = 0; > > - owait.wait.func = memcg_oom_wake_function; > > - owait.wait.private = current; > > - INIT_LIST_HEAD(&owait.wait.task_list); > > - need_to_kill = true; > > - mem_cgroup_mark_under_oom(memcg); > > + bool locked, need_to_kill = true; > > > > /* At first, try to OOM lock hierarchy under memcg.*/ > > spin_lock(&memcg_oom_lock); > > locked = mem_cgroup_oom_lock(memcg); > > - /* > > -* Even if signal_pending(), we can't quit charge() loop without > > -* accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL > > -* under OOM is always welcomed, use TASK_KILLABLE here. > > -*/ > > - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > - if (!locked || memcg->oom_kill_disable) + if (!locked || memcg->oom_kill_disable) { > > need_to_kill = false; > > + if (current->in_userfault) { > > + /* > > +* We start sleeping on the OOM waitqueue only > > +* after unwinding the page fault stack, so > > +* make sure we detect wakeups that happen > > +* between now and then. > > +*/ > > + mem_cgroup_mark_under_oom(memcg); > > + current->memcg_oom.wakeups = > > + atomic_read(&memcg->oom_wakeups); > > + css_get(&memcg->css); > > + current->memcg_oom.memcg = memcg; > > + } > > + } if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(&memcg_oom_lock); > > - if (need_to_kill) { > > - finish_wait(&memcg_oom_waitq, &owait.wait); + if (need_to_kill) mem_cgroup_out_of_memory(memcg, mask, order);

Now that I am looking at this again I've realized that this is not correct. The task which triggers memcg OOM will not have memcg_oom.memcg set so it would trigger a global OOM in pagefault_out_of_memory.
Either we should return CHARGE_RETRY (and > propagate it via mem_cgroup_do_charge) for need_to_kill or set up > current->memcg_oom also for need_to_kill. You are absolutely right! I fixed it with a separate bit, like so: + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg OOM and that it should always return true to not invoke the global killer. But wait_on_memcg might or might not be set, depending if the task needs to sleep. So the task invoking the memcg OOM killer won't sleep but retry the fault instead (everybody else might be sleeping and the first OOM killer invocation might not have freed anything). > > diff --git a/mm/memory.c b/mm/memory.c > > index 6dc1882..ff5e2d7 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@
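The resulting control flow can be summarized as a toy decision table (plain Python, not kernel code; the dict keys mirror the memcg_oom_info fields discussed above, and the string return values are only labels for the three outcomes):

```python
def oom_synchronize(task):
    """Toy model of what pagefault_out_of_memory decides under the
    scheme above. 'task' is a dict standing in for task_struct."""
    if not task["in_memcg_oom"]:
        # Not a memcg OOM at all: fall through to the global OOM killer.
        return "global oom"
    if task["wait_on_memcg"] is None:
        # This task invoked the memcg OOM killer itself; it does not
        # sleep, it just retries the fault and re-attempts the charge.
        return "retry fault"
    # Somebody else holds the OOM lock (or oom_kill_disable is set):
    # sleep on the OOM waitqueue, then retry the fault on wakeup.
    return "sleep, then retry fault"
```

The separate in_memcg_oom bit is what keeps the first case and the second apart: without it, the OOM-killer-invoking task would look like a global OOM to pagefault_out_of_memory.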
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 14:17:54, David Rientjes wrote: > On Mon, 3 Jun 2013, Michal Hocko wrote: > > > > What do you suggest when you read the "tasks" file and it returns -ENOMEM > > > because kmalloc() fails because the userspace oom handler's memcg is also > > > oom? > > > > That would require that you track kernel allocations which is currently > > done only for explicit caches. > > > > That will not always be the case, and I think this could be a prerequisite > patch for such support that we have internally. > I'm not sure a userspace oom notifier would want to keep a > preallocated buffer around that is mlocked in memory for all possible > lengths of this file. Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler. > > > Obviously it's not a situation we want to get into, but unless you > > > know that handler's exact memory usage across multiple versions, nothing > > > else is sharing that memcg, and it's a perfect implementation, you can't > > > guarantee it. We need to address real world problems that occur in > > > practice. > > > > If you really need to have such a guarantee then you can have a _global_ > > watchdog observing oom_control of all groups that provide such a vague > > requirements for oom user handlers. > > > > The whole point is to allow the user to implement their own oom policy. OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handlers implementation you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have own handlers - all of them in your case I guess) and trigger (user defined/global) timeout when it gets a notification. 
If the group stayed under oom for the whole timeout, then just disable oom_control until the oom is settled (under_oom is 0). Why wouldn't something like this work for your use case?

> If the policy was completely encapsulated in kernel code, we don't need to > ever disable the oom killer even with memory.oom_control. Users may > choose to kill the largest process, the newest process, the oldest > process, sacrifice children instead of parents, prevent forkbombs, > implement their own priority scoring (which is what we do), kill the > allocating task, etc. > > To not merge this patch, I'd ask that you show an alternative that allows > users to implement their own userspace oom handlers and not require admin > intervention when things go wrong.

Hohmm, so you are insisting on something that can be implemented in userspace and putting it into the kernel because it is more convenient for you and your use case. That doesn't sound like a good argument for accepting a feature. To make this absolutely clear: I do understand your requirements, but you haven't shown any _argument_ why the timeout you are proposing cannot be implemented in userspace. I will not ack this without that reasoning. And yes, we should make memcg oom handling less deadlock prone, and Johannes' work in this thread is a good step forward. -- Michal Hocko SUSE Labs
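The watchdog Michal sketches needs nothing beyond the existing v1 knobs: memory.oom_control reads back the oom_kill_disable and under_oom fields, and writing 0 to it re-enables the in-kernel killer. A rough sketch of one watchdog pass (the polling interval, timeout handling, and paths are illustrative assumptions):

```python
import time

def parse_oom_control(text):
    # memory.oom_control reads back as, e.g.:
    #   oom_kill_disable 1
    #   under_oom 1
    fields = dict(line.split() for line in text.splitlines())
    return int(fields["oom_kill_disable"]), int(fields["under_oom"])

def watchdog_pass(memcg_path, timeout_s, now=time.monotonic):
    """One pass of the trusted global watchdog: if a group with a
    disabled OOM killer stays under_oom for the whole timeout, hand
    the problem back to the kernel. Assumes root privileges and a
    v1 memcg at memcg_path."""
    deadline = now() + timeout_s
    while now() < deadline:
        with open(memcg_path + "/memory.oom_control") as f:
            disabled, under_oom = parse_oom_control(f.read())
        if not (disabled and under_oom):
            return False        # userspace handler coped; nothing to do
        time.sleep(1)
    with open(memcg_path + "/memory.oom_control", "w") as f:
        f.write("0")            # re-enable the in-kernel OOM killer
    return True
```

Because this process runs in the root memcg, it is not subject to the deadlocked group's limit, which is the whole point of the proposal.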
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 14:30:18, Johannes Weiner wrote: > On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: > > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: [...] > > > I am just afraid about all the other archs that do not support (from > > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > > > openrisc, score and tile). What would be an alternative for them? > > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > > > would be acceptable for me. > > > > blackfin is NOMMU but I guess the others should be converted to the > > proper OOM protocol anyway and not just kill the faulting task. I can > > update them in the next version of the patch (series). > > It's no longer necessary since I remove the arch-specific flag > setting, but I converted them anyway while I was at it. Will send > them as a separate patch.

I am still not sure doing this unconditionally is the right way. Say that a new arch is added. How does the poor implementer know that memcg oom handling requires arch-specific code to work properly? So while I obviously do not have anything against your conversion of the other archs that are in the tree currently, I think we need something like CONFIG_OLD_VERSION_MEMCG_OOM which depends on ARCH_HAS_FAULT_OOM_RETRY.

[...] > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > > index e692a02..cf60aef 100644 > > > > --- a/include/linux/sched.h > > > > +++ b/include/linux/sched.h > > > > @@ -1282,6 +1282,8 @@ struct task_struct { > > > > * execve */ > > > > unsigned in_iowait:1; > > > > > > > > + unsigned in_userfault:1; > > > > + > > > > > > [This is more a nit pick but before I forget while I am reading through > > > the rest of the patch.] > > > > > > OK there is a lot of room around those bit fields but as this is only > > > for memcg and you are enlarging the structure by the pointer then you > > > can reuse bottom bit of memcg pointer.
> > > > I just didn't want to put anything in the arch code that looks too > > memcgish, even though it's the only user right now. But granted, it > > will also probably remain the only user for a while. > > I tried a couple of variants, including using the lowest memcg bit, > but it all turned into more ugliness. So that .in_userfault is still > there in v2, but it's now set in handle_mm_fault() in a generic manner > depending on a fault flag, please reconsider if you can live with it. Sure thing. [...] > From: Johannes Weiner > Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context > > The memcg OOM handling is incredibly fragile because once a memcg goes > OOM, one task (kernel or userspace) is responsible for resolving the > situation. Every other task that gets caught trying to charge memory > gets stuck in a waitqueue while potentially holding various filesystem > and mm locks on which the OOM handling task may now deadlock. > > Do two things: > > 1. When OOMing in a system call (buffered IO and friends), invoke the >OOM killer but do not trap other tasks and just return -ENOMEM for >everyone. Userspace should be able to handle this... right? > > 2. When OOMing in a page fault, invoke the OOM killer but do not trap >other chargers directly in the charging code. Instead, remember >the OOMing memcg in the task struct and then fully unwind the page >fault stack first. Then synchronize the memcg OOM from >pagefault_out_of_memory(). > > While reworking the OOM routine, also remove a needless OOM waitqueue > wakeup when invoking the killer. Only uncharges and limit increases, > things that actually change the memory situation, should do wakeups. > > Signed-off-by: Johannes Weiner > --- > include/linux/memcontrol.h | 6 +++ > include/linux/mm.h | 1 + > include/linux/sched.h | 6 +++ > mm/ksm.c | 2 +- > mm/memcontrol.c| 117 > +++-- > mm/memory.c| 40 +++- > mm/oom_kill.c | 7 ++- > 7 files changed, 108 insertions(+), 71 deletions(-) > [...] 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index de22292..97cf32b 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c [...] > @@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup > *memcg) > } > > /* > - * try to call OOM killer. returns false if we should exit memory-reclaim > loop. > + * try to call OOM killer > */ > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > - int order) > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > { > - struct oom_wait_info owait; > - bool locked, need_to_kill; > - > - owait.memcg = memcg; > - owait.wait.flags = 0; > - owait.wait.func = memcg_oom_wake_function; > - owait.wait.private = current; > -
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 12:09:22PM -0700, David Rientjes wrote: > On Mon, 3 Jun 2013, Johannes Weiner wrote: > > > > It's not necessarily harder if you assign the userspace oom handlers to > > > the root of your subtree with access to more memory than the children. > > > There is no "inability" to write a proper handler, but when you have > > > dozens of individual users implementing their own userspace handlers with > > > changing memcg limits over time, then you might find it hard to have > > > perfection every time. If we had perfection, we wouldn't have to worry > > > about oom in the first place. We can't just let these gazillion memcgs > > > sit spinning forever because they get stuck, either. That's why we've > > > used this solution for years as a failsafe. Disabling the oom killer > > > entirely, even for a memcg, is ridiculous, and if you don't have a grace > > > period then oom handlers themselves just don't work. > > > > It's only ridiculous if your OOM handler is subject to the OOM > > situation it's trying to handle. > > > > You're suggesting the oom handler can't be subject to its own memcg > limits, independent of the memcg it is handling? If we demand that such a > handler be attached to the root memcg, that breaks the memory isolation > that memcg provides. We constrain processes to a memcg for the purposes > of that isolation, so it cannot use more resources than allotted. I guess the traditional use case is that you have a job manager that you trust, that sets up the groups, configures their limits etc. and would also know more about the jobs than the kernel to act as the OOM killer as well. Since it's trusted and not expected to consume any significant amounts of memory by itself, the memcg code assumes that it does not run in a cgroup itself, it's just not thought of as being part of the application class. I'm not saying it should necessarily stay that way, but it's also not a completely ridiculous model. 
> > What we could do is allow one task in the group to be the dedicated > > OOM handler. If we catch this task in the charge path during an OOM > > situation, we fall back to the kernel OOM handler. > > > > I'm not sure it even makes sense to have more than one oom handler per > memcg and the synchronization that requires in userspace to get the right > result, so I didn't consider dedicating a single oom handler. That would > be an entirely new interface, though, since we may have multiple processes > waiting on memory.oom_control that aren't necessarily handlers; they grab > a snapshot of memory, do logging, etc. It will probably be hard to extend the existing oom_control interface. But we could add a separate one where you put in a pid or 0 (kernel) instead of a boolean value, which then enables or disables the userspace OOM handling task for the whole subtree. If that task enters the OOM path, the kernel handler is invoked. If the task dies, the kernel handler is permanently re-enabled. Everybody is free to poll that interface for OOM notifications, not only that one task. Combined with the "enter waitqueue after unwinding page fault stack" patch, would this fully cover your usecase? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
> From: Johannes Weiner > Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context > > The memcg OOM handling is incredibly fragile because once a memcg goes > OOM, one task (kernel or userspace) is responsible for resolving the > situation. Every other task that gets caught trying to charge memory > gets stuck in a waitqueue while potentially holding various filesystem > and mm locks on which the OOM handling task may now deadlock. > > Do two things: > > 1. When OOMing in a system call (buffered IO and friends), invoke the >OOM killer but do not trap other tasks and just return -ENOMEM for >everyone. Userspace should be able to handle this... right? > > 2. When OOMing in a page fault, invoke the OOM killer but do not trap >other chargers directly in the charging code. Instead, remember >the OOMing memcg in the task struct and then fully unwind the page >fault stack first. Then synchronize the memcg OOM from >pagefault_out_of_memory(). > > While reworking the OOM routine, also remove a needless OOM waitqueue > wakeup when invoking the killer. Only uncharges and limit increases, > things that actually change the memory situation, should do wakeups. > > Signed-off-by: Johannes Weiner From the memcg oom notification point of view, the case where an oom handler process is subject to its own memcg limit is NOT supported. All memcg developers have clearly agreed on that since the beginning. Even so, people have a right to shoot themselves in the foot. :) However, this patch fixes more than that. OK, I prefer it. Good fix! Acked-by: KOSAKI Motohiro
Re: [patch] mm, memcg: add oom killer delay
On Mon, 3 Jun 2013, Michal Hocko wrote: > > What do you suggest when you read the "tasks" file and it returns -ENOMEM > > because kmalloc() fails because the userspace oom handler's memcg is also > > oom? > > That would require that you track kernel allocations which is currently > done only for explicit caches. > That will not always be the case, and I think this could be a prerequisite patch for such support that we have internally. I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file. > > Obviously it's not a situation we want to get into, but unless you > > know that handler's exact memory usage across multiple versions, nothing > > else is sharing that memcg, and it's a perfect implementation, you can't > > guarantee it. We need to address real world problems that occur in > > practice. > > If you really need to have such a guarantee then you can have a _global_ > watchdog observing oom_control of all groups that provide such a vague > requirements for oom user handlers. > The whole point is to allow the user to implement their own oom policy. If the policy was completely encapsulated in kernel code, we don't need to ever disable the oom killer even with memory.oom_control. Users may choose to kill the largest process, the newest process, the oldest process, sacrifice children instead of parents, prevent forkbombs, implement their own priority scoring (which is what we do), kill the allocating task, etc. To not merge this patch, I'd ask that you show an alternative that allows users to implement their own userspace oom handlers and not require admin intervention when things go wrong.
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 11:18:09, David Rientjes wrote: > On Sat, 1 Jun 2013, Michal Hocko wrote: [...] > > I still do not see why you cannot simply read tasks file into a > > preallocated buffer. This would be few pages even for thousands of pids. > > You do not have to track processes as they come and go. > > > > What do you suggest when you read the "tasks" file and it returns -ENOMEM > because kmalloc() fails because the userspace oom handler's memcg is also > oom? That would require that you track kernel allocations which is currently done only for explicit caches. > Obviously it's not a situation we want to get into, but unless you > know that handler's exact memory usage across multiple versions, nothing > else is sharing that memcg, and it's a perfect implementation, you can't > guarantee it. We need to address real world problems that occur in > practice. If you really need to have such a guarantee then you can have a _global_ watchdog observing oom_control of all groups that provide such vague requirements for oom user handlers. > > As I said before. oom_delay_millisecs is actually really easy to be done > > from userspace. If you really need a safety break then you can register > > such a handler as a fallback. I am not familiar with eventfd internals > > much but I guess that multiple handlers are possible. The fallback might > > be enforced by the admin (when a new group is created) or by the > > container itself. Would something like this work for your use case? > > > > You're suggesting another userspace process that solely waits for a set > duration and then re-enables the oom killer? Yes, one which kicks the oom killer. > It faces all the same problems as the true userspace oom handler: its > own perfect implementation and its own memcg constraints. But that solution might be implemented as a global policy living in a group with some reservations. [...]
-- Michal Hocko SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Mon, 3 Jun 2013, Johannes Weiner wrote: > > It's not necessarily harder if you assign the userspace oom handlers to > > the root of your subtree with access to more memory than the children. > > There is no "inability" to write a proper handler, but when you have > > dozens of individual users implementing their own userspace handlers with > > changing memcg limits over time, then you might find it hard to have > > perfection every time. If we had perfection, we wouldn't have to worry > > about oom in the first place. We can't just let these gazillion memcgs > > sit spinning forever because they get stuck, either. That's why we've > > used this solution for years as a failsafe. Disabling the oom killer > > entirely, even for a memcg, is ridiculous, and if you don't have a grace > > period then oom handlers themselves just don't work. > > It's only ridiculous if your OOM handler is subject to the OOM > situation it's trying to handle. > You're suggesting the oom handler can't be subject to its own memcg limits, independent of the memcg it is handling? If we demand that such a handler be attached to the root memcg, that breaks the memory isolation that memcg provides. We constrain processes to a memcg for the purposes of that isolation, so it cannot use more resources than allotted. > Don't act as if the oom disabling semantics were unreasonable or > inconsistent with the rest of the system, memcgs were not really meant > to be self-policed by the tasks running in them. That's why we have > the waitqueue, so that everybody sits there and waits until an outside > force resolves the situation. There is nothing wrong with that, you > just have a new requirement. > The waitqueue doesn't solve anything with regard to the memory, if the memcg sits there and deadlocks forever then it is using resources (memory, not cpu) that will never be freed. > > I'm talking about the memory the kernel allocates when reading the "tasks" > > file, not userspace. 
This can, and will, return -ENOMEM. > > Do you mean due to kmem limitations? > Yes. > What we could do is allow one task in the group to be the dedicated > OOM handler. If we catch this task in the charge path during an OOM > situation, we fall back to the kernel OOM handler. > I'm not sure it even makes sense to have more than one oom handler per memcg and the synchronization that requires in userspace to get the right result, so I didn't consider dedicating a single oom handler. That would be an entirely new interface, though, since we may have multiple processes waiting on memory.oom_control that aren't necessarily handlers; they grab a snapshot of memory, do logging, etc.
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 11:18:09AM -0700, David Rientjes wrote: > On Sat, 1 Jun 2013, Michal Hocko wrote: > > OK, I assume those groups are generally untrusted, right? So you cannot > > let them register their oom handler even via an admin interface. This > > makes it a bit complicated because it makes much harder demands on the > > handler itself as it has to run under restricted environment. > > That's the point of the patch. We want to allow users to register their > own oom handler in a subtree (they may attach it to their own subtree root > and wait on memory.oom_control of a child memcg with a limit less than > that root) but not insist on an absolutely perfect implementation that can > never fail when you run on many, many servers. Userspace implementations > do fail sometimes, we just accept that. > > > I still do not see why you cannot simply read tasks file into a > > preallocated buffer. This would be few pages even for thousands of pids. > > You do not have to track processes as they come and go. > > What do you suggest when you read the "tasks" file and it returns -ENOMEM > because kmalloc() fails because the userspace oom handler's memcg is also > oom? Obviously it's not a situation we want to get into, but unless you > know that handler's exact memory usage across multiple versions, nothing > else is sharing that memcg, and it's a perfect implementation, you can't > guarantee it. We need to address real world problems that occur in > practice. > > > As I said before. oom_delay_millisecs is actually really easy to be done > > from userspace. If you really need a safety break then you can register > > such a handler as a fallback. I am not familiar with eventfd internals > > much but I guess that multiple handlers are possible. The fallback might > > be enforced by the admin (when a new group is created) or by the > > container itself. Would something like this work for your use case?
> > You're suggesting another userspace process that solely waits for a set > duration and then re-enables the oom killer? It faces all the same > problems as the true userspace oom handler: its own perfect > implementation and its own memcg constraints. > > > > If that user is constrained to his or her own subtree, as previously > > > stated, there's also no way to login and rectify the situation at that > > > point and requires admin intervention or a reboot. > > > > Yes, insisting on the same subtree makes the life much harder for oom > > handlers. I totally agree with you on that. I just feel that introducing > > a new knob to workaround user "inability" to write a proper handler > > (what ever that means) is not justified. > > It's not necessarily harder if you assign the userspace oom handlers to > the root of your subtree with access to more memory than the children. > There is no "inability" to write a proper handler, but when you have > dozens of individual users implementing their own userspace handlers with > changing memcg limits over time, then you might find it hard to have > perfection every time. If we had perfection, we wouldn't have to worry > about oom in the first place. We can't just let these gazillion memcgs > sit spinning forever because they get stuck, either. That's why we've > used this solution for years as a failsafe. Disabling the oom killer > entirely, even for a memcg, is ridiculous, and if you don't have a grace > period then oom handlers themselves just don't work. It's only ridiculous if your OOM handler is subject to the OOM situation it's trying to handle. Don't act as if the oom disabling semantics were unreasonable or inconsistent with the rest of the system, memcgs were not really meant to be self-policed by the tasks running in them. That's why we have the waitqueue, so that everybody sits there and waits until an outside force resolves the situation. There is nothing wrong with that, you just have a new requirement.
> > > Then why does "cat tasks" stall when my memcg is totally depleted of all > > > memory? > > > > if you run it like this then cat obviously needs some charged > > allocations. If you had a proper handler which mlocks its buffer for the > > read syscall then you shouldn't require any allocation at the oom time. > > This shouldn't be that hard to do without too much memory overhead. As I > > said we are talking about few (dozens) of pages per handler. > > > > I'm talking about the memory the kernel allocates when reading the "tasks" file, not userspace. This can, and will, return -ENOMEM. Do you mean due to kmem limitations? What we could do is allow one task in the group to be the dedicated OOM handler. If we catch this task in the charge path during an OOM situation, we fall back to the kernel OOM handler.
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: > > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > > [...] > > > From: Johannes Weiner > > > Subject: [PATCH] memcg: more robust oom handling > > > > > > The memcg OOM handling is incredibly fragile because once a memcg goes > > > OOM, one task (kernel or userspace) is responsible for resolving the > > > situation. Every other task that gets caught trying to charge memory > > > gets stuck in a waitqueue while potentially holding various filesystem > > > and mm locks on which the OOM handling task may now deadlock. > > > > > > Do two things to charge attempts under OOM: > > > > > > 1. Do not trap system calls (buffered IO and friends), just return > > >-ENOMEM. Userspace should be able to handle this... right? > > > > > > 2. Do not trap page faults directly in the charging context. Instead, > > >remember the OOMing memcg in the task struct and fully unwind the > > >page fault stack first. Then synchronize the memcg OOM from > > >pagefault_out_of_memory() > > > > I think this should work and I really like it! Nice work Johannes, I > > never dared to go that deep and my opposite approach was also much more > > fragile. > > > > I am just afraid about all the other archs that do not support (from > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > > openrisc, score and tile). What would be an alternative for them? > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > > would be acceptable for me. > > blackfin is NOMMU but I guess the others should be converted to the > proper OOM protocol anyway and not just kill the faulting task. I can > update them in the next version of the patch (series). It's no longer necessary since I remove the arch-specific flag setting, but I converted them anyway while I was at it. Will send them as a separate patch. 
> > > Not-quite-signed-off-by: Johannes Weiner > > > --- > > > arch/x86/mm/fault.c| 2 + > > > include/linux/memcontrol.h | 6 +++ > > > include/linux/sched.h | 6 +++ > > > mm/memcontrol.c| 104 > > > + > > > mm/oom_kill.c | 7 ++- > > > 5 files changed, 78 insertions(+), 47 deletions(-) > > > > > [...] > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > index e692a02..cf60aef 100644 > > > --- a/include/linux/sched.h > > > +++ b/include/linux/sched.h > > > @@ -1282,6 +1282,8 @@ struct task_struct { > > >* execve */ > > > unsigned in_iowait:1; > > > > > > + unsigned in_userfault:1; > > > + > > > > [This is more a nit pick but before I forget while I am reading through > > the rest of the patch.] > > > > OK there is a lot of room around those bit fields but as this is only > > for memcg and you are enlarging the structure by the pointer then you > > can reuse bottom bit of memcg pointer. > > I just didn't want to put anything in the arch code that looks too > memcgish, even though it's the only user right now. But granted, it > will also probably remain the only user for a while. I tried a couple of variants, including using the lowest memcg bit, but it all turned into more ugliness. So that .in_userfault is still there in v2, but it's now set in handle_mm_fault() in a generic manner depending on a fault flag, please reconsider if you can live with it. > > > @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup > > > *memcg) > > > } > > > > > > /* > > > - * try to call OOM killer. returns false if we should exit > > > memory-reclaim loop. 
> > > + * try to call OOM killer > > > */ > > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > > > - int order) > > > +static void mem_cgroup_oom(struct mem_cgroup *memcg, > > > +gfp_t mask, int order, > > > +bool in_userfault) > > > { > > > - struct oom_wait_info owait; > > > - bool locked, need_to_kill; > > > - > > > - owait.memcg = memcg; > > > - owait.wait.flags = 0; > > > - owait.wait.func = memcg_oom_wake_function; > > > - owait.wait.private = current; > > > - INIT_LIST_HEAD(_list); > > > - need_to_kill = true; > > > - mem_cgroup_mark_under_oom(memcg); > > > + bool locked, need_to_kill = true; > > > > > > /* At first, try to OOM lock hierarchy under memcg.*/ > > > spin_lock(_oom_lock); > > > locked = mem_cgroup_oom_lock(memcg); > > > - /* > > > - * Even if signal_pending(), we can't quit charge() loop without > > > - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL > > > - * under OOM is always welcomed, use TASK_KILLABLE here. > > > - */ > > > - prepare_to_wait(_oom_waitq, , TASK_KILLABLE); > > > - if (!locked || memcg->oom_kill_disable) > > > + if (!locked || memcg->oom_kill_disable) { > > > need_to_kill = false; > > > +
Re: [patch] mm, memcg: add oom killer delay
On Sat, 1 Jun 2013, Michal Hocko wrote: > > Users obviously don't have the ability to attach processes to the root > > memcg. They are constrained to their own subtree of memcgs. > > OK, I assume those groups are generally untrusted, right? So you cannot > let them register their oom handler even via an admin interface. This > makes it a bit complicated because it makes much harder demands on the > handler itself as it has to run under restricted environment. > That's the point of the patch. We want to allow users to register their own oom handler in a subtree (they may attach it to their own subtree root and wait on memory.oom_control of a child memcg with a limit less than that root) but not insist on an absolutely perfect implementation that can never fail when you run on many, many servers. Userspace implementations do fail sometimes, we just accept that. > I still do not see why you cannot simply read tasks file into a > preallocated buffer. This would be few pages even for thousands of pids. > You do not have to track processes as they come and go. > What do you suggest when you read the "tasks" file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice. > As I said before. oom_delay_millisecs is actually really easy to be done > from userspace. If you really need a safety break then you can register > such a handler as a fallback. I am not familiar with eventfd internals > much but I guess that multiple handlers are possible. The fallback might > be enforced by the admin (when a new group is created) or by the > container itself. Would something like this work for your use case?
> You're suggesting another userspace process that solely waits for a set duration and then re-enables the oom killer? It faces all the same problems as the true userspace oom handler: its own perfect implementation and its own memcg constraints. > > If that user is constrained to his or her own subtree, as previously > > stated, there's also no way to login and rectify the situation at that > > point and requires admin intervention or a reboot. > > Yes, insisting on the same subtree makes the life much harder for oom > handlers. I totally agree with you on that. I just feel that introducing > a new knob to workaround user "inability" to write a proper handler > (what ever that means) is not justified. > It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no "inability" to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work. > > Then why does "cat tasks" stall when my memcg is totally depleted of all > > memory? > > if you run it like this then cat obviously needs some charged > allocations. If you had a proper handler which mlocks its buffer for the > read syscall then you shouldn't require any allocation at the oom time. > This shouldn't be that hard to do without too much memory overhead. As I > said we are talking about few (dozens) of pages per handler.
> I'm talking about the memory the kernel allocates when reading the "tasks" file, not userspace. This can, and will, return -ENOMEM.
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 12:48:39, Johannes Weiner wrote: > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: > > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > > [...] > > > From: Johannes Weiner > > > Subject: [PATCH] memcg: more robust oom handling > > > > > > The memcg OOM handling is incredibly fragile because once a memcg goes > > > OOM, one task (kernel or userspace) is responsible for resolving the > > > situation. Every other task that gets caught trying to charge memory > > > gets stuck in a waitqueue while potentially holding various filesystem > > > and mm locks on which the OOM handling task may now deadlock. > > > > > > Do two things to charge attempts under OOM: > > > > > > 1. Do not trap system calls (buffered IO and friends), just return > > >-ENOMEM. Userspace should be able to handle this... right? > > > > > > 2. Do not trap page faults directly in the charging context. Instead, > > >remember the OOMing memcg in the task struct and fully unwind the > > >page fault stack first. Then synchronize the memcg OOM from > > >pagefault_out_of_memory() > > > > I think this should work and I really like it! Nice work Johannes, I > > never dared to go that deep and my opposite approach was also much more > > fragile. > > > > I am just afraid about all the other archs that do not support (from > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > > openrisc, score and tile). What would be an alternative for them? > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > > would be acceptable for me. > > blackfin is NOMMU but I guess the others should be converted to the > proper OOM protocol anyway and not just kill the faulting task. I can > update them in the next version of the patch (series). OK, if you are willing to convert them all then even better. 
> > > Not-quite-signed-off-by: Johannes Weiner > > > --- > > > arch/x86/mm/fault.c| 2 + > > > include/linux/memcontrol.h | 6 +++ > > > include/linux/sched.h | 6 +++ > > > mm/memcontrol.c| 104 > > > + > > > mm/oom_kill.c | 7 ++- > > > 5 files changed, 78 insertions(+), 47 deletions(-) > > > > > [...] > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > index e692a02..cf60aef 100644 > > > --- a/include/linux/sched.h > > > +++ b/include/linux/sched.h > > > @@ -1282,6 +1282,8 @@ struct task_struct { > > >* execve */ > > > unsigned in_iowait:1; > > > > > > + unsigned in_userfault:1; > > > + > > > > [This is more a nit pick but before I forget while I am reading through > > the rest of the patch.] > > > > OK there is a lot of room around those bit fields but as this is only > > for memcg and you are enlarging the structure by the pointer then you > > can reuse bottom bit of memcg pointer. > > I just didn't want to put anything in the arch code that looks too > memcgish, even though it's the only user right now. But granted, it > will also probably remain the only user for a while. OK, no objection. I would just use a more generic name. Something like memcg_oom_can_block. [...`] > > > if (locked) > > > mem_cgroup_oom_notify(memcg); > > > spin_unlock(_oom_lock); > > > > > > if (need_to_kill) { > > > - finish_wait(_oom_waitq, ); > > > mem_cgroup_out_of_memory(memcg, mask, order); > > > - } else { > > > - schedule(); > > > - finish_wait(_oom_waitq, ); > > > + memcg_oom_recover(memcg); > > > > Why do we need to call memcg_oom_recover here? We do not know that any > > charges have been released. Say mem_cgroup_out_of_memory selected a task > > which migrated to our group (without its charges) so we would kill the > > poor guy and free no memory from this group. > > Now you wake up oom waiters to refault but they will end up in the same > > situation. 
I think it should be sufficient to wait for memcg_oom_recover > > until the memory is uncharged (which we do already). > > It's a leftover from how it was before (see the memcg_wakeup_oom > below), but you are right, we can get rid of it. right, I have missed that. Then it would deserve a note in the changelog. Something like " memcg_wakeup_oom should be removed because there is no guarantee that mem_cgroup_out_of_memory was able to free any memory. It could have killed a task which doesn't have any charges from the group. Waiters should be woken up by memcg_oom_recover during uncharge or a limit change. " [...] -- Michal Hocko SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Andrew Morton wrote: > > Admins may set the oom killer delay using the new interface: > > > > # echo 6 > memory.oom_delay_millisecs > > > > This will defer oom killing to the kernel only after 60 seconds has > > elapsed by putting the task to sleep for 60 seconds. > > How often is that delay actually useful, in the real world? > > IOW, in what proportion of cases does the system just remain stuck for > 60 seconds and then get an oom-killing? > It wouldn't be the system, it would just be the oom memcg that would be stuck. We actually use 10s by default, but it's adjustable for users in their own memcg hierarchies. It gives just enough time for userspace to deal with the situation and then defer to the kernel if it's unresponsive, this tends to happen quite regularly when you have many, many servers. Same situation if the userspace oom handler has died and isn't running, perhaps because of its own memory constraints (everything on our systems is memory constrained). Obviously it isn't going to reenable the oom killer before it dies from SIGSEGV. I'd argue that the current functionality that allows users to disable the oom killer for a memcg indefinitely is a bit ridiculous. It requires admin intervention to fix such a state and it would be pointless to have an oom memcg for a week, a month, a year, just completely deadlocked on making forward progress and consuming resources. memory.oom_delay_millisecs in my patch is limited to MAX_SCHEDULE_TIMEOUT just as a sanity check since we currently allow indefinite oom killer disabling. I think if we were to rethink disabling the oom killer entirely via memory.oom_control and realize such a condition over a prolonged period is insane then this memory.oom_delay_millisecs ceiling would be better defined as something in minutes. At the same time, we really like userspace oom notifications so users can implement their own handlers. 
So where's the compromise between instantly oom killing something and waiting forever for userspace to respond? My suggestion is memory.oom_delay_millisecs.
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 06:31:34PM +0200, Michal Hocko wrote: > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > > @@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg) > > { > > /* for filtering, pass "memcg" as argument. */ > > __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); > > + atomic_inc(&memcg->oom_wakeups); > > } > > > > static void memcg_oom_recover(struct mem_cgroup *memcg) > [...] > > + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); > > + /* Only sleep if we didn't miss any wakeups since OOM */ > > + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) > > + schedule(); > > On the way home it occurred to me that the ordering might be wrong here. > The wake up can be lost here. > __wake_up(memcg_oom_waitq) > > prepare_to_wait > atomic_read(&memcg->oom_wakeups) > > atomic_inc(oom_wakeups) > > I guess we want atomic_inc before __wake_up, right? I think you are right, thanks for spotting this. Will be fixed in version 2.
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote:
> On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> [...]
> > From: Johannes Weiner
> > Subject: [PATCH] memcg: more robust oom handling
> >
> > The memcg OOM handling is incredibly fragile because once a memcg goes
> > OOM, one task (kernel or userspace) is responsible for resolving the
> > situation. Every other task that gets caught trying to charge memory
> > gets stuck in a waitqueue while potentially holding various filesystem
> > and mm locks on which the OOM handling task may now deadlock.
> >
> > Do two things to charge attempts under OOM:
> >
> > 1. Do not trap system calls (buffered IO and friends), just return
> >    -ENOMEM. Userspace should be able to handle this... right?
> >
> > 2. Do not trap page faults directly in the charging context. Instead,
> >    remember the OOMing memcg in the task struct and fully unwind the
> >    page fault stack first. Then synchronize the memcg OOM from
> >    pagefault_out_of_memory()
>
> I think this should work and I really like it! Nice work Johannes, I
> never dared to go that deep and my opposite approach was also much more
> fragile.
>
> I am just afraid about all the other archs that do not support it (from
> a quick grep it looks like: blackfin, c6x, h8300, metag, mn10300,
> openrisc, score and tile). What would be an alternative for them?
> #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This
> would be acceptable for me.

blackfin is NOMMU but I guess the others should be converted to the proper OOM protocol anyway and not just kill the faulting task. I can update them in the next version of the patch (series).

> > Not-quite-signed-off-by: Johannes Weiner
> > ---
> >  arch/x86/mm/fault.c        |   2 +
> >  include/linux/memcontrol.h |   6 +++
> >  include/linux/sched.h      |   6 +++
> >  mm/memcontrol.c            | 104 +
> >  mm/oom_kill.c              |   7 ++-
> >  5 files changed, 78 insertions(+), 47 deletions(-)
> >
> [...]
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e692a02..cf60aef 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1282,6 +1282,8 @@ struct task_struct {
> >  				 * execve */
> >  	unsigned in_iowait:1;
> >
> > +	unsigned in_userfault:1;
> > +
>
> [This is more a nit pick but before I forget while I am reading through
> the rest of the patch.]
>
> OK there is a lot of room around those bit fields but as this is only
> for memcg and you are enlarging the structure by the pointer then you
> can reuse bottom bit of memcg pointer.

I just didn't want to put anything in the arch code that looks too memcgish, even though it's the only user right now. But granted, it will also probably remain the only user for a while.

> > @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> >  }
> >
> >  /*
> > - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > + * try to call OOM killer
> >   */
> > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > -				  int order)
> > +static void mem_cgroup_oom(struct mem_cgroup *memcg,
> > +			   gfp_t mask, int order,
> > +			   bool in_userfault)
> >  {
> > -	struct oom_wait_info owait;
> > -	bool locked, need_to_kill;
> > -
> > -	owait.memcg = memcg;
> > -	owait.wait.flags = 0;
> > -	owait.wait.func = memcg_oom_wake_function;
> > -	owait.wait.private = current;
> > -	INIT_LIST_HEAD(&owait.wait.task_list);
> > -	need_to_kill = true;
> > -	mem_cgroup_mark_under_oom(memcg);
> > +	bool locked, need_to_kill = true;
> >
> >  	/* At first, try to OOM lock hierarchy under memcg.*/
> >  	spin_lock(&memcg_oom_lock);
> >  	locked = mem_cgroup_oom_lock(memcg);
> > -	/*
> > -	 * Even if signal_pending(), we can't quit charge() loop without
> > -	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> > -	 * under OOM is always welcomed, use TASK_KILLABLE here.
> > -	 */
> > -	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> > -	if (!locked || memcg->oom_kill_disable)
> > +	if (!locked || memcg->oom_kill_disable) {
> >  		need_to_kill = false;
> > +		if (in_userfault) {
> > +			/*
> > +			 * We start sleeping on the OOM waitqueue only
> > +			 * after unwinding the page fault stack, so
> > +			 * make sure we detect wakeups that happen
> > +			 * between now and then.
> > +			 */
> > +			mem_cgroup_mark_under_oom(memcg);
> > +			current->memcg_oom.wakeups =
> > +				atomic_read(&memcg->oom_wakeups);
> > +			css_get(&memcg->css);
> > +			current->memcg_oom.memcg = memcg;
> > +		}
> > +	}
> >  	if (locked)
> >  		mem_cgroup_oom_notify(memcg);
> >  	spin_unlock(&memcg_oom_lock);
> >
> >  	if (need_to_kill) {
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> @@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg)
>  {
>  	/* for filtering, pass "memcg" as argument. */
>  	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
> +	atomic_inc(&memcg->oom_wakeups);
>  }
>
>  static void memcg_oom_recover(struct mem_cgroup *memcg)
[...]
> +	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> +	/* Only sleep if we didn't miss any wakeups since OOM */
> +	if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
> +		schedule();

On the way home it occurred to me that the ordering might be wrong here. The wake up can be lost:

	__wake_up(&memcg_oom_waitq)
	<preempted>
					prepare_to_wait
					atomic_read(&memcg->oom_wakeups)
	atomic_inc(&memcg->oom_wakeups)

I guess we want atomic_inc before __wake_up, right?
--
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
[...]
> From: Johannes Weiner
> Subject: [PATCH] memcg: more robust oom handling
>
> The memcg OOM handling is incredibly fragile because once a memcg goes
> OOM, one task (kernel or userspace) is responsible for resolving the
> situation. Every other task that gets caught trying to charge memory
> gets stuck in a waitqueue while potentially holding various filesystem
> and mm locks on which the OOM handling task may now deadlock.
>
> Do two things to charge attempts under OOM:
>
> 1. Do not trap system calls (buffered IO and friends), just return
>    -ENOMEM. Userspace should be able to handle this... right?
>
> 2. Do not trap page faults directly in the charging context. Instead,
>    remember the OOMing memcg in the task struct and fully unwind the
>    page fault stack first. Then synchronize the memcg OOM from
>    pagefault_out_of_memory()

I think this should work and I really like it! Nice work Johannes, I never dared to go that deep and my opposite approach was also much more fragile.

I am just afraid about all the other archs that do not support it (from a quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me.

Few comments below.

> Not-quite-signed-off-by: Johannes Weiner
> ---
>  arch/x86/mm/fault.c        |   2 +
>  include/linux/memcontrol.h |   6 +++
>  include/linux/sched.h      |   6 +++
>  mm/memcontrol.c            | 104 +
>  mm/oom_kill.c              |   7 ++-
>  5 files changed, 78 insertions(+), 47 deletions(-)
>
[...]
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e692a02..cf60aef 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1282,6 +1282,8 @@ struct task_struct {
>  				 * execve */
>  	unsigned in_iowait:1;
>
> +	unsigned in_userfault:1;
> +

[This is more a nit pick but before I forget while I am reading through the rest of the patch.]

OK there is a lot of room around those bit fields but as this is only for memcg and you are enlarging the structure by the pointer then you can reuse bottom bit of memcg pointer.

>  	/* task may not gain privileges */
>  	unsigned no_new_privs:1;
>
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cc3026a..6e13ebe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  }
>
>  /*
> - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> + * try to call OOM killer
>   */
> -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> -				  int order)
> +static void mem_cgroup_oom(struct mem_cgroup *memcg,
> +			   gfp_t mask, int order,
> +			   bool in_userfault)
>  {
> -	struct oom_wait_info owait;
> -	bool locked, need_to_kill;
> -
> -	owait.memcg = memcg;
> -	owait.wait.flags = 0;
> -	owait.wait.func = memcg_oom_wake_function;
> -	owait.wait.private = current;
> -	INIT_LIST_HEAD(&owait.wait.task_list);
> -	need_to_kill = true;
> -	mem_cgroup_mark_under_oom(memcg);
> +	bool locked, need_to_kill = true;
>
>  	/* At first, try to OOM lock hierarchy under memcg.*/
>  	spin_lock(&memcg_oom_lock);
>  	locked = mem_cgroup_oom_lock(memcg);
> -	/*
> -	 * Even if signal_pending(), we can't quit charge() loop without
> -	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> -	 * under OOM is always welcomed, use TASK_KILLABLE here.
> -	 */
> -	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> -	if (!locked || memcg->oom_kill_disable)
> +	if (!locked || memcg->oom_kill_disable) {
>  		need_to_kill = false;
> +		if (in_userfault) {
> +			/*
> +			 * We start sleeping on the OOM waitqueue only
> +			 * after unwinding the page fault stack, so
> +			 * make sure we detect wakeups that happen
> +			 * between now and then.
> +			 */
> +			mem_cgroup_mark_under_oom(memcg);
> +			current->memcg_oom.wakeups =
> +				atomic_read(&memcg->oom_wakeups);
> +			css_get(&memcg->css);
> +			current->memcg_oom.memcg = memcg;
> +		}
> +	}
>  	if (locked)
>  		mem_cgroup_oom_notify(memcg);
>  	spin_unlock(&memcg_oom_lock);
>
>  	if (need_to_kill) {
> -		finish_wait(&memcg_oom_waitq, &owait.wait);
>  		mem_cgroup_out_of_memory(memcg, mask, order);
> -	} else {
> -		schedule();
> -		finish_wait(&memcg_oom_waitq, &owait.wait);
> +		memcg_oom_recover(memcg);
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 12:48:39, Johannes Weiner wrote:
> On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote:
> > On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> > [...]
> > > From: Johannes Weiner
> > > Subject: [PATCH] memcg: more robust oom handling
> > >
> > > The memcg OOM handling is incredibly fragile because once a memcg goes
> > > OOM, one task (kernel or userspace) is responsible for resolving the
> > > situation. Every other task that gets caught trying to charge memory
> > > gets stuck in a waitqueue while potentially holding various filesystem
> > > and mm locks on which the OOM handling task may now deadlock.
> > >
> > > Do two things to charge attempts under OOM:
> > >
> > > 1. Do not trap system calls (buffered IO and friends), just return
> > >    -ENOMEM. Userspace should be able to handle this... right?
> > >
> > > 2. Do not trap page faults directly in the charging context. Instead,
> > >    remember the OOMing memcg in the task struct and fully unwind the
> > >    page fault stack first. Then synchronize the memcg OOM from
> > >    pagefault_out_of_memory()
> >
> > I think this should work and I really like it! Nice work Johannes, I
> > never dared to go that deep and my opposite approach was also much more
> > fragile.
> >
> > I am just afraid about all the other archs that do not support it (from
> > a quick grep it looks like: blackfin, c6x, h8300, metag, mn10300,
> > openrisc, score and tile). What would be an alternative for them?
> > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This
> > would be acceptable for me.
>
> blackfin is NOMMU but I guess the others should be converted to the
> proper OOM protocol anyway and not just kill the faulting task. I can
> update them in the next version of the patch (series).

OK, if you are willing to convert them all then even better.

> > > Not-quite-signed-off-by: Johannes Weiner
> > > ---
> > >  arch/x86/mm/fault.c        |   2 +
> > >  include/linux/memcontrol.h |   6 +++
> > >  include/linux/sched.h      |   6 +++
> > >  mm/memcontrol.c            | 104 +
> > >  mm/oom_kill.c              |   7 ++-
> > >  5 files changed, 78 insertions(+), 47 deletions(-)
> > >
> > [...]
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index e692a02..cf60aef 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1282,6 +1282,8 @@ struct task_struct {
> > >  				 * execve */
> > >  	unsigned in_iowait:1;
> > >
> > > +	unsigned in_userfault:1;
> > > +
> >
> > [This is more a nit pick but before I forget while I am reading through
> > the rest of the patch.]
> >
> > OK there is a lot of room around those bit fields but as this is only
> > for memcg and you are enlarging the structure by the pointer then you
> > can reuse bottom bit of memcg pointer.
>
> I just didn't want to put anything in the arch code that looks too
> memcgish, even though it's the only user right now. But granted, it
> will also probably remain the only user for a while.

OK, no objection. I would just use a more generic name. Something like memcg_oom_can_block.

[...]
> > >  	if (locked)
> > >  		mem_cgroup_oom_notify(memcg);
> > >  	spin_unlock(&memcg_oom_lock);
> > >
> > >  	if (need_to_kill) {
> > > -		finish_wait(&memcg_oom_waitq, &owait.wait);
> > >  		mem_cgroup_out_of_memory(memcg, mask, order);
> > > -	} else {
> > > -		schedule();
> > > -		finish_wait(&memcg_oom_waitq, &owait.wait);
> > > +		memcg_oom_recover(memcg);
> >
> > Why do we need to call memcg_oom_recover here? We do not know that any
> > charges have been released. Say mem_cgroup_out_of_memory selected a task
> > which migrated to our group (without its charges) so we would kill the
> > poor guy and free no memory from this group. Now you wake up oom waiters
> > to refault but they will end up in the same situation. I think it should
> > be sufficient to wait for memcg_oom_recover until the memory is
> > uncharged (which we do already).
>
> It's a leftover from how it was before (see the memcg_wakeup_oom below),
> but you are right, we can get rid of it.

right, I have missed that. Then it would deserve a note in the changelog. Something like "memcg_wakeup_oom should be removed because there is no guarantee that mem_cgroup_out_of_memory was able to free any memory. It could have killed a task which doesn't have any charges from the group. Waiters should be woken up by memcg_oom_recover during uncharge or a limit change."

[...]
--
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Sat, 1 Jun 2013, Michal Hocko wrote:
> > Users obviously don't have the ability to attach processes to the root
> > memcg. They are constrained to their own subtree of memcgs.
>
> OK, I assume those groups are generally untrusted, right? So you cannot
> let them register their oom handler even via an admin interface. This
> makes it a bit complicated because it makes much harder demands on the
> handler itself as it has to run under a restricted environment.

That's the point of the patch. We want to allow users to register their own oom handler in a subtree (they may attach it to their own subtree root and wait on memory.oom_control of a child memcg with a limit less than that root) but not insist on an absolutely perfect implementation that can never fail when you run on many, many servers. Userspace implementations do fail sometimes, we just accept that.

> I still do not see why you cannot simply read the tasks file into a
> preallocated buffer. This would be a few pages even for thousands of
> pids. You do not have to track processes as they come and go.

What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice.

> As I said before, oom_delay_millisecs is actually really easy to be
> done from userspace. If you really need a safety break then you can
> register such a handler as a fallback. I am not familiar with eventfd
> internals much but I guess that multiple handlers are possible. The
> fallback might be enforced by the admin (when a new group is created)
> or by the container itself. Would something like this work for your use
> case?

You're suggesting another userspace process that solely waits for a set duration and then reenables the oom killer? It faces all the same problems as the true userspace oom handler: its own perfect implementation and its own memcg constraints. If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point, and it requires admin intervention or a reboot.

> Yes, insisting on the same subtree makes the life much harder for oom
> handlers. I totally agree with you on that. I just feel that introducing
> a new knob to work around user inability to write a proper handler
> (whatever that means) is not justified.

It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no inability to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work.

> > Then why does cat tasks stall when my memcg is totally depleted of all
> > memory?
>
> if you run it like this then cat obviously needs some charged
> allocations. If you had a proper handler which mlocks its buffer for the
> read syscall then you shouldn't require any allocation at the oom time.
> This shouldn't be that hard to do without too much memory overhead. As I
> said we are talking about a few (dozens) of pages per handler.

I'm talking about the memory the kernel allocates when reading the tasks file, not userspace. This can, and will, return -ENOMEM.
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: more robust oom handling The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things to charge attempts under OOM: 1. Do not trap system calls (buffered IO and friends), just return -ENOMEM. Userspace should be able to handle this... right? 2. Do not trap page faults directly in the charging context. Instead, remember the OOMing memcg in the task struct and fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory() I think this should work and I really like it! Nice work Johannes, I never dared to go that deep and my opposite approach was also much more fragile. I am just afraid about all the other archs that do not support (from quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me. blackfin is NOMMU but I guess the others should be converted to the proper OOM protocol anyway and not just kill the faulting task. I can update them in the next version of the patch (series). It's no longer necessary since I remove the arch-specific flag setting, but I converted them anyway while I was at it. Will send them as a separate patch. 
Not-quite-signed-off-by: Johannes Weiner han...@cmpxchg.org
---
 arch/x86/mm/fault.c        |   2 +
 include/linux/memcontrol.h |   6 +++
 include/linux/sched.h      |   6 +++
 mm/memcontrol.c            | 104 +
 mm/oom_kill.c              |   7 ++-
 5 files changed, 78 insertions(+), 47 deletions(-)

[...]

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..cf60aef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1282,6 +1282,8 @@ struct task_struct {
 				 * execve */
 	unsigned in_iowait:1;

+	unsigned in_userfault:1;
+

[This is more a nit pick but before I forget while I am reading through the rest of the patch.]

OK there is a lot of room around those bit fields but as this is only for memcg and you are enlarging the structure by the pointer then you can reuse the bottom bit of the memcg pointer.

I just didn't want to put anything in the arch code that looks too memcgish, even though it's the only user right now. But granted, it will also probably remain the only user for a while. I tried a couple of variants, including using the lowest memcg bit, but it all turned into more ugliness. So that .in_userfault is still there in v2, but it's now set in handle_mm_fault() in a generic manner depending on a fault flag, please reconsider if you can live with it.

@@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 }

 /*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
  */
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
-				  int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg,
+			   gfp_t mask, int order,
+			   bool in_userfault)
 {
-	struct oom_wait_info owait;
-	bool locked, need_to_kill;
-
-	owait.memcg = memcg;
-	owait.wait.flags = 0;
-	owait.wait.func = memcg_oom_wake_function;
-	owait.wait.private = current;
-	INIT_LIST_HEAD(&owait.wait.task_list);
-	need_to_kill = true;
-	mem_cgroup_mark_under_oom(memcg);
+	bool locked, need_to_kill = true;

 	/* At first, try to OOM lock hierarchy under memcg. */
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
-	/*
-	 * Even if signal_pending(), we can't quit charge() loop without
-	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
-	 * under OOM is always welcomed, use TASK_KILLABLE here.
-	 */
-	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
-	if (!locked || memcg->oom_kill_disable)
+	if (!locked || memcg->oom_kill_disable) {
 		need_to_kill = false;
+		if (in_userfault) {
+			/*
+			 * We start sleeping on the OOM waitqueue only
+			 * after unwinding the page fault stack, so
+			 * make
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 11:18:09AM -0700, David Rientjes wrote:
On Sat, 1 Jun 2013, Michal Hocko wrote:

OK, I assume those groups are generally untrusted, right? So you cannot let them register their oom handler even via an admin interface. This makes it a bit complicated because it makes much harder demands on the handler itself as it has to run under a restricted environment.

That's the point of the patch. We want to allow users to register their own oom handler in a subtree (they may attach it to their own subtree root and wait on memory.oom_control of a child memcg with a limit less than that root) but not insist on an absolutely perfect implementation that can never fail when you run on many, many servers. Userspace implementations do fail sometimes, we just accept that.

I still do not see why you cannot simply read the tasks file into a preallocated buffer. This would be a few pages even for thousands of pids. You do not have to track processes as they come and go.

What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice.

As I said before, oom_delay_millisecs is actually really easy to be done from userspace. If you really need a safety break then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case?

You're suggesting another userspace process that solely waits for a set duration and then reenables the oom killer?
It faces all the same problems as the true userspace oom handler: its own perfect implementation and its own memcg constraints. If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point, and it requires admin intervention or a reboot.

Yes, insisting on the same subtree makes life much harder for oom handlers. I totally agree with you on that. I just feel that introducing a new knob to work around user inability to write a proper handler (whatever that means) is not justified.

It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no inability to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work.

It's only ridiculous if your OOM handler is subject to the OOM situation it's trying to handle. Don't act as if the oom disabling semantics were unreasonable or inconsistent with the rest of the system; memcgs were not really meant to be self-policed by the tasks running in them. That's why we have the waitqueue, so that everybody sits there and waits until an outside force resolves the situation. There is nothing wrong with that, you just have a new requirement.

Then why does "cat tasks" stall when my memcg is totally depleted of all memory?

If you run it like this then cat obviously needs some charged allocations.
If you had a proper handler which mlocks its buffer for the read syscall then you shouldn't require any allocation at the oom time. This shouldn't be that hard to do without too much memory overhead. As I said, we are talking about a few (dozens of) pages per handler.

I'm talking about the memory the kernel allocates when reading the tasks file, not userspace. This can, and will, return -ENOMEM.

Do you mean due to kmem limitations?

What we could do is allow one task in the group to be the dedicated OOM handler. If we catch this task in the charge path during an OOM situation, we fall back to the kernel OOM handler.
Re: [patch] mm, memcg: add oom killer delay
On Mon, 3 Jun 2013, Johannes Weiner wrote:

It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no inability to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work.

It's only ridiculous if your OOM handler is subject to the OOM situation it's trying to handle.

You're suggesting the oom handler can't be subject to its own memcg limits, independent of the memcg it is handling? If we demand that such a handler be attached to the root memcg, that breaks the memory isolation that memcg provides. We constrain processes to a memcg for the purposes of that isolation, so it cannot use more resources than allotted.

Don't act as if the oom disabling semantics were unreasonable or inconsistent with the rest of the system, memcgs were not really meant to be self-policed by the tasks running in them. That's why we have the waitqueue, so that everybody sits there and waits until an outside force resolves the situation. There is nothing wrong with that, you just have a new requirement.

The waitqueue doesn't solve anything with regard to the memory; if the memcg sits there and deadlocks forever then it is using resources (memory, not cpu) that will never be freed.

I'm talking about the memory the kernel allocates when reading the tasks file, not userspace. This can, and will, return -ENOMEM.

Do you mean due to kmem limitations?

Yes.
What we could do is allow one task in the group to be the dedicated OOM handler. If we catch this task in the charge path during an OOM situation, we fall back to the kernel OOM handler.

I'm not sure it even makes sense to have more than one oom handler per memcg and the synchronization that requires in userspace to get the right result, so I didn't consider dedicating a single oom handler. That would be an entirely new interface, though, since we may have multiple processes waiting on memory.oom_control that aren't necessarily handlers; they grab a snapshot of memory, do logging, etc.
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 11:18:09, David Rientjes wrote:
On Sat, 1 Jun 2013, Michal Hocko wrote:
[...]

I still do not see why you cannot simply read the tasks file into a preallocated buffer. This would be a few pages even for thousands of pids. You do not have to track processes as they come and go.

What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom?

That would require that you track kernel allocations, which is currently done only for explicit caches.

Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice.

If you really need to have such a guarantee then you can have a _global_ watchdog observing oom_control of all groups that provide such vague requirements for oom user handlers.

As I said before, oom_delay_millisecs is actually really easy to be done from userspace. If you really need a safety break then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case?

You're suggesting another userspace process that solely waits for a set duration and then reenables the oom killer?

Yes, which kicks the oom killer.

It faces all the same problems as the true userspace oom handler: its own perfect implementation and its own memcg constraints.

But that solution might be implemented as a global policy living in a group with some reservations.

[...]
--
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Mon, 3 Jun 2013, Michal Hocko wrote:

What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom?

That would require that you track kernel allocations, which is currently done only for explicit caches.

That will not always be the case, and I think this could be a prerequisite patch for such support that we have internally. I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file.

Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice.

If you really need to have such a guarantee then you can have a _global_ watchdog observing oom_control of all groups that provide such vague requirements for oom user handlers.

The whole point is to allow the user to implement their own oom policy. If the policy were completely encapsulated in kernel code, we wouldn't need to ever disable the oom killer even with memory.oom_control. Users may choose to kill the largest process, the newest process, the oldest process, sacrifice children instead of parents, prevent forkbombs, implement their own priority scoring (which is what we do), kill the allocating task, etc.

To not merge this patch, I'd ask that you show an alternative that allows users to implement their own userspace oom handlers and not require admin intervention when things go wrong.
Re: [patch] mm, memcg: add oom killer delay
From: Johannes Weiner han...@cmpxchg.org
Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context

The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock.

Do two things:

1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but do not trap other tasks and just return -ENOMEM for everyone. Userspace should be able to handle this... right?

2. When OOMing in a page fault, invoke the OOM killer but do not trap other chargers directly in the charging code. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory().

While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups.

Signed-off-by: Johannes Weiner han...@cmpxchg.org

From the memcg oom notification point of view, the case where an oom handler process is subject to its own memcg limit is NOT supported. All memcg developers have clearly agreed on that since the beginning. Even so, people have a right to shoot their own foot. :)

However, this patch fixes more than that. OK, I prefer it. Good fix!

Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 12:09:22PM -0700, David Rientjes wrote:
On Mon, 3 Jun 2013, Johannes Weiner wrote:

It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no inability to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work.

It's only ridiculous if your OOM handler is subject to the OOM situation it's trying to handle.

You're suggesting the oom handler can't be subject to its own memcg limits, independent of the memcg it is handling? If we demand that such a handler be attached to the root memcg, that breaks the memory isolation that memcg provides. We constrain processes to a memcg for the purposes of that isolation, so it cannot use more resources than allotted.

I guess the traditional use case is that you have a job manager that you trust, that sets up the groups, configures their limits etc. and would also know more about the jobs than the kernel to act as the OOM killer as well. Since it's trusted and not expected to consume any significant amounts of memory by itself, the memcg code assumes that it does not run in a cgroup itself; it's just not thought of as being part of the application class. I'm not saying it should necessarily stay that way, but it's also not a completely ridiculous model.

What we could do is allow one task in the group to be the dedicated OOM handler.
If we catch this task in the charge path during an OOM situation, we fall back to the kernel OOM handler.

I'm not sure it even makes sense to have more than one oom handler per memcg and the synchronization that requires in userspace to get the right result, so I didn't consider dedicating a single oom handler. That would be an entirely new interface, though, since we may have multiple processes waiting on memory.oom_control that aren't necessarily handlers; they grab a snapshot of memory, do logging, etc.

It will probably be hard to extend the existing oom_control interface. But we could add a separate one where you put in a pid or 0 (kernel) instead of a boolean value, which then enables or disables the userspace OOM handling task for the whole subtree. If that task enters the OOM path, the kernel handler is invoked. If the task dies, the kernel handler is permanently re-enabled. Everybody is free to poll that interface for OOM notifications, not only that one task.

Combined with the "enter waitqueue after unwinding page fault stack" patch, would this fully cover your usecase?
Re: [patch] mm, memcg: add oom killer delay
On Sat, Jun 01, 2013 at 12:29:05PM +0200, Michal Hocko wrote: > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > [...] > > I'm currently messing around with the below patch. When a task faults > > and charges under OOM, the memcg is remembered in the task struct and > > then made to sleep on the memcg's OOM waitqueue only after unwinding > > the page fault stack. With the kernel OOM killer disabled, all tasks > > in the OOMing group sit nicely in > > > > mem_cgroup_oom_synchronize > > pagefault_out_of_memory > > mm_fault_error > > __do_page_fault > > page_fault > > 0x > > > > regardless of whether they were faulting anon or file. They do not > > even hold the mmap_sem anymore at this point. > > > > [ I kept syscalls really simple for now and just have them return > > -ENOMEM, never trap them at all (just like the global OOM case). > > It would be more work to have them wait from a flatter stack too, > > but it should be doable if necessary. ] > > > > I suggested this at the MM summit and people were essentially asking > > if I was feeling well, so maybe I'm still missing a gaping hole in > > this idea. > > I didn't get to look at the patch (will do on Monday) but it doesn't > sounds entirely crazy. Well, we would have to drop mmap_sem so things > have to be rechecked but we are doing that already with VM_FAULT_RETRY > in some archs so it should work. The global OOM case has been doing this for a long time (1c0fe6e mm: invoke oom-killer from page fault), way before VM_FAULT_RETRY. The fault is aborted with VM_FAULT_OOM and the oom killer is invoked. Either the task gets killed or it'll just retrigger the fault. The only difference is that a memcg OOM kill may take longer because of userspace handling, so memcg needs a waitqueue where the global case simply does a trylock (try_set_zonelist_oom) and restarts the fault immediately if somebody else is handling the situation. 
In fact, when Nick added the page fault OOM invocation, KAME merged something similar to my patch, which tried to catch memcg OOM kills from pagefault_out_of_memory() (a636b32 "memcg: avoid unnecessary system-wide-oom-killer"). Only when he reworked the whole memcg OOM synchronization and added the ability to disable OOM, the waitqueues etc. were the OOMs trapped right there in the charge context (867578c "memcg: fix oom kill behavior"). But I see no reason why we shouldn't be able to keep the waitqueues and still go back to synchronizing from the bottom of the page fault stack.
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
[...]
> I'm currently messing around with the below patch. When a task faults
> and charges under OOM, the memcg is remembered in the task struct and
> then made to sleep on the memcg's OOM waitqueue only after unwinding
> the page fault stack. With the kernel OOM killer disabled, all tasks
> in the OOMing group sit nicely in
>
> mem_cgroup_oom_synchronize
> pagefault_out_of_memory
> mm_fault_error
> __do_page_fault
> page_fault
> 0x
>
> regardless of whether they were faulting anon or file. They do not
> even hold the mmap_sem anymore at this point.
>
> [ I kept syscalls really simple for now and just have them return
> -ENOMEM, never trap them at all (just like the global OOM case).
> It would be more work to have them wait from a flatter stack too,
> but it should be doable if necessary. ]
>
> I suggested this at the MM summit and people were essentially asking
> if I was feeling well, so maybe I'm still missing a gaping hole in
> this idea.

I didn't get to look at the patch (will do on Monday) but it doesn't sound entirely crazy. Well, we would have to drop mmap_sem so things have to be rechecked but we are doing that already with VM_FAULT_RETRY in some archs so it should work.

> Patch only works on x86 as of now, on other architectures memcg OOM
> will invoke the global OOM killer.

[...]
--
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 12:29:17, David Rientjes wrote: > On Fri, 31 May 2013, Michal Hocko wrote: > > > > We allow users to control their own memcgs by chowning them, so they must > > > be run in the same hierarchy if they want to run their own userspace oom > > > handler. There's nothing in the kernel that prevents that and the user > > > has no other option but to run in a parent cgroup. > > > > If the access to the oom_control file is controlled by the file > > permissions then the oom handler can live inside root cgroup. Why do you > > need "must be in the same hierarchy" requirement? > > > > Users obviously don't have the ability to attach processes to the root > memcg. They are constrained to their own subtree of memcgs. OK, I assume those groups are generally untrusted, right? So you cannot let them register their oom handler even via an admin interface. This makes it a bit complicated because it makes much harder demands on the handler itself as it has to run under restricted environment. > > > It's too easy to simply do even a "ps ax" in an oom memcg and make that > > > thread completely unresponsive because it allocates memory. > > > > Yes, but we are talking about oom handler and that one has to be really > > careful about what it does. So doing something that simply allocates is > > dangerous. > > > > Show me a userspace oom handler that doesn't get notified of every fork() > in a memcg, causing a performance degradation of its own for a complete > and utter slowpath, that will know the entire process tree of its own > memcg or a child memcg. I still do not see why you cannot simply read tasks file into a preallocated buffer. This would be few pages even for thousands of pids. You do not have to track processes as they come and go. > This discussion is all fine and good from a theoretical point of view > until you actually have to implement one of these yourself. 
> Multiple users are going to be running their own oom notifiers and without
> some sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too
> easily deadlock looking for memory.

As I said before, oom_delay_millisecs is actually really easy to be done from userspace. If you really need a safety break then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case?

> If that user is constrained to his or her own subtree, as previously
> stated, there's also no way to login and rectify the situation at that
> point and requires admin intervention or a reboot.

Yes, insisting on the same subtree makes life much harder for oom handlers. I totally agree with you on that. I just feel that introducing a new knob to work around user "inability" to write a proper handler (whatever that means) is not justified.

> > > Then perhaps I'm raising constraints that you've never worked with, I
> > > don't know. We choose to have a priority-based approach that is inherited
> > > by children; this priority is kept in userspace and the oom handler
> > > would naturally need to know the set of tasks in the oom memcg at the
> > > time of oom and their parent-child relationship. These priorities are
> > > completely independent of memory usage.
> >
> > OK, but both reading tasks file and readdir should be doable without
> > userspace (aka charged) allocations. Moreover if you run those oom
> > handlers under the root cgroup then it should be even easier.
>
> Then why does "cat tasks" stall when my memcg is totally depleted of all
> memory?

If you run it like this then cat obviously needs some charged allocations.
If you had a proper handler which mlocks its buffer for the read syscall then you shouldn't require any allocation at the oom time. This shouldn't be that hard to do without too much memory overhead. As I said, we are talking about a few (dozens of) pages per handler.

> This isn't even the argument because memory.oom_delay_millisecs isn't
> going to help that situation. I'm talking about a failsafe that ensures a
> memcg can't deadlock. The global oom killer will always have to exist in
> the kernel, at least in the most simplistic of forms, solely for this
> reason; you can't move all of the logic to userspace and expect it to
> react 100% of the time.

The global case is even more complicated because every allocation matters, not just those that are charged as for the memcg case. Btw. at last LSF there was a discussion about enabling oom_control for the root cgroup but this is off-topic here and should be discussed separately when somebody actually tries to implement this.

--
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Fri, May 31, 2013 at 12:29:17PM -0700, David Rientjes wrote: > On Fri, 31 May 2013, Michal Hocko wrote: > > > It's too easy to simply do even a "ps ax" in an oom memcg and make that > > > thread completely unresponsive because it allocates memory. > > > > Yes, but we are talking about oom handler and that one has to be really > > careful about what it does. So doing something that simply allocates is > > dangerous. > > > > Show me a userspace oom handler that doesn't get notified of every fork() > in a memcg, causing a performance degradation of its own for a complete > and utter slowpath, that will know the entire process tree of its own > memcg or a child memcg. > > This discussion is all fine and good from a theoretical point of view > until you actually have to implement one of these yourself. Multiple > users are going to be running their own oom notifiers and without some > sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too > easily deadlock looking for memory. If that user is constrained to his or > her own subtree, as previously stated, there's also no way to login and > rectify the situation at that point and requires admin intervention or a > reboot. Memcg OOM killing is fragile, doing it in userspace does not help. But a userspace OOM handler that is subject to the OOM situation it is supposed to resolve? That's kind of... cruel. However, I agree with your other point: as long as chargers can get stuck during OOM while sitting on an unpredictable bunch of locks, we can not claim that there is a right way to write a reliable userspace OOM handler. Userspace has no way of knowing which operations are safe, and it might even vary between kernel versions. But I would prefer a better solution than the timeout. Michal in the past already tried to disable the OOM killer for page cache charges in general due to deadlock scenarios. Could we invoke the killer just the same but make sure we never trap tasks in the charge context? 
I'm currently messing around with the below patch. When a task faults and charges under OOM, the memcg is remembered in the task struct and then made to sleep on the memcg's OOM waitqueue only after unwinding the page fault stack. With the kernel OOM killer disabled, all tasks in the OOMing group sit nicely in

  mem_cgroup_oom_synchronize
  pagefault_out_of_memory
  mm_fault_error
  __do_page_fault
  page_fault
  0x

regardless of whether they were faulting anon or file. They do not even hold the mmap_sem anymore at this point.

[ I kept syscalls really simple for now and just have them return
  -ENOMEM, never trap them at all (just like the global OOM case).
  It would be more work to have them wait from a flatter stack too,
  but it should be doable if necessary. ]

I suggested this at the MM summit and people were essentially asking if I was feeling well, so maybe I'm still missing a gaping hole in this idea.

Patch only works on x86 as of now, on other architectures memcg OOM will invoke the global OOM killer.

---
From: Johannes Weiner
Subject: [PATCH] memcg: more robust oom handling

The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock.

Do two things to charge attempts under OOM:

1. Do not trap system calls (buffered IO and friends), just return -ENOMEM. Userspace should be able to handle this... right?

2. Do not trap page faults directly in the charging context. Instead, remember the OOMing memcg in the task struct and fully unwind the page fault stack first.
Then synchronize the memcg OOM from pagefault_out_of_memory() Not-quite-signed-off-by: Johannes Weiner --- arch/x86/mm/fault.c| 2 + include/linux/memcontrol.h | 6 +++ include/linux/sched.h | 6 +++ mm/memcontrol.c| 104 + mm/oom_kill.c | 7 ++- 5 files changed, 78 insertions(+), 47 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 0e88336..df043a0 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1185,7 +1185,9 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault: */ + current->in_userfault = 1; fault = handle_mm_fault(mm, vma, address, flags); + current->in_userfault = 0; if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { if (mm_fault_error(regs, error_code, address, fault)) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d6183f0..e1b84ea 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -121,6 +121,7 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec
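The control flow the (truncated) patch describes can be summarized in C-like pseudocode; mem_cgroup_oom_synchronize, in_userfault, and pagefault_out_of_memory come from the patch itself, while the memcg_oom field name and the helper names are only illustrative:

```
/* charge path: never sleep here, just record the OOMing memcg */
if (charge_failed_due_to_oom(memcg)) {
	if (current->in_userfault)
		current->memcg_oom = memcg;	/* remembered in task struct */
	return -ENOMEM;				/* unwind; syscalls fail outright */
}

/* bottom of the page fault stack, mmap_sem and fs locks dropped */
void pagefault_out_of_memory(void)
{
	if (current->memcg_oom) {
		mem_cgroup_oom_synchronize();	/* sleep on memcg OOM waitqueue */
		return;				/* then retry the fault */
	}
	global_out_of_memory();			/* non-memcg OOM as before */
}
```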
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 12:29:17, David Rientjes wrote:
> On Fri, 31 May 2013, Michal Hocko wrote:
> > > We allow users to control their own memcgs by chowning them, so they
> > > must be run in the same hierarchy if they want to run their own
> > > userspace oom handler. There's nothing in the kernel that prevents
> > > that and the user has no other option but to run in a parent cgroup.
> >
> > If the access to the oom_control file is controlled by the file
> > permissions then the oom handler can live inside root cgroup. Why do
> > you need "must be in the same hierarchy" requirement?
>
> Users obviously don't have the ability to attach processes to the root
> memcg. They are constrained to their own subtree of memcgs.

OK, I assume those groups are generally untrusted, right? So you cannot let them register their oom handler even via an admin interface. This makes it a bit complicated because it makes much harder demands on the handler itself as it has to run under a restricted environment.

> > > It's too easy to simply do even a "ps ax" in an oom memcg and make
> > > that thread completely unresponsive because it allocates memory.
> >
> > Yes, but we are talking about oom handler and that one has to be really
> > careful about what it does. So doing something that simply allocates is
> > dangerous.
>
> Show me a userspace oom handler that doesn't get notified of every fork()
> in a memcg, causing a performance degradation of its own for a complete
> and utter slowpath, that will know the entire process tree of its own
> memcg or a child memcg.

I still do not see why you cannot simply read the tasks file into a preallocated buffer. This would be a few pages even for thousands of pids. You do not have to track processes as they come and go.

> This discussion is all fine and good from a theoretical point of view
> until you actually have to implement one of these yourself. Multiple
> users are going to be running their own oom notifiers and without some
> sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too
> easily deadlock looking for memory.

As I said before, oom_delay_millisecs is actually really easy to do from userspace. If you really need a safety brake then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case?

> If that user is constrained to his or her own subtree, as previously
> stated, there's also no way to login and rectify the situation at that
> point and it requires admin intervention or a reboot.

Yes, insisting on the same subtree makes life much harder for oom handlers. I totally agree with you on that. I just feel that introducing a new knob to work around user inability to write a proper handler (whatever that means) is not justified.

> > > Then perhaps I'm raising constraints that you've never worked with, I
> > > don't know. We choose to have a priority-based approach that is
> > > inherited by children; this priority is kept in userspace and the oom
> > > handler would naturally need to know the set of tasks in the oom
> > > memcg at the time of oom and their parent-child relationship. These
> > > priorities are completely independent of memory usage.
> >
> > OK, but both reading tasks file and readdir should be doable without
> > userspace (aka charged) allocations. Moreover if you run those oom
> > handlers under the root cgroup then it should be even easier.
>
> Then why does "cat tasks" stall when my memcg is totally depleted of all
> memory?

If you run it like this then cat obviously needs some charged allocations. If you had a proper handler which mlocks its buffer for the read syscall then you shouldn't require any allocation at oom time. This shouldn't be that hard to do without too much memory overhead. As I said, we are talking about a few (dozens of) pages per handler.

> This isn't even the argument because memory.oom_delay_millisecs isn't
> going to help that situation. I'm talking about a failsafe that ensures a
> memcg can't deadlock. The global oom killer will always have to exist in
> the kernel, at least in the most simplistic of forms, solely for this
> reason; you can't move all of the logic to userspace and expect it to
> react 100% of the time.

The global case is even more complicated because every allocation matters - not just those that are charged as in the memcg case.

Btw. at last LSF there was a discussion about enabling oom_control for the root cgroup but this is off-topic here and should be discussed separately when somebody actually tries to implement this.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
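[ Editorial sketch: Michal's fallback idea above - a second, trusted handler on the same memory.oom_control eventfd that waits out a grace period and then hands the memcg back to the kernel OOM killer - might look roughly like this against the cgroup v1 interface. The memcg path and delay are caller-supplied, error handling is minimal, and this has not been run against a real container setup; the registration protocol (writing "<event_fd> <fd of memory.oom_control>" to cgroup.event_control) is the documented v1 mechanism. ]

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Build the "<event_fd> <control_fd>" registration string that the
 * cgroup v1 cgroup.event_control file expects. */
int format_event_control(char *buf, size_t size, int efd, int cfd)
{
	return snprintf(buf, size, "%d %d", efd, cfd);
}

/* Fallback handler loop: block on the OOM notification, give the
 * primary handler a grace period, then re-enable the kernel OOM
 * killer by writing 0 to memory.oom_control. */
int fallback_oom_handler(const char *memcg_path, unsigned int delay_secs)
{
	char path[512], line[64];
	uint64_t events;
	int efd, cfd, ecfd, ofd;

	if ((efd = eventfd(0, 0)) < 0)
		return -1;
	snprintf(path, sizeof(path), "%s/memory.oom_control", memcg_path);
	if ((cfd = open(path, O_RDONLY)) < 0)
		return -1;
	snprintf(path, sizeof(path), "%s/cgroup.event_control", memcg_path);
	if ((ecfd = open(path, O_WRONLY)) < 0)
		return -1;

	format_event_control(line, sizeof(line), efd, cfd);
	if (write(ecfd, line, strlen(line)) < 0)
		return -1;

	for (;;) {
		/* Blocks until the memcg hits OOM (with oom_kill disabled). */
		if (read(efd, &events, sizeof(events)) != sizeof(events))
			return -1;
		sleep(delay_secs);	/* grace period for the primary handler */
		snprintf(path, sizeof(path), "%s/memory.oom_control", memcg_path);
		if ((ofd = open(path, O_WRONLY)) < 0 ||
		    write(ofd, "0\n", 2) < 0)
			return -1;	/* could not re-enable the kernel killer */
		close(ofd);
	}
}
```

This is exactly the "timeout based killer" done from userspace: the knob becomes an argument to a handler instead of a kernel interface.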
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
[...]
> I'm currently messing around with the below patch. When a task faults
> and charges under OOM, the memcg is remembered in the task struct and
> then made to sleep on the memcg's OOM waitqueue only after unwinding
> the page fault stack. With the kernel OOM killer disabled, all tasks
> in the OOMing group sit nicely in
>
>   mem_cgroup_oom_synchronize
>   pagefault_out_of_memory
>   mm_fault_error
>   __do_page_fault
>   page_fault
>   0x
>
> regardless of whether they were faulting anon or file. They do not
> even hold the mmap_sem anymore at this point.
>
> [ I kept syscalls really simple for now and just have them return
>   -ENOMEM, never trap them at all (just like the global OOM case). It
>   would be more work to have them wait from a flatter stack too, but
>   it should be doable if necessary. ]
>
> I suggested this at the MM summit and people were essentially asking
> if I was feeling well, so maybe I'm still missing a gaping hole in
> this idea.

I didn't get to look at the patch (will do on Monday) but it doesn't sound entirely crazy. Well, we would have to drop mmap_sem so things have to be rechecked but we are doing that already with VM_FAULT_RETRY in some archs so it should work.

> Patch only works on x86 as of now, on other architectures memcg OOM
> will invoke the global OOM killer.
[...]
-- 
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Sat, Jun 01, 2013 at 12:29:05PM +0200, Michal Hocko wrote:
> On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> [...]
> > I'm currently messing around with the below patch. When a task faults
> > and charges under OOM, the memcg is remembered in the task struct and
> > then made to sleep on the memcg's OOM waitqueue only after unwinding
> > the page fault stack. With the kernel OOM killer disabled, all tasks
> > in the OOMing group sit nicely in
> >
> >   mem_cgroup_oom_synchronize
> >   pagefault_out_of_memory
> >   mm_fault_error
> >   __do_page_fault
> >   page_fault
> >   0x
> >
> > regardless of whether they were faulting anon or file. They do not
> > even hold the mmap_sem anymore at this point.
> >
> > [ I kept syscalls really simple for now and just have them return
> >   -ENOMEM, never trap them at all (just like the global OOM case). It
> >   would be more work to have them wait from a flatter stack too, but
> >   it should be doable if necessary. ]
> >
> > I suggested this at the MM summit and people were essentially asking
> > if I was feeling well, so maybe I'm still missing a gaping hole in
> > this idea.
>
> I didn't get to look at the patch (will do on Monday) but it doesn't
> sound entirely crazy. Well, we would have to drop mmap_sem so things
> have to be rechecked but we are doing that already with VM_FAULT_RETRY
> in some archs so it should work.

The global OOM case has been doing this for a long time (1c0fe6e mm: invoke oom-killer from page fault), way before VM_FAULT_RETRY. The fault is aborted with VM_FAULT_OOM and the oom killer is invoked. Either the task gets killed or it'll just retrigger the fault. The only difference is that a memcg OOM kill may take longer because of userspace handling, so memcg needs a waitqueue where the global case simply does a trylock (try_set_zonelist_oom) and restarts the fault immediately if somebody else is handling the situation.

In fact, when Nick added the page fault OOM invocation, KAME merged something similar to my patch, which tried to catch memcg OOM kills from pagefault_out_of_memory() (a636b32 memcg: avoid unnecessary system-wide-oom-killer). Only when he reworked the whole memcg OOM synchronization, added the ability to disable OOM and the waitqueues etc., were the OOMs trapped right there in the charge context (867578c memcg: fix oom kill behavior). But I see no reason why we shouldn't be able to keep the waitqueues and still go back to synchronizing from the bottom of the page fault stack.
Re: [patch] mm, memcg: add oom killer delay
On Wed, 29 May 2013 18:18:10 -0700 (PDT) David Rientjes wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because it > is unresponsive. This scenario creates a memcg deadlock: tasks are > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > exit or move, or the oom killer reenabled and userspace is unable to do > so. > > An additional possible use case is to defer oom killing within a memcg > for a set period of time, probably to prevent unnecessary kills due to > temporary memory spikes, before allowing the kernel to handle the > condition. > > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the oom > killer. If the oom condition persists for this number of milliseconds, > the oom killer will be called the next time the memory controller > attempts to charge a page (and memory.oom_control is set to 0). This > allows userspace to have a short period of time to respond to the > condition before deferring to the kernel to kill a task. > > Admins may set the oom killer delay using the new interface: > > # echo 60000 > memory.oom_delay_millisecs > > This will defer oom killing to the kernel only after 60 seconds has > elapsed by putting the task to sleep for 60 seconds. How often is that delay actually useful, in the real world? IOW, in what proportion of cases does the system just remain stuck for 60 seconds and then get an oom-killing?
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Michal Hocko wrote: > > We allow users to control their own memcgs by chowning them, so they must > > be run in the same hierarchy if they want to run their own userspace oom > > handler. There's nothing in the kernel that prevents that and the user > > has no other option but to run in a parent cgroup. > > If the access to the oom_control file is controlled by the file > permissions then the oom handler can live inside root cgroup. Why do you > need "must be in the same hierarchy" requirement? > Users obviously don't have the ability to attach processes to the root memcg. They are constrained to their own subtree of memcgs. > > It's too easy to simply do even a "ps ax" in an oom memcg and make that > > thread completely unresponsive because it allocates memory. > > Yes, but we are talking about oom handler and that one has to be really > careful about what it does. So doing something that simply allocates is > dangerous. > Show me a userspace oom handler that doesn't get notified of every fork() in a memcg, causing a performance degradation of its own for a complete and utter slowpath, that will know the entire process tree of its own memcg or a child memcg. This discussion is all fine and good from a theoretical point of view until you actually have to implement one of these yourself. Multiple users are going to be running their own oom notifiers and without some sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too easily deadlock looking for memory. If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point and requires admin intervention or a reboot. > > Then perhaps I'm raising constraints that you've never worked with, I > > don't know. 
We choose to have a priority-based approach that is inherited > by children; this priority is kept in userspace and the oom handler > would naturally need to know the set of tasks in the oom memcg at the time > of oom and their parent-child relationship. These priorities are > completely independent of memory usage. > > OK, but both reading tasks file and readdir should be doable without > userspace (aka charged) allocations. Moreover if you run those oom > handlers under the root cgroup then it should be even easier. Then why does "cat tasks" stall when my memcg is totally depleted of all memory? This isn't even the argument because memory.oom_delay_millisecs isn't going to help that situation. I'm talking about a failsafe that ensures a memcg can't deadlock. The global oom killer will always have to exist in the kernel, at least in the most simplistic of forms, solely for this reason; you can't move all of the logic to userspace and expect it to react 100% of the time.
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 03:22:59, David Rientjes wrote: > On Fri, 31 May 2013, Michal Hocko wrote: > > > I have always discouraged people from running oom handler in the same > > memcg (or even in the same hierarchy). > > > > We allow users to control their own memcgs by chowning them, so they must > be run in the same hierarchy if they want to run their own userspace oom > handler. There's nothing in the kernel that prevents that and the user > has no other option but to run in a parent cgroup. If the access to the oom_control file is controlled by the file permissions then the oom handler can live inside root cgroup. Why do you need "must be in the same hierarchy" requirement? > > Yes, mmap_sem is tricky. Nothing in the proc code should take it for > > writing and charges are done with mmap_sem held for reading but that > > doesn't prevent from non-oom thread to try to get it for writing which > > would block also all future readers. We have also seen i_mutex being > > held during charge so you have to be careful about that one as well but > > I am not aware of other locks that could be a problem. > > > > The question is, do you really need to open any /proc// files which > > depend on mmap_sem (e.g. maps, smaps). /proc//status should tell you > > about used memory. Or put it another way around. What kind of data you > > need for your OOM handler? > > > > It's too easy to simply do even a "ps ax" in an oom memcg and make that > thread completely unresponsive because it allocates memory. Yes, but we are talking about oom handler and that one has to be really careful about what it does. So doing something that simply allocates is dangerous. > > I might be thinking about different use cases but user space OOM > > handlers I have seen so far had quite a good idea what is going on > > in the group and who to kill. > > Then perhaps I'm raising constraints that you've never worked with, I > don't know. 
We choose to have a priority-based approach that is inherited > by children; this priority is kept in userspace and the oom handler > would naturally need to know the set of tasks in the oom memcg at the time > of oom and their parent-child relationship. These priorities are > completely independent of memory usage. OK, but both reading tasks file and readdir should be doable without userspace (aka charged) allocations. Moreover if you run those oom handlers under the root cgroup then it should be even easier. -- Michal Hocko SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 03:22:59, David Rientjes wrote: > On Fri, 31 May 2013, Michal Hocko wrote: [...] > > > If the oom notifier is in the oom cgroup, it may not be able to > > > successfully read the memcg "tasks" file to even determine the set of > > > eligible processes. > > > > It would have to use preallocated buffer and have mlocked all the memory > > that will be used during oom event. > > > > Wrong, the kernel itself allocates memory when reading this information > and that would fail in an oom memcg. But that memory is not charged to a memcg, is it? So unless you are heading towards global OOM you should be safe. -- Michal Hocko SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Michal Hocko wrote: > I have always discouraged people from running oom handler in the same > memcg (or even in the same hierarchy). > We allow users to control their own memcgs by chowning them, so they must be run in the same hierarchy if they want to run their own userspace oom handler. There's nothing in the kernel that prevents that and the user has no other option but to run in a parent cgroup. > Yes, mmap_sem is tricky. Nothing in the proc code should take it for > writing and charges are done with mmap_sem held for reading but that > doesn't prevent from non-oom thread to try to get it for writing which > would block also all future readers. We have also seen i_mutex being > held during charge so you have to be careful about that one as well but > I am not aware of other locks that could be a problem. > > The question is, do you really need to open any /proc/<pid>/ files which > depend on mmap_sem (e.g. maps, smaps). /proc/<pid>/status should tell you > about used memory. Or put it another way around. What kind of data you > need for your OOM handler? > It's too easy to simply do even a "ps ax" in an oom memcg and make that thread completely unresponsive because it allocates memory. > I might be thinking about different use cases but user space OOM > handlers I have seen so far had quite a good idea what is going on > in the group and who to kill. Then perhaps I'm raising constraints that you've never worked with, I don't know. We choose to have a priority-based approach that is inherited by children; this priority is kept in userspace and the oom handler would naturally need to know the set of tasks in the oom memcg at the time of oom and their parent-child relationship. These priorities are completely independent of memory usage. > > If the oom notifier is in the oom cgroup, it may not be able to > > successfully read the memcg "tasks" file to even determine the set of > > eligible processes. 
> > It would have to use preallocated buffer and have mlocked all the memory > that will be used during oom event. > Wrong, the kernel itself allocates memory when reading this information and that would fail in an oom memcg.
Re: [patch] mm, memcg: add oom killer delay
On Thu 30-05-13 13:47:30, David Rientjes wrote: > On Thu, 30 May 2013, Michal Hocko wrote: > > > > Completely disabling the oom killer for a memcg is problematic if > > > userspace is unable to address the condition itself, usually because it > > > is unresponsive. > > > > Isn't this a bug in the userspace oom handler? Why is it unresponsive? It > > shouldn't allocate any memory so nothing should prevent it from running (if > > other tasks are preempting it permanently then the priority of the handler > > should be increased). > > > > Unresponsiveness isn't necessarily only because of memory constraints, you > may have your oom notifier in a parent cgroup that isn't oom. I have always discouraged people from running oom handler in the same memcg (or even in the same hierarchy). > If a process is stuck on mm->mmap_sem in the oom cgroup, though, the > oom notifier may not be able to scrape /proc/pid and attain necessary > information in making an oom kill decision. Yes, mmap_sem is tricky. Nothing in the proc code should take it for writing and charges are done with mmap_sem held for reading but that doesn't prevent from non-oom thread to try to get it for writing which would block also all future readers. We have also seen i_mutex being held during charge so you have to be careful about that one as well but I am not aware of other locks that could be a problem. The question is, do you really need to open any /proc/<pid>/ files which depend on mmap_sem (e.g. maps, smaps). /proc/<pid>/status should tell you about used memory. Or put it another way around. What kind of data you need for your OOM handler? I might be thinking about different use cases but user space OOM handlers I have seen so far had quite a good idea what is going on in the group and who to kill. 
So it was more a decision based on the workload and its semantic rather than based on the used memory (which is done quite sensibly with the in kernel handler already) or something that would trip over mmap_sem when trying to get information. > If the oom notifier is in the oom cgroup, it may not be able to > successfully read the memcg "tasks" file to even determine the set of > eligible processes. It would have to use preallocated buffer and have mlocked all the memory that will be used during oom event. > There is also no guarantee that the userspace oom handler will have > the necessary memory to even re-enable the oom killer in the memcg > under oom which would allow the kernel to make forward progress. Why it wouldn't have enough memory to write to the file? Assuming that the oom handler has the file handle (for limit_in_bytes) open, it has mlocked all the necessary memory (code + buffers) then I do not see what would prevent it from writing it to limit_in_bytes. > We've used this for a few years as a complement to oom notifiers so that a > process would have a grace period to deal with the oom condition itself > before allowing the kernel to terminate a process and free memory. We've > simply had no alternative in the presence of kernel constraints that > prevent it from being done in any other way. We _want_ userspace to deal > with the issue but when it cannot collect the necessary information (and > we're not tracing every fork() that every process in a potentially oom > memcg does) to deal with the condition, we want the kernel to step in > instead of relying on an admin to login or a global oom condition. > > If you'd like to debate this issue, I'd be more than happy to do so and > show why this patch is absolutely necessary for inclusion, but I'd ask > that you'd present the code from your userspace oom handler so I can > understand how it works without needing such backup support. 
I usually do not write those things myself as I am supporting others so I do not have any code handy right now. But I can come up with a simple handler which implements your timeout based killer for peak workloads you have mentioned earlier. That one should be quite easy to do. -- Michal Hocko SUSE Labs
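[ Editorial sketch: the "preallocated, mlocked buffer" approach Michal describes above can be outlined as below. Setup runs before any OOM; at oom time the handler only reuses already-locked memory, so reading the tasks file never charges against the OOMing memcg. The tasks path and buffer size are illustrative assumptions. ]

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define TASKS_BUF_SIZE (16 * 4096)	/* "a few (dozens of) pages" */

static char tasks_buf[TASKS_BUF_SIZE];

/* Parse newline-separated pids out of a preallocated buffer; no heap
 * allocation happens here, so it is safe to run while the memcg is OOM. */
int parse_pids(const char *buf, size_t len, int *pids, int max)
{
	int count = 0, pid = 0, have = 0;
	size_t i;

	for (i = 0; i < len && count < max; i++) {
		if (buf[i] >= '0' && buf[i] <= '9') {
			pid = pid * 10 + (buf[i] - '0');
			have = 1;
		} else if (have) {
			pids[count++] = pid;
			pid = 0;
			have = 0;
		}
	}
	if (have && count < max)
		pids[count++] = pid;
	return count;
}

/* Setup, done at handler start: lock current and future pages so the
 * handler never faults under OOM, and keep the tasks file open so the
 * read at oom time needs no path lookup. */
int setup_oom_reader(const char *tasks_path)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		return -1;
	return open(tasks_path, O_RDONLY);	/* e.g. <memcg>/tasks */
}

/* At oom time: reread the whole file into the locked static buffer. */
ssize_t read_tasks(int fd)
{
	return pread(fd, tasks_buf, sizeof(tasks_buf) - 1, 0);
}
```

With thousands of pids at up to 8 bytes each ("1234567\n"), the whole file fits in a handful of pages, which matches the buffer-size estimate made earlier in the thread.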
Re: [patch] mm, memcg: add oom killer delay
On Thu 30-05-13 13:47:30, David Rientjes wrote: On Thu, 30 May 2013, Michal Hocko wrote: Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because it is unresponsive. Isn't this a bug in the userspace oom handler? Why is it unresponsive? It shouldn't allocated any memory so nothing should prevent it from running (if other tasks are preempting it permanently then the priority of the handler should be increased). Unresponsiveness isn't necessarily only because of memory constraints, you may have your oom notifier in a parent cgroup that isn't oom. I have always discouraged people from running oom handler in the same memcg (or even in the same hierarchy). If a process is stuck on mm-mmap_sem in the oom cgroup, though, the oom notifier may not be able to scrape /proc/pid and attain necessary information in making an oom kill decision. Yes, mmap_sem is tricky. Nothing in the proc code should take it for writing and charges are done with mmap_sem held for reading but that doesn't prevent from non-oom thread to try to get it for writing which would block also all future readers. We have also seen i_mutex being held during charge so you have to be careful about that one as well but I am not aware of other locks that could be a problem. The question is, do you really need to open any /proc/pid/ files which depend on mmap_sem (e.g. maps, smaps). /proc/pid/status should tell you about used memory. Or put it another way around. What kind of data you need for your OOM handler? I might be thinking about different use cases but user space OOM handlers I have seen so far had quite a good idea what is going on in the group and who to kill. So it was more a decision based on the workload and its semantic rather than based on the used memory (which is done quite sensibly with the in kernel handler already) or something that would trip over mmap_sem when trying to get information. 
> If the oom notifier is in the oom cgroup, it may not be able to
> successfully read the memcg "tasks" file to even determine the set of
> eligible processes.

It would have to use a preallocated buffer and have mlocked all the
memory that will be used during the oom event.

> There is also no guarantee that the userspace oom handler will have
> the necessary memory to even re-enable the oom killer in the memcg
> under oom which would allow the kernel to make forward progress.

Why wouldn't it have enough memory to write to the file? Assuming that
the oom handler has the file handle (for limit_in_bytes) open and has
mlocked all the necessary memory (code + buffers), then I do not see
what would prevent it from writing to limit_in_bytes.

> We've used this for a few years as a complement to oom notifiers so
> that a process would have a grace period to deal with the oom
> condition itself before allowing the kernel to terminate a process
> and free memory. We've simply had no alternative in the presence of
> kernel constraints that prevent it from being done in any other way.
> We _want_ userspace to deal with the issue but when it cannot collect
> the necessary information (and we're not tracing every fork() that
> every process in a potentially oom memcg does) to deal with the
> condition, we want the kernel to step in instead of relying on an
> admin to login or a global oom condition.
>
> If you'd like to debate this issue, I'd be more than happy to do so
> and show why this patch is absolutely necessary for inclusion, but
> I'd ask that you present the code from your userspace oom handler so
> I can understand how it works without needing such backup support.

I usually do not write those things myself as I am supporting others,
so I do not have any code handy right now. But I can come up with a
simple handler which implements your timeout-based killer for the peak
workloads you have mentioned earlier. That one should be quite easy to
do.
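(Editorial aside: the timeout-based handler described here can be sketched in a few lines. This is a hypothetical illustration, not code from the thread: the callback names are invented, and any readable file descriptor stands in for the eventfd that a real handler would register for memory.oom_control through cgroup.event_control.)

```python
import os
import select
import time

def handle_oom_events(notify_fd, resolve, reenable_kernel_oom,
                      timeout_s, max_events=None):
    # notify_fd: in a real handler, an eventfd registered for
    # memory.oom_control via cgroup.event_control; any readable fd
    # works for this sketch.
    # resolve: userspace policy callback (raise the limit, kill by
    # priority, ...); returns True once the condition is handled.
    # reenable_kernel_oom: fallback run when userspace does not succeed
    # within timeout_s; for memcg oom_control this would write "0" to
    # memory.oom_control so the kernel oom killer takes over.
    handled = 0
    while max_events is None or handled < max_events:
        select.select([notify_fd], [], [])  # block until a notification
        os.read(notify_fd, 8)               # consume the 8-byte counter
        deadline = time.monotonic() + timeout_s
        resolved = False
        while time.monotonic() < deadline:
            if resolve():
                resolved = True
                break
            time.sleep(0.01)                # poll the policy callback
        if not resolved:
            reenable_kernel_oom()
        handled += 1
    return handled
```

The handler itself must never charge memory to the oom memcg, so in practice the loop body would run from mlocked memory with all file handles pre-opened.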
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Michal Hocko wrote:
> I have always discouraged people from running the oom handler in the
> same memcg (or even in the same hierarchy).

We allow users to control their own memcgs by chowning them, so they
must be run in the same hierarchy if they want to run their own
userspace oom handler. There's nothing in the kernel that prevents that
and the user has no other option but to run in a parent cgroup.

> Yes, mmap_sem is tricky. Nothing in the proc code should take it for
> writing and charges are done with mmap_sem held for reading but that
> doesn't prevent a non-oom thread from trying to get it for writing
> which would block also all future readers. We have also seen i_mutex
> being held during charge so you have to be careful about that one as
> well but I am not aware of other locks that could be a problem. The
> question is, do you really need to open any /proc/pid/ files which
> depend on mmap_sem (e.g. maps, smaps)? /proc/pid/status should tell
> you about used memory. Or put it another way around: what kind of
> data do you need for your OOM handler?

It's too easy to simply do even a "ps ax" in an oom memcg and make that
thread completely unresponsive because it allocates memory.

> I might be thinking about different use cases but the user space OOM
> handlers I have seen so far had quite a good idea what is going on in
> the group and whom to kill.

Then perhaps I'm raising constraints that you've never worked with, I
don't know. We choose to have a priority-based approach that is
inherited by children; this priority is kept in userspace and the oom
handler would naturally need to know the set of tasks in the oom memcg
at the time of oom and their parent-child relationship. These
priorities are completely independent of memory usage.

> > If the oom notifier is in the oom cgroup, it may not be able to
> > successfully read the memcg "tasks" file to even determine the set
> > of eligible processes.
> It would have to use a preallocated buffer and have mlocked all the
> memory that will be used during the oom event.

Wrong, the kernel itself allocates memory when reading this information
and that would fail in an oom memcg.
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 03:22:59, David Rientjes wrote:
> On Fri, 31 May 2013, Michal Hocko wrote:
[...]
> > > If the oom notifier is in the oom cgroup, it may not be able to
> > > successfully read the memcg "tasks" file to even determine the
> > > set of eligible processes.
> >
> > It would have to use a preallocated buffer and have mlocked all the
> > memory that will be used during the oom event.
>
> Wrong, the kernel itself allocates memory when reading this
> information and that would fail in an oom memcg.

But that memory is not charged to a memcg, is it? So unless you are
heading towards a global OOM you should be safe.
--
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 03:22:59, David Rientjes wrote:
> On Fri, 31 May 2013, Michal Hocko wrote:
> > I have always discouraged people from running the oom handler in
> > the same memcg (or even in the same hierarchy).
>
> We allow users to control their own memcgs by chowning them, so they
> must be run in the same hierarchy if they want to run their own
> userspace oom handler. There's nothing in the kernel that prevents
> that and the user has no other option but to run in a parent cgroup.

If access to the oom_control file is controlled by the file permissions
then the oom handler can live inside the root cgroup. Why do you need
the "must be in the same hierarchy" requirement?

> > Yes, mmap_sem is tricky. Nothing in the proc code should take it
> > for writing and charges are done with mmap_sem held for reading but
> > that doesn't prevent a non-oom thread from trying to get it for
> > writing which would block also all future readers. We have also
> > seen i_mutex being held during charge so you have to be careful
> > about that one as well but I am not aware of other locks that could
> > be a problem. The question is, do you really need to open any
> > /proc/pid/ files which depend on mmap_sem (e.g. maps, smaps)?
> > /proc/pid/status should tell you about used memory. Or put it
> > another way around: what kind of data do you need for your OOM
> > handler?
>
> It's too easy to simply do even a "ps ax" in an oom memcg and make
> that thread completely unresponsive because it allocates memory.

Yes, but we are talking about an oom handler, and that one has to be
really careful about what it does. So doing something that simply
allocates is dangerous.

> > I might be thinking about different use cases but the user space
> > OOM handlers I have seen so far had quite a good idea what is going
> > on in the group and whom to kill.
>
> Then perhaps I'm raising constraints that you've never worked with, I
> don't know.
> We choose to have a priority-based approach that is inherited by
> children; this priority is kept in userspace and the oom handler
> would naturally need to know the set of tasks in the oom memcg at the
> time of oom and their parent-child relationship. These priorities are
> completely independent of memory usage.

OK, but both reading the tasks file and readdir should be doable
without userspace (aka charged) allocations. Moreover, if you run those
oom handlers under the root cgroup then it should be even easier.
--
Michal Hocko
SUSE Labs
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Michal Hocko wrote:
> > We allow users to control their own memcgs by chowning them, so
> > they must be run in the same hierarchy if they want to run their
> > own userspace oom handler. There's nothing in the kernel that
> > prevents that and the user has no other option but to run in a
> > parent cgroup.
>
> If access to the oom_control file is controlled by the file
> permissions then the oom handler can live inside the root cgroup. Why
> do you need the "must be in the same hierarchy" requirement?

Users obviously don't have the ability to attach processes to the root
memcg. They are constrained to their own subtree of memcgs.

> > It's too easy to simply do even a "ps ax" in an oom memcg and make
> > that thread completely unresponsive because it allocates memory.
>
> Yes, but we are talking about an oom handler and that one has to be
> really careful about what it does. So doing something that simply
> allocates is dangerous.

Show me a userspace oom handler that doesn't get notified of every
fork() in a memcg, causing a performance degradation of its own for a
complete and utter slowpath, that will know the entire process tree of
its own memcg or a child memcg. This discussion is all fine and good
from a theoretical point of view until you actually have to implement
one of these yourself. Multiple users are going to be running their own
oom notifiers and without some sort of failsafe, such as
memory.oom_delay_millisecs, a memcg can too easily deadlock looking for
memory. If that user is constrained to his or her own subtree, as
previously stated, there's also no way to login and rectify the
situation at that point, and it requires admin intervention or a
reboot.

> > Then perhaps I'm raising constraints that you've never worked with,
> > I don't know. We choose to have a priority-based approach that is
> > inherited by children; this priority is kept in userspace and the
> > oom handler would naturally need to know the set of tasks in the
> > oom memcg at the time of oom and their parent-child relationship.
> > These priorities are completely independent of memory usage.
>
> OK, but both reading the tasks file and readdir should be doable
> without userspace (aka charged) allocations. Moreover, if you run
> those oom handlers under the root cgroup then it should be even
> easier.

Then why does "cat tasks" stall when my memcg is totally depleted of
all memory? This isn't even the argument, because
memory.oom_delay_millisecs isn't going to help that situation. I'm
talking about a failsafe that ensures a memcg can't deadlock. The
global oom killer will always have to exist in the kernel, at least in
the most simplistic of forms, solely for this reason; you can't move
all of the logic to userspace and expect it to react 100% of the time.
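(Editorial aside: the inherited-priority scheme David describes can be made concrete with a toy model. Names, the default priority, and the "highest priority is killed first" convention are invented here purely to illustrate the semantics, not taken from the actual handler.)

```python
def pick_victim(tasks, priorities):
    # tasks:      {pid: parent_pid or None for a top-level task},
    #             i.e. the parent-child relationships the handler
    #             gathers at oom time.
    # priorities: {pid: priority} set by userspace policy for some
    #             pids; a task without its own entry inherits the
    #             nearest ancestor's priority.
    # Returns the pid with the highest effective priority (the most
    # eligible kill candidate), independent of memory usage.
    def effective(pid):
        while pid is not None:
            if pid in priorities:
                return priorities[pid]
            pid = tasks.get(pid)       # walk up to the parent
        return 0                       # default when no ancestor sets one
    return max(tasks, key=effective)
```

This is exactly the kind of decision that needs the full task set and tree at oom time, which is why a stalled tasks file breaks the handler.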
Re: [patch] mm, memcg: add oom killer delay
On Wed, 29 May 2013 18:18:10 -0700 (PDT) David Rientjes
<rient...@google.com> wrote:
> Completely disabling the oom killer for a memcg is problematic if
> userspace is unable to address the condition itself, usually because
> it is unresponsive. This scenario creates a memcg deadlock: tasks are
> sitting in TASK_KILLABLE waiting for the limit to be increased, a
> task to exit or move, or the oom killer to be reenabled, and
> userspace is unable to do so.
>
> An additional possible use case is to defer oom killing within a
> memcg for a set period of time, probably to prevent unnecessary kills
> due to temporary memory spikes, before allowing the kernel to handle
> the condition.
>
> This patch adds an oom killer delay so that a memcg may be configured
> to wait at least a pre-defined number of milliseconds before calling
> the oom killer. If the oom condition persists for this number of
> milliseconds, the oom killer will be called the next time the memory
> controller attempts to charge a page (and memory.oom_control is set
> to 0). This allows userspace to have a short period of time to
> respond to the condition before deferring to the kernel to kill a
> task.
>
> Admins may set the oom killer delay using the new interface:
>
> 	# echo 60000 > memory.oom_delay_millisecs
>
> This will defer oom killing to the kernel only after 60 seconds has
> elapsed by putting the task to sleep for 60 seconds.

How often is that delay actually useful, in the real world? IOW, in
what proportion of cases does the system just remain stuck for 60
seconds and then get an oom-killing?
Re: [patch] mm, memcg: add oom killer delay
On Thu, 30 May 2013, Michal Hocko wrote:
> > Completely disabling the oom killer for a memcg is problematic if
> > userspace is unable to address the condition itself, usually
> > because it is unresponsive.
>
> Isn't this a bug in the userspace oom handler? Why is it
> unresponsive? It shouldn't allocate any memory so nothing should
> prevent it from running (if other tasks are preempting it permanently
> then the priority of the handler should be increased).

Unresponsiveness isn't necessarily only because of memory constraints,
you may have your oom notifier in a parent cgroup that isn't oom. If a
process is stuck on mm->mmap_sem in the oom cgroup, though, the oom
notifier may not be able to scrape /proc/pid and attain necessary
information in making an oom kill decision. If the oom notifier is in
the oom cgroup, it may not be able to successfully read the memcg
"tasks" file to even determine the set of eligible processes. There is
also no guarantee that the userspace oom handler will have the
necessary memory to even re-enable the oom killer in the memcg under
oom which would allow the kernel to make forward progress.

We've used this for a few years as a complement to oom notifiers so
that a process would have a grace period to deal with the oom condition
itself before allowing the kernel to terminate a process and free
memory. We've simply had no alternative in the presence of kernel
constraints that prevent it from being done in any other way. We _want_
userspace to deal with the issue but when it cannot collect the
necessary information (and we're not tracing every fork() that every
process in a potentially oom memcg does) to deal with the condition, we
want the kernel to step in instead of relying on an admin to login or a
global oom condition.
If you'd like to debate this issue, I'd be more than happy to do so and
show why this patch is absolutely necessary for inclusion, but I'd ask
that you present the code from your userspace oom handler so I can
understand how it works without needing such backup support.
Re: [patch] mm, memcg: add oom killer delay
On Wed 29-05-13 18:18:10, David Rientjes wrote:
> Completely disabling the oom killer for a memcg is problematic if
> userspace is unable to address the condition itself, usually because
> it is unresponsive.

Isn't this a bug in the userspace oom handler? Why is it unresponsive?
It shouldn't allocate any memory so nothing should prevent it from
running (if other tasks are preempting it permanently then the priority
of the handler should be increased).

> This scenario creates a memcg deadlock: tasks are sitting in
> TASK_KILLABLE waiting for the limit to be increased, a task to exit
> or move, or the oom killer to be reenabled, and userspace is unable
> to do so.
>
> An additional possible use case is to defer oom killing within a
> memcg for a set period of time, probably to prevent unnecessary kills
> due to temporary memory spikes, before allowing the kernel to handle
> the condition.

I am not sure I like the idea. How does an admin decide what is the
right value of the timeout? And why can't he use a userspace oom
handler to do the same thing?

[...]
--
Michal Hocko
SUSE Labs