Re: [patch] mm, memcg: add oom killer delay
On Fri 14-06-13 03:12:52, David Rientjes wrote: > On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote: > > > Reading your discussion, I think I understand your requirements. > > The problem is that I can't think you took into all options into > > accounts and found the best way is this new oom_delay. IOW, I can't > > convice oom-delay is the best way to handle your issue. > > > > Ok, let's talk about it. > > > Your requeirement is > > - Allowing userland oom-handler within local memcg. > > > > Another requirement: > > - Allow userland oom handler for global oom conditions. > > Hopefully that's hooked into memcg because the functionality is already > there, we can simply duplicate all of the oom functionality that we'll be > adding for the root memcg. I understand this and that's why I mentioned enabling oom_control for root memcg at LSF. I liked the idea at the time but the more I think about it the more I am scared about all the consequences. Maybe my fear is not justified as in-kernel (scripting or module registering its own handler) would be fragile in a similar ways. I would like to see this being discussed before we go enable memcg way. I _do_ agree that we really _want_ a way to provide a customer oom handler. > > Considering straightforward, the answer should be > > - Allowing oom-handler daemon out of memcg's control by its limit. > >(For example, a flag/capability for a task can archive this.) > >Or attaching some *fixed* resource to the task rather than cgroup. > > > >Allow to set task->secret_saving=20M. > > > > Exactly! I thought your users are untrusted so you cannot give them any additional reserves. Or is this only about the root cgroup? > First of all, thanks very much for taking an interest in our usecase and > discussing it with us. > > I didn't propose what I referred to earlier in the thread as "memcg > reserves" because I thought it was going to be a more difficult battle. > The fact that you brought it up first actually makes me think it's less > insane :) reserves for oom handler doesn't sound like a bad idea but it assumes that the handler is trusted. > We do indeed want memcg reserves and I have patches to add it if you'd > like to see that first. It ensures that this userspace oom handler can > actually do some work in determining which process to kill. The reserve > is a fraction of true memory reserves (the space below the per-zone min > watermarks) which is dependent on min_free_kbytes. This does indeed > become more difficult with true and complete kmem charging. That "work" > could be opening the tasks file (which allocates the pidlist within the > kernel), I am not familiar with details why it has to allocate memory but I guess we need to read tasks file without any in-kernel allocations. > checking /proc/pid/status for rss, This doesn't require in kernel allocation AFAICS. > checking for how long a process has been running, checking for tid, > sending a signal to drop caches, etc. > > We'd also like to do this for global oom conditions, which makes it even > more interesting. I was thinking of using a fraction of memory reserves > as the oom killer currently does (that memory below the min watermark) for > these purposes. > > Memory charging is simply bypassed for these oom handlers (we only grant > access to those waiting on the memory.oom_control eventfd) up to > memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think > this is entirely insane because these oom handlers should lead to future > memory freeing, just like TIF_MEMDIE processes. > > > Going back to your patch, what's confusing is your approach. > > Why the problem caused by the amount of memory should be solved by > > some dealy, i.e. the amount of time ? > > > > This exchanging sounds confusing to me. > > > > Even with all of the above (which is not actually that invasive of a > patch), I still think we need memory.oom_delay_millisecs. I am still not convinced. I have already mentioned that this can be handled from userspace. A simple watchdog which sits on oom_control eventfd which triggers timer (and reschedule it on a new event while there is one scheduled already) which reads oom_control to check under_oom before it enables in-kernel oom handling again doesn't need any memory allocation so there is nothing to prevent it from resurrecting the system. So I really do not see a reason for a new knob. > I probably made a mistake in describing what that is addressing if it > seems like it's trying to address any of the above. > > If a userspace oom handler fails to respond even with access to those > "memcg reserves", the kernel needs to kill within that memcg. Do we do > that above a set time period (this patch) or when the reserves are > completely exhausted? That's debatable, but if we are to allow it for > global oom conditions as well then my opinion was to make it as safe as > possible; today, we can't disable
Re: [patch] mm, memcg: add oom killer delay
On Fri 14-06-13 03:12:52, David Rientjes wrote: On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote: Reading your discussion, I think I understand your requirements. The problem is that I can't think you took into all options into accounts and found the best way is this new oom_delay. IOW, I can't convice oom-delay is the best way to handle your issue. Ok, let's talk about it. Your requeirement is - Allowing userland oom-handler within local memcg. Another requirement: - Allow userland oom handler for global oom conditions. Hopefully that's hooked into memcg because the functionality is already there, we can simply duplicate all of the oom functionality that we'll be adding for the root memcg. I understand this and that's why I mentioned enabling oom_control for root memcg at LSF. I liked the idea at the time but the more I think about it the more I am scared about all the consequences. Maybe my fear is not justified as in-kernel (scripting or module registering its own handler) would be fragile in a similar ways. I would like to see this being discussed before we go enable memcg way. I _do_ agree that we really _want_ a way to provide a customer oom handler. Considering straightforward, the answer should be - Allowing oom-handler daemon out of memcg's control by its limit. (For example, a flag/capability for a task can archive this.) Or attaching some *fixed* resource to the task rather than cgroup. Allow to set task-secret_saving=20M. Exactly! I thought your users are untrusted so you cannot give them any additional reserves. Or is this only about the root cgroup? First of all, thanks very much for taking an interest in our usecase and discussing it with us. I didn't propose what I referred to earlier in the thread as memcg reserves because I thought it was going to be a more difficult battle. The fact that you brought it up first actually makes me think it's less insane :) reserves for oom handler doesn't sound like a bad idea but it assumes that the handler is trusted. We do indeed want memcg reserves and I have patches to add it if you'd like to see that first. It ensures that this userspace oom handler can actually do some work in determining which process to kill. The reserve is a fraction of true memory reserves (the space below the per-zone min watermarks) which is dependent on min_free_kbytes. This does indeed become more difficult with true and complete kmem charging. That work could be opening the tasks file (which allocates the pidlist within the kernel), I am not familiar with details why it has to allocate memory but I guess we need to read tasks file without any in-kernel allocations. checking /proc/pid/status for rss, This doesn't require in kernel allocation AFAICS. checking for how long a process has been running, checking for tid, sending a signal to drop caches, etc. We'd also like to do this for global oom conditions, which makes it even more interesting. I was thinking of using a fraction of memory reserves as the oom killer currently does (that memory below the min watermark) for these purposes. Memory charging is simply bypassed for these oom handlers (we only grant access to those waiting on the memory.oom_control eventfd) up to memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think this is entirely insane because these oom handlers should lead to future memory freeing, just like TIF_MEMDIE processes. Going back to your patch, what's confusing is your approach. Why the problem caused by the amount of memory should be solved by some dealy, i.e. the amount of time ? This exchanging sounds confusing to me. Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I am still not convinced. I have already mentioned that this can be handled from userspace. A simple watchdog which sits on oom_control eventfd which triggers timer (and reschedule it on a new event while there is one scheduled already) which reads oom_control to check under_oom before it enables in-kernel oom handling again doesn't need any memory allocation so there is nothing to prevent it from resurrecting the system. So I really do not see a reason for a new knob. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those memcg reserves, the kernel needs to kill within that memcg. Do we do that above a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. I think
Re: [patch] mm, memcg: add oom killer delay
On Tue, 25 Jun 2013, Kamezawa Hiroyuki wrote: > Considering only memcg, bypassing all charge-limit-check will work. > But as you say, that will not work against global-oom. I think it will since we have per-zone memory reserves that can be bypassed in the page allocator, not to the level of PF_MEMALLOC or TIF_MEMDIE but perhaps to the min watermark / 4, for example. A userspace global oom handler will obviously be mlocked in memory and this reserve is used only for true kmem accounting so that reading things like the memcg tasks file or reading /proc/pid/stat works, or dynamically allocate a buffer to store data to iterate over. This is why memory.oom_delay_millisecs is crucial: we want the same functionality that the "user root" has for global oom conditions at the memcg root level and in case reserves are exhausted that the kernel will kill something (which should be rare, but possible) and use the rest of memory reserves to allow to exit. > > Even with all of the above (which is not actually that invasive of a > > patch), I still think we need memory.oom_delay_millisecs. I probably made > > a mistake in describing what that is addressing if it seems like it's > > trying to address any of the above. > > > > If a userspace oom handler fails to respond even with access to those > > "memcg reserves", > > How this happens ? > If the memcg reserves are exhausted, then the kernel needs to kill something even in global oom conditions (consider a "user root" memcg tree to be the same as a global oom condition for processes attached to that tree) since otherwise the machine hangs. There's no guarantee that some root process sitting in the root memcg would be able to enforce this delay as Michal suggests since reserves could be depleted. It's important we don't do something as extreme as PF_MEMALLOC so all per-zone reserves are depleted so that the kernel can still intervene and kill something when userspace is unresponsive. > Someone may be against that kind of control and say "Hey, I have better idea". > That was another reason that oom-scirpiting was discussed. No one can > implement > general-purpose-victim-selection-logic. > Completely agreed, and our experience shows that users who manipulate their own "user root" memcgs have their own logic, this is why we're trying to make userspace oom handling as powerful as possible without risking making the machine unresponsive. > IMHO, it will be difficult but allowing to write script/filter for oom-killing > will be worth to try. like.. > > == > for_each_process : > if comm == mem_manage_daemon : > continue > if user == root : > continue > score = default_calc_score() > if score > high_score : > selected = current > == > This is effectively what already happens with the oom delay as proposed here, the userspace oom handler is given access to "memcg reserves" and a period of time to respond; if that fails, then the kernel will kill something the next time we try to charge to the memcg. > BTW, if you love the logic in the userland oom daemon, why you can't implement > it in the kernel ? Does that do some pretty things other than sending SIGKILL > ? > Some do "pretty" things like collect stats and dump it before killing something, but we also want oom handlers that don't do SIGKILL at all. An example: we statically allocate hugepages at boot because we need a large percentage of memory to be backed by hugepages for a certain class of applications and it's only available at boot. We also have a userspace that runs on these machines that is shared between hugepage machines and non-hugepage machines. At times, this userspace becomes oom because the remainder of available memory is allocated as hugetlb pages when in reality they are unmapped and sitting in a free pool. In that case, our userspace oom handler wants to free those hugetlb pages back to the kernel down to a certain watermark and then opportunistically reallocate them to the pool when memory usage on the system is lower due to spikes in the userspace. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue, 25 Jun 2013, Kamezawa Hiroyuki wrote: Considering only memcg, bypassing all charge-limit-check will work. But as you say, that will not work against global-oom. I think it will since we have per-zone memory reserves that can be bypassed in the page allocator, not to the level of PF_MEMALLOC or TIF_MEMDIE but perhaps to the min watermark / 4, for example. A userspace global oom handler will obviously be mlocked in memory and this reserve is used only for true kmem accounting so that reading things like the memcg tasks file or reading /proc/pid/stat works, or dynamically allocate a buffer to store data to iterate over. This is why memory.oom_delay_millisecs is crucial: we want the same functionality that the user root has for global oom conditions at the memcg root level and in case reserves are exhausted that the kernel will kill something (which should be rare, but possible) and use the rest of memory reserves to allow to exit. Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those memcg reserves, How this happens ? If the memcg reserves are exhausted, then the kernel needs to kill something even in global oom conditions (consider a user root memcg tree to be the same as a global oom condition for processes attached to that tree) since otherwise the machine hangs. There's no guarantee that some root process sitting in the root memcg would be able to enforce this delay as Michal suggests since reserves could be depleted. It's important we don't do something as extreme as PF_MEMALLOC so all per-zone reserves are depleted so that the kernel can still intervene and kill something when userspace is unresponsive. Someone may be against that kind of control and say Hey, I have better idea. That was another reason that oom-scirpiting was discussed. No one can implement general-purpose-victim-selection-logic. Completely agreed, and our experience shows that users who manipulate their own user root memcgs have their own logic, this is why we're trying to make userspace oom handling as powerful as possible without risking making the machine unresponsive. IMHO, it will be difficult but allowing to write script/filter for oom-killing will be worth to try. like.. == for_each_process : if comm == mem_manage_daemon : continue if user == root : continue score = default_calc_score() if score high_score : selected = current == This is effectively what already happens with the oom delay as proposed here, the userspace oom handler is given access to memcg reserves and a period of time to respond; if that fails, then the kernel will kill something the next time we try to charge to the memcg. BTW, if you love the logic in the userland oom daemon, why you can't implement it in the kernel ? Does that do some pretty things other than sending SIGKILL ? Some do pretty things like collect stats and dump it before killing something, but we also want oom handlers that don't do SIGKILL at all. An example: we statically allocate hugepages at boot because we need a large percentage of memory to be backed by hugepages for a certain class of applications and it's only available at boot. We also have a userspace that runs on these machines that is shared between hugepage machines and non-hugepage machines. At times, this userspace becomes oom because the remainder of available memory is allocated as hugetlb pages when in reality they are unmapped and sitting in a free pool. In that case, our userspace oom handler wants to free those hugetlb pages back to the kernel down to a certain watermark and then opportunistically reallocate them to the pool when memory usage on the system is lower due to spikes in the userspace. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
(2013/06/14 19:12), David Rientjes wrote: On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote: Reading your discussion, I think I understand your requirements. The problem is that I can't think you took into all options into accounts and found the best way is this new oom_delay. IOW, I can't convice oom-delay is the best way to handle your issue. Ok, let's talk about it. I'm sorry that my RTT is long in these days. Your requeirement is - Allowing userland oom-handler within local memcg. Another requirement: - Allow userland oom handler for global oom conditions. Hopefully that's hooked into memcg because the functionality is already there, we can simply duplicate all of the oom functionality that we'll be adding for the root memcg. At mm-summit, it was discussed ant people seems to think user-land-oom-handler is impossible. Hm, and in-kernel scripting was discussed, as far as I remember. Considering straightforward, the answer should be - Allowing oom-handler daemon out of memcg's control by its limit. (For example, a flag/capability for a task can archive this.) Or attaching some *fixed* resource to the task rather than cgroup. Allow to set task->secret_saving=20M. Exactly! First of all, thanks very much for taking an interest in our usecase and discussing it with us. I didn't propose what I referred to earlier in the thread as "memcg reserves" because I thought it was going to be a more difficult battle. The fact that you brought it up first actually makes me think it's less insane :) We do indeed want memcg reserves and I have patches to add it if you'd like to see that first. It ensures that this userspace oom handler can actually do some work in determining which process to kill. The reserve is a fraction of true memory reserves (the space below the per-zone min watermarks) which is dependent on min_free_kbytes. This does indeed become more difficult with true and complete kmem charging. That "work" could be opening the tasks file (which allocates the pidlist within the kernel), checking /proc/pid/status for rss, checking for how long a process has been running, checking for tid, sending a signal to drop caches, etc. Considering only memcg, bypassing all charge-limit-check will work. But as you say, that will not work against global-oom. Then, in-kernel scripting was discussed. We'd also like to do this for global oom conditions, which makes it even more interesting. I was thinking of using a fraction of memory reserves as the oom killer currently does (that memory below the min watermark) for these purposes. Memory charging is simply bypassed for these oom handlers (we only grant access to those waiting on the memory.oom_control eventfd) up to memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think this is entirely insane because these oom handlers should lead to future memory freeing, just like TIF_MEMDIE processes. I think that kinds of bypassing is acceptable. Going back to your patch, what's confusing is your approach. Why the problem caused by the amount of memory should be solved by some dealy, i.e. the amount of time ? This exchanging sounds confusing to me. Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those "memcg reserves", How this happens ? the kernel needs to kill within that memcg. Do we do that above a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. I think we should allow userspace a reasonable amount of time to respond and then kill if it is exceeded. For the global oom case, we want to have a priority-based memcg selection. Select the lowest priority top-level memcg and kill within it. If it has an oom notifier, send it a signal to kill something. If it fails to react, kill something after memory.oom_delay_millisecs has elapsed. If there isn't a userspace oom notifier, kill something within that lowest priority memcg. Someone may be against that kind of control and say "Hey, I have better idea". That was another reason that oom-scirpiting was discussed. No one can implement general-purpose-victim-selection-logic. The bottomline with my approach is that I don't believe there is ever a reason for an oom memcg to remain oom indefinitely. That's why I hate memory.oom_control == 1 and I think for the global notification it would be deemed a nonstarter since you couldn't even login to the machine. I'm not against what you finally want to do, but I
Re: [patch] mm, memcg: add oom killer delay
(2013/06/14 19:12), David Rientjes wrote: On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote: Reading your discussion, I think I understand your requirements. The problem is that I can't think you took into all options into accounts and found the best way is this new oom_delay. IOW, I can't convice oom-delay is the best way to handle your issue. Ok, let's talk about it. I'm sorry that my RTT is long in these days. Your requeirement is - Allowing userland oom-handler within local memcg. Another requirement: - Allow userland oom handler for global oom conditions. Hopefully that's hooked into memcg because the functionality is already there, we can simply duplicate all of the oom functionality that we'll be adding for the root memcg. At mm-summit, it was discussed ant people seems to think user-land-oom-handler is impossible. Hm, and in-kernel scripting was discussed, as far as I remember. Considering straightforward, the answer should be - Allowing oom-handler daemon out of memcg's control by its limit. (For example, a flag/capability for a task can archive this.) Or attaching some *fixed* resource to the task rather than cgroup. Allow to set task-secret_saving=20M. Exactly! First of all, thanks very much for taking an interest in our usecase and discussing it with us. I didn't propose what I referred to earlier in the thread as memcg reserves because I thought it was going to be a more difficult battle. The fact that you brought it up first actually makes me think it's less insane :) We do indeed want memcg reserves and I have patches to add it if you'd like to see that first. It ensures that this userspace oom handler can actually do some work in determining which process to kill. The reserve is a fraction of true memory reserves (the space below the per-zone min watermarks) which is dependent on min_free_kbytes. This does indeed become more difficult with true and complete kmem charging. That work could be opening the tasks file (which allocates the pidlist within the kernel), checking /proc/pid/status for rss, checking for how long a process has been running, checking for tid, sending a signal to drop caches, etc. Considering only memcg, bypassing all charge-limit-check will work. But as you say, that will not work against global-oom. Then, in-kernel scripting was discussed. We'd also like to do this for global oom conditions, which makes it even more interesting. I was thinking of using a fraction of memory reserves as the oom killer currently does (that memory below the min watermark) for these purposes. Memory charging is simply bypassed for these oom handlers (we only grant access to those waiting on the memory.oom_control eventfd) up to memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think this is entirely insane because these oom handlers should lead to future memory freeing, just like TIF_MEMDIE processes. I think that kinds of bypassing is acceptable. Going back to your patch, what's confusing is your approach. Why the problem caused by the amount of memory should be solved by some dealy, i.e. the amount of time ? This exchanging sounds confusing to me. Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those memcg reserves, How this happens ? the kernel needs to kill within that memcg. Do we do that above a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. I think we should allow userspace a reasonable amount of time to respond and then kill if it is exceeded. For the global oom case, we want to have a priority-based memcg selection. Select the lowest priority top-level memcg and kill within it. If it has an oom notifier, send it a signal to kill something. If it fails to react, kill something after memory.oom_delay_millisecs has elapsed. If there isn't a userspace oom notifier, kill something within that lowest priority memcg. Someone may be against that kind of control and say Hey, I have better idea. That was another reason that oom-scirpiting was discussed. No one can implement general-purpose-victim-selection-logic. The bottomline with my approach is that I don't believe there is ever a reason for an oom memcg to remain oom indefinitely. That's why I hate memory.oom_control == 1 and I think for the global notification it would be deemed a nonstarter since you couldn't even login to the machine. I'm not against what you finally want to do, but I don't like
Re: [patch] mm, memcg: add oom killer delay
On Fri, 14 Jun 2013, David Rientjes wrote: > Even with all of the above (which is not actually that invasive of a > patch), I still think we need memory.oom_delay_millisecs. I probably made > a mistake in describing what that is addressing if it seems like it's > trying to address any of the above. > > If a userspace oom handler fails to respond even with access to those > "memcg reserves", the kernel needs to kill within that memcg. Do we do > that above a set time period (this patch) or when the reserves are > completely exhausted? That's debatable, but if we are to allow it for > global oom conditions as well then my opinion was to make it as safe as > possible; today, we can't disable the global oom killer from userspace and > I don't think we should ever allow it to be disabled. I think we should > allow userspace a reasonable amount of time to respond and then kill if it > is exceeded. > > For the global oom case, we want to have a priority-based memcg selection. > Select the lowest priority top-level memcg and kill within it. If it has > an oom notifier, send it a signal to kill something. If it fails to > react, kill something after memory.oom_delay_millisecs has elapsed. If > there isn't a userspace oom notifier, kill something within that lowest > priority memcg. > > The bottomline with my approach is that I don't believe there is ever a > reason for an oom memcg to remain oom indefinitely. That's why I hate > memory.oom_control == 1 and I think for the global notification it would > be deemed a nonstarter since you couldn't even login to the machine. > What's the status of this patch? I'd love to be able to introduce memcg reserves so that userspace oom notifications can actually work, but we still require a failsafe in the kernel if those reserves are depleted or userspace cannot respond. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 14 Jun 2013, David Rientjes wrote: Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those memcg reserves, the kernel needs to kill within that memcg. Do we do that above a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. I think we should allow userspace a reasonable amount of time to respond and then kill if it is exceeded. For the global oom case, we want to have a priority-based memcg selection. Select the lowest priority top-level memcg and kill within it. If it has an oom notifier, send it a signal to kill something. If it fails to react, kill something after memory.oom_delay_millisecs has elapsed. If there isn't a userspace oom notifier, kill something within that lowest priority memcg. The bottomline with my approach is that I don't believe there is ever a reason for an oom memcg to remain oom indefinitely. That's why I hate memory.oom_control == 1 and I think for the global notification it would be deemed a nonstarter since you couldn't even login to the machine. What's the status of this patch? I'd love to be able to introduce memcg reserves so that userspace oom notifications can actually work, but we still require a failsafe in the kernel if those reserves are depleted or userspace cannot respond. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote: > Reading your discussion, I think I understand your requirements. > The problem is that I can't think you took into all options into > accounts and found the best way is this new oom_delay. IOW, I can't > convice oom-delay is the best way to handle your issue. > Ok, let's talk about it. > Your requeirement is > - Allowing userland oom-handler within local memcg. > Another requirement: - Allow userland oom handler for global oom conditions. Hopefully that's hooked into memcg because the functionality is already there, we can simply duplicate all of the oom functionality that we'll be adding for the root memcg. > Considering straightforward, the answer should be > - Allowing oom-handler daemon out of memcg's control by its limit. >(For example, a flag/capability for a task can archive this.) >Or attaching some *fixed* resource to the task rather than cgroup. > >Allow to set task->secret_saving=20M. > Exactly! First of all, thanks very much for taking an interest in our usecase and discussing it with us. I didn't propose what I referred to earlier in the thread as "memcg reserves" because I thought it was going to be a more difficult battle. The fact that you brought it up first actually makes me think it's less insane :) We do indeed want memcg reserves and I have patches to add it if you'd like to see that first. It ensures that this userspace oom handler can actually do some work in determining which process to kill. The reserve is a fraction of true memory reserves (the space below the per-zone min watermarks) which is dependent on min_free_kbytes. This does indeed become more difficult with true and complete kmem charging. That "work" could be opening the tasks file (which allocates the pidlist within the kernel), checking /proc/pid/status for rss, checking for how long a process has been running, checking for tid, sending a signal to drop caches, etc. We'd also like to do this for global oom conditions, which makes it even more interesting. I was thinking of using a fraction of memory reserves as the oom killer currently does (that memory below the min watermark) for these purposes. Memory charging is simply bypassed for these oom handlers (we only grant access to those waiting on the memory.oom_control eventfd) up to memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think this is entirely insane because these oom handlers should lead to future memory freeing, just like TIF_MEMDIE processes. > Going back to your patch, what's confusing is your approach. > Why the problem caused by the amount of memory should be solved by > some dealy, i.e. the amount of time ? > > This exchanging sounds confusing to me. > Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those "memcg reserves", the kernel needs to kill within that memcg. Do we do that above a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. I think we should allow userspace a reasonable amount of time to respond and then kill if it is exceeded. For the global oom case, we want to have a priority-based memcg selection. Select the lowest priority top-level memcg and kill within it. If it has an oom notifier, send it a signal to kill something. If it fails to react, kill something after memory.oom_delay_millisecs has elapsed. If there isn't a userspace oom notifier, kill something within that lowest priority memcg. The bottomline with my approach is that I don't believe there is ever a reason for an oom memcg to remain oom indefinitely. That's why I hate memory.oom_control == 1 and I think for the global notification it would be deemed a nonstarter since you couldn't even login to the machine. > I'm not against what you finally want to do, but I don't like the fix. > I'm thrilled to hear that, and I hope we can work to make userspace oom handling more effective. What do you think about that above? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote: Reading your discussion, I think I understand your requirements. The problem is that I can't think you took into all options into accounts and found the best way is this new oom_delay. IOW, I can't convice oom-delay is the best way to handle your issue. Ok, let's talk about it. Your requeirement is - Allowing userland oom-handler within local memcg. Another requirement: - Allow userland oom handler for global oom conditions. Hopefully that's hooked into memcg because the functionality is already there, we can simply duplicate all of the oom functionality that we'll be adding for the root memcg. Considering straightforward, the answer should be - Allowing oom-handler daemon out of memcg's control by its limit. (For example, a flag/capability for a task can archive this.) Or attaching some *fixed* resource to the task rather than cgroup. Allow to set task-secret_saving=20M. Exactly! First of all, thanks very much for taking an interest in our usecase and discussing it with us. I didn't propose what I referred to earlier in the thread as memcg reserves because I thought it was going to be a more difficult battle. The fact that you brought it up first actually makes me think it's less insane :) We do indeed want memcg reserves and I have patches to add it if you'd like to see that first. It ensures that this userspace oom handler can actually do some work in determining which process to kill. The reserve is a fraction of true memory reserves (the space below the per-zone min watermarks) which is dependent on min_free_kbytes. This does indeed become more difficult with true and complete kmem charging. That work could be opening the tasks file (which allocates the pidlist within the kernel), checking /proc/pid/status for rss, checking for how long a process has been running, checking for tid, sending a signal to drop caches, etc. We'd also like to do this for global oom conditions, which makes it even more interesting. I was thinking of using a fraction of memory reserves as the oom killer currently does (that memory below the min watermark) for these purposes. Memory charging is simply bypassed for these oom handlers (we only grant access to those waiting on the memory.oom_control eventfd) up to memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think this is entirely insane because these oom handlers should lead to future memory freeing, just like TIF_MEMDIE processes. Going back to your patch, what's confusing is your approach. Why the problem caused by the amount of memory should be solved by some dealy, i.e. the amount of time ? This exchanging sounds confusing to me. Even with all of the above (which is not actually that invasive of a patch), I still think we need memory.oom_delay_millisecs. I probably made a mistake in describing what that is addressing if it seems like it's trying to address any of the above. If a userspace oom handler fails to respond even with access to those memcg reserves, the kernel needs to kill within that memcg. Do we do that above a set time period (this patch) or when the reserves are completely exhausted? That's debatable, but if we are to allow it for global oom conditions as well then my opinion was to make it as safe as possible; today, we can't disable the global oom killer from userspace and I don't think we should ever allow it to be disabled. I think we should allow userspace a reasonable amount of time to respond and then kill if it is exceeded. For the global oom case, we want to have a priority-based memcg selection. Select the lowest priority top-level memcg and kill within it. If it has an oom notifier, send it a signal to kill something. If it fails to react, kill something after memory.oom_delay_millisecs has elapsed. If there isn't a userspace oom notifier, kill something within that lowest priority memcg. The bottomline with my approach is that I don't believe there is ever a reason for an oom memcg to remain oom indefinitely. That's why I hate memory.oom_control == 1 and I think for the global notification it would be deemed a nonstarter since you couldn't even login to the machine. I'm not against what you finally want to do, but I don't like the fix. I'm thrilled to hear that, and I hope we can work to make userspace oom handling more effective. What do you think about that above? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
(2013/06/14 7:25), David Rientjes wrote: On Thu, 13 Jun 2013, Michal Hocko wrote: That's not at all the objective, the changelog quite explicitly states this is a deadlock as the result of userspace having disabled the oom killer so that its userspace oom handler can resolve the condition and it being unresponsive or unable to perform its job. Ohh, so another round. Sigh. You insist on having user space handlers running in the context of the limited group. OK, I can understand your use case, although I think it is pushing the limits of the interface and it is dangerous. Ok, this is where our misunderstanding is, and I can see why you have reacted the way you have. It's my fault for not describing where we're going with this. Reading your discussion, I think I understand your requirements. The problem is that I can't think you took into all options into accounts and found the best way is this new oom_delay. IOW, I can't convice oom-delay is the best way to handle your issue. Your requeirement is - Allowing userland oom-handler within local memcg. Considering straightforward, the answer should be - Allowing oom-handler daemon out of memcg's control by its limit. (For example, a flag/capability for a task can archive this.) Or attaching some *fixed* resource to the task rather than cgroup. Allow to set task->secret_saving=20M. Going back to your patch, what's confusing is your approach. Why the problem caused by the amount of memory should be solved by some dealy, i.e. the amount of time ? This exchanging sounds confusing to me. I'm not against what you finally want to do, but I don't like the fix. Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Thu, 13 Jun 2013, Michal Hocko wrote: > > That's not at all the objective, the changelog quite explicitly states > > this is a deadlock as the result of userspace having disabled the oom > > killer so that its userspace oom handler can resolve the condition and it > > being unresponsive or unable to perform its job. > > Ohh, so another round. Sigh. You insist on having user space handlers > running in the context of the limited group. OK, I can understand your > use case, although I think it is pushing the limits of the interface and > it is dangerous. Ok, this is where our misunderstanding is, and I can see why you have reacted the way you have. It's my fault for not describing where we're going with this. Userspace oom handling right now is very inadequate, not only for the reasons that you and Johannes have described. This requirement that userspace oom handlers must run with more memory than the memcgs it is monitoring is where the pain is. We have to accept the fact that users can be given control of their own memcg trees by chowning that "user root" directory to the user. The only real change you must enforce is that the limit cannot be increased unless you have a special capability. Otherwise, users can be given complete control over their own tree and enforce their own memory limits (or just accounting). I haven't seen any objection to that usecase from anybody, so I'm assuming that it's something that is understood and acknowledged. Nothing in the cgroups code prevents it and I believe it gives a much more powerful interface to the user to control their own destiny much like setting up root level memcgs for the admin. The problem is that we want to control global oom conditions as well. oom_score_adj simply doesn't allow for user-specific implementations either within user subcontainers or at the root. Users may very well want to kill the allocating task in their memcg to avoid expensive scans (SGI required oom_kill_allocating_task at a global level), kill the largest memory consumer, do a priority-based memcg scoring, kill the newest process, etc. That's the power of user oom notifications beyond just statistics collection or creating a tombstone to examine later. We want to do that at the root level as well for global oom conditions. We have a priority-based scoring method amongst top-level memcgs. They are jobs that are scheduled on machines and given a certain priority to run at. We also overcommit the machine so the sum of all top-level memory.limit_in_bytes exceeds the amount of physical RAM since jobs seldom use their entire reservation. When RAM is exhausted, we want to kill the lowest priority process from the lowest priority memcg. That's not possible to do currently because we lack global oom notification and userspace oom handling. We want to change that by notifying on the root memcg's memory.oom_control for global oom and wakeup a process that will handle global oom conditions. Memcg is the perfect vehicle for this since it already does userspace oom notification and because the oom condition in a "user root" can be handled the same as a "global oom". Waking up userspace and asking it to do anything useful in a global oom condition is problematic if it is unresponsive for whatever reason. Especially with true kmem accounting of all slab, we cannot be guaranteed that this global oom handler can actually do things like query the memory usage of a process or read a memcg's tasks file. For this reason, we want a kernel failsafe to ensure that if it isn't able to respond in a reasonable amount of time, the kernel frees memory itself. There's simply no other alternative to doing something like memory.oom_delay_millisecs then at the root level. For notification to actually work, we need to preempt the kernel at least for a certain amount of time from killing: the solution right now is to disable the oom killer entirely with memory.oom_control == 1. If the global oom handler fails the respond, the machine is essentially dead unless something starts freeing memory back for some strange reason. For our own priority-based memcg oom scoring implementation, this global oom handler would then signal the memcg oom handler in the lowest priority memcg, if it exists, to handle the situation or it would kill something from that memcg itself. The key is that without some sort of kernel-imposed delay, there's no way to sanely implement a global oom handler that doesn't risk killing the entire machine because the kernel oom killer is deferred. This patch makes that delay configurable. I could certainly push the patch that does the global oom notification on the root level memcg's memory.oom_control first if you'd like, but I didn't think there would be so much opposition to running a memcg oom handler in the memcg itself, which is what this essentially is. If that's done, then we can't simply rely on the global
Re: [patch] mm, memcg: add oom killer delay
On Wed 12-06-13 14:27:05, David Rientjes wrote: > On Wed, 12 Jun 2013, Michal Hocko wrote: > > > But the objective is to handle oom deadlocks gracefully and you cannot > > possibly miss those as they are, well, _deadlocks_. > > That's not at all the objective, the changelog quite explicitly states > this is a deadlock as the result of userspace having disabled the oom > killer so that its userspace oom handler can resolve the condition and it > being unresponsive or unable to perform its job. Ohh, so another round. Sigh. You insist on having user space handlers running in the context of the limited group. OK, I can understand your use case, although I think it is pushing the limits of the interface and it is dangerous. As the problems/deadlocks are unavoidable with this approach you really need a backup plan to reduce the damage once it happens. You insist on having in-kernel solution while there is a user space alternative possible IMO. > When you allow users to create their own memcgs, which we do and is > possible by chowning the user's root to be owned by it, and implement > their own userspace oom notifier, you must then rely on their > implementation to work 100% of the time, otherwise all those gigabytes of > memory go unfreed forever. What you're insisting on is that this > userspace is perfect No, I am not saying that. You can let your untrusted users handle their OOMs as you like. The watchdog (your fallback solution) _has_ to be trusted though and if you want it to do the job then you better implement it correctly. Is this requirement a problem? > and there is never any memory allocated (otherwise it may oom its own > user root memcg where the notifier is hosted) If the watchdog runs under root memcg then there is no limit so the only OOM that might happen is the global one and you can protect the watchdog by oom_adj as I have mentioned earlier. > and it is always responsive and able to handle the situation. If it is a trusted code then it can run with a real time priority. > This is not reality. It seems you just do not want to accept that there is other solution because the kernel solution sounds like an easier option for you. > This is why the kernel has its own oom killer and doesn't wait for a user > to go to kill something. There's no option to disable the kernel oom > killer. It's because we don't want to leave the system in a state where > no progress can be made. The same intention is for memcgs to not be left > in a state where no progress can be made even if userspace has the best > intentions. > > Your solution of a global entity to prevent these situations doesn't work > for the same reason we can't implement the kernel oom killer in userspace. > It's the exact same reason. No it is not! The core difference is that there is _always_ some memory for the watchdog (because something else might be killed to free some memory) while there is none for the global OOM. So the watchdog even doesn't need to mlock everything in. > We also want to push patches that allow global oom conditions to > trigger an eventfd notification on the root memcg with the exact same > semantics of a memcg oom: As already mentioned we have discussed this at LSF. I am still not sure it is the right thing to do though. The interface would be too tricky. There are other options to implement user defined policy for the global OOM and they should be considered before there is a decision to push it into memcg. > allow it time to respond but > step in and kill something if it fails to respond. Memcg happens to be > the perfect place to implement such a userspace policy and we want to have > a priority-based killing mechanism that is hierarchical and different from > oom_score_adj. It might sound perfect for your use cases but we should be _really_ careful to not pull another tricky interface into memcg that would fit a certain scenario. Remember use_hierarchy thingy? It sounded like a good idea at the time and it turned into a nightmare over time. It also aimed at solving a restriction at the time. The restriction is not here anymore AFAICT but we have a crippled hierarchy semantic. > For that to work properly, it cannot possibly allocate memory even on page > fault so it must be mlocked in memory and have enough buffers to store the > priorities of top-level memcgs. Asking a global watchdog to sit there > mlocked in memory to store thousands of memcgs, their priorities, their > last oom, their timeouts, etc, is a non-starter. I have asked you about an estimation already. I do not think that the memory consumption would really matter here. We are talking about few megs at most and even that is exaggerated. > I don't buy your argument that we're pushing any interface to an extreme. OK, call it a matter of taste but handling oom in the context of the oom itself without any requirements to the handler implementation and without access to any memory reserves because
Re: [patch] mm, memcg: add oom killer delay
On Wed 12-06-13 14:27:05, David Rientjes wrote: On Wed, 12 Jun 2013, Michal Hocko wrote: But the objective is to handle oom deadlocks gracefully and you cannot possibly miss those as they are, well, _deadlocks_. That's not at all the objective, the changelog quite explicitly states this is a deadlock as the result of userspace having disabled the oom killer so that its userspace oom handler can resolve the condition and it being unresponsive or unable to perform its job. Ohh, so another round. Sigh. You insist on having user space handlers running in the context of the limited group. OK, I can understand your use case, although I think it is pushing the limits of the interface and it is dangerous. As the problems/deadlocks are unavoidable with this approach you really need a backup plan to reduce the damage once it happens. You insist on having in-kernel solution while there is a user space alternative possible IMO. When you allow users to create their own memcgs, which we do and is possible by chowning the user's root to be owned by it, and implement their own userspace oom notifier, you must then rely on their implementation to work 100% of the time, otherwise all those gigabytes of memory go unfreed forever. What you're insisting on is that this userspace is perfect No, I am not saying that. You can let your untrusted users handle their OOMs as you like. The watchdog (your fallback solution) _has_ to be trusted though and if you want it to do the job then you better implement it correctly. Is this requirement a problem? and there is never any memory allocated (otherwise it may oom its own user root memcg where the notifier is hosted) If the watchdog runs under root memcg then there is no limit so the only OOM that might happen is the global one and you can protect the watchdog by oom_adj as I have mentioned earlier. and it is always responsive and able to handle the situation. If it is a trusted code then it can run with a real time priority. This is not reality. It seems you just do not want to accept that there is other solution because the kernel solution sounds like an easier option for you. This is why the kernel has its own oom killer and doesn't wait for a user to go to kill something. There's no option to disable the kernel oom killer. It's because we don't want to leave the system in a state where no progress can be made. The same intention is for memcgs to not be left in a state where no progress can be made even if userspace has the best intentions. Your solution of a global entity to prevent these situations doesn't work for the same reason we can't implement the kernel oom killer in userspace. It's the exact same reason. No it is not! The core difference is that there is _always_ some memory for the watchdog (because something else might be killed to free some memory) while there is none for the global OOM. So the watchdog even doesn't need to mlock everything in. We also want to push patches that allow global oom conditions to trigger an eventfd notification on the root memcg with the exact same semantics of a memcg oom: As already mentioned we have discussed this at LSF. I am still not sure it is the right thing to do though. The interface would be too tricky. There are other options to implement user defined policy for the global OOM and they should be considered before there is a decision to push it into memcg. allow it time to respond but step in and kill something if it fails to respond. Memcg happens to be the perfect place to implement such a userspace policy and we want to have a priority-based killing mechanism that is hierarchical and different from oom_score_adj. It might sound perfect for your use cases but we should be _really_ careful to not pull another tricky interface into memcg that would fit a certain scenario. Remember use_hierarchy thingy? It sounded like a good idea at the time and it turned into a nightmare over time. It also aimed at solving a restriction at the time. The restriction is not here anymore AFAICT but we have a crippled hierarchy semantic. For that to work properly, it cannot possibly allocate memory even on page fault so it must be mlocked in memory and have enough buffers to store the priorities of top-level memcgs. Asking a global watchdog to sit there mlocked in memory to store thousands of memcgs, their priorities, their last oom, their timeouts, etc, is a non-starter. I have asked you about an estimation already. I do not think that the memory consumption would really matter here. We are talking about few megs at most and even that is exaggerated. I don't buy your argument that we're pushing any interface to an extreme. OK, call it a matter of taste but handling oom in the context of the oom itself without any requirements to the handler implementation and without access to any memory reserves because the handler is not trusted feels weird and
Re: [patch] mm, memcg: add oom killer delay
On Thu, 13 Jun 2013, Michal Hocko wrote: That's not at all the objective, the changelog quite explicitly states this is a deadlock as the result of userspace having disabled the oom killer so that its userspace oom handler can resolve the condition and it being unresponsive or unable to perform its job. Ohh, so another round. Sigh. You insist on having user space handlers running in the context of the limited group. OK, I can understand your use case, although I think it is pushing the limits of the interface and it is dangerous. Ok, this is where our misunderstanding is, and I can see why you have reacted the way you have. It's my fault for not describing where we're going with this. Userspace oom handling right now is very inadequate, not only for the reasons that you and Johannes have described. This requirement that userspace oom handlers must run with more memory than the memcgs it is monitoring is where the pain is. We have to accept the fact that users can be given control of their own memcg trees by chowning that user root directory to the user. The only real change you must enforce is that the limit cannot be increased unless you have a special capability. Otherwise, users can be given complete control over their own tree and enforce their own memory limits (or just accounting). I haven't seen any objection to that usecase from anybody, so I'm assuming that it's something that is understood and acknowledged. Nothing in the cgroups code prevents it and I believe it gives a much more powerful interface to the user to control their own destiny much like setting up root level memcgs for the admin. The problem is that we want to control global oom conditions as well. oom_score_adj simply doesn't allow for user-specific implementations either within user subcontainers or at the root. Users may very well want to kill the allocating task in their memcg to avoid expensive scans (SGI required oom_kill_allocating_task at a global level), kill the largest memory consumer, do a priority-based memcg scoring, kill the newest process, etc. That's the power of user oom notifications beyond just statistics collection or creating a tombstone to examine later. We want to do that at the root level as well for global oom conditions. We have a priority-based scoring method amongst top-level memcgs. They are jobs that are scheduled on machines and given a certain priority to run at. We also overcommit the machine so the sum of all top-level memory.limit_in_bytes exceeds the amount of physical RAM since jobs seldom use their entire reservation. When RAM is exhausted, we want to kill the lowest priority process from the lowest priority memcg. That's not possible to do currently because we lack global oom notification and userspace oom handling. We want to change that by notifying on the root memcg's memory.oom_control for global oom and wakeup a process that will handle global oom conditions. Memcg is the perfect vehicle for this since it already does userspace oom notification and because the oom condition in a user root can be handled the same as a global oom. Waking up userspace and asking it to do anything useful in a global oom condition is problematic if it is unresponsive for whatever reason. Especially with true kmem accounting of all slab, we cannot be guaranteed that this global oom handler can actually do things like query the memory usage of a process or read a memcg's tasks file. For this reason, we want a kernel failsafe to ensure that if it isn't able to respond in a reasonable amount of time, the kernel frees memory itself. There's simply no other alternative to doing something like memory.oom_delay_millisecs then at the root level. For notification to actually work, we need to preempt the kernel at least for a certain amount of time from killing: the solution right now is to disable the oom killer entirely with memory.oom_control == 1. If the global oom handler fails the respond, the machine is essentially dead unless something starts freeing memory back for some strange reason. For our own priority-based memcg oom scoring implementation, this global oom handler would then signal the memcg oom handler in the lowest priority memcg, if it exists, to handle the situation or it would kill something from that memcg itself. The key is that without some sort of kernel-imposed delay, there's no way to sanely implement a global oom handler that doesn't risk killing the entire machine because the kernel oom killer is deferred. This patch makes that delay configurable. I could certainly push the patch that does the global oom notification on the root level memcg's memory.oom_control first if you'd like, but I didn't think there would be so much opposition to running a memcg oom handler in the memcg itself, which is what this essentially is. If that's done, then we can't simply rely on the global oom killer, which
Re: [patch] mm, memcg: add oom killer delay
(2013/06/14 7:25), David Rientjes wrote: On Thu, 13 Jun 2013, Michal Hocko wrote: That's not at all the objective, the changelog quite explicitly states this is a deadlock as the result of userspace having disabled the oom killer so that its userspace oom handler can resolve the condition and it being unresponsive or unable to perform its job. Ohh, so another round. Sigh. You insist on having user space handlers running in the context of the limited group. OK, I can understand your use case, although I think it is pushing the limits of the interface and it is dangerous. Ok, this is where our misunderstanding is, and I can see why you have reacted the way you have. It's my fault for not describing where we're going with this. Reading your discussion, I think I understand your requirements. The problem is that I can't think you took into all options into accounts and found the best way is this new oom_delay. IOW, I can't convice oom-delay is the best way to handle your issue. Your requeirement is - Allowing userland oom-handler within local memcg. Considering straightforward, the answer should be - Allowing oom-handler daemon out of memcg's control by its limit. (For example, a flag/capability for a task can archive this.) Or attaching some *fixed* resource to the task rather than cgroup. Allow to set task-secret_saving=20M. Going back to your patch, what's confusing is your approach. Why the problem caused by the amount of memory should be solved by some dealy, i.e. the amount of time ? This exchanging sounds confusing to me. I'm not against what you finally want to do, but I don't like the fix. Thanks, -Kame -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Wed, 12 Jun 2013, Michal Hocko wrote: > But the objective is to handle oom deadlocks gracefully and you cannot > possibly miss those as they are, well, _deadlocks_. That's not at all the objective, the changelog quite explicitly states this is a deadlock as the result of userspace having disabled the oom killer so that its userspace oom handler can resolve the condition and it being unresponsive or unable to perform its job. When you allow users to create their own memcgs, which we do and is possible by chowning the user's root to be owned by it, and implement their own userspace oom notifier, you must then rely on their implementation to work 100% of the time, otherwise all those gigabytes of memory go unfreed forever. What you're insisting on is that this userspace is perfect and there is never any memory allocated (otherwise it may oom its own user root memcg where the notifier is hosted) and it is always responsive and able to handle the situation. This is not reality. This is why the kernel has its own oom killer and doesn't wait for a user to go to kill something. There's no option to disable the kernel oom killer. It's because we don't want to leave the system in a state where no progress can be made. The same intention is for memcgs to not be left in a state where no progress can be made even if userspace has the best intentions. Your solution of a global entity to prevent these situations doesn't work for the same reason we can't implement the kernel oom killer in userspace. It's the exact same reason. We also want to push patches that allow global oom conditions to trigger an eventfd notification on the root memcg with the exact same semantics of a memcg oom: allow it time to respond but step in and kill something if it fails to respond. Memcg happens to be the perfect place to implement such a userspace policy and we want to have a priority-based killing mechanism that is hierarchical and different from oom_score_adj. For that to work properly, it cannot possibly allocate memory even on page fault so it must be mlocked in memory and have enough buffers to store the priorities of top-level memcgs. Asking a global watchdog to sit there mlocked in memory to store thousands of memcgs, their priorities, their last oom, their timeouts, etc, is a non-starter. I don't buy your argument that we're pushing any interface to an extreme. Users having the ability to manipulate their own memcgs and subcontainers isn't extreme, it's explicitly allowed by cgroups! What we're asking for is that level of control for memcg is sane and that if userspace is unresponsive that we don't lose gigabytes of memory forever. And since we've supported this type of functionality even before memcg was created for cpusets and have used and supported it for six years, I have no problem supporting such a thing upstream. I do understand that we're the largest user of memcg and use it unlike you or others on this thread do, but that doesn't mean our usecase is any less important or that we should aim for the most robust behavior possible. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 11-06-13 13:33:40, David Rientjes wrote: > On Mon, 10 Jun 2013, Michal Hocko wrote: [...] > > Your only objection against userspace handling so far was that admin > > has no control over sub-hierarchies. But that is hardly a problem. A > > periodic tree scan would solve this easily (just to make sure we are on > > the same page - this is still the global watchdog living in a trusted > > root cgroup which is marked as unkillable for the global oom). > > > > There's no way a "periodic tree scan" would solve this easily, it's > completely racy and leaves the possibility of missing oom events when they > happen in a new memcg before the next tree scan. But the objective is to handle oom deadlocks gracefully and you cannot possibly miss those as they are, well, _deadlocks_. So while the watchdog might really miss some oom events which have been already handled that is absolutely not important here. Please note that the eventfd is notified also during registration if the group is under oom already. So you cannot miss a deadlocked group. > Not only that, but what happens when the entire machine is oom? Then a memcg deadlock is irrelevant, isn't it? Deadlocked tasks are sitting in the KILLABLE queue so they can be killed. And you can protect the watchdog from being killed. > The global watchdog can't allocate any memory to handle the event, yet > it is supposed to always be doing entire memcg tree scans, registering > oom handlers, and handling wakeups when notified? So what? The global OOM will find a victim and the life goes on. I do not think that the watchdog consumption would be negligible (simple queue for timers, eventfd stuff for each existing group, few threads). > How does this "global watchdog" know when to stop the timer? In the > kernel that would happen if the user expands the memcg limit or there's an > uncharge to the memcg. Then the group won't be marked under_oom in memory.oom_control at the time when timeout triggers so the watchdog knows that the oom has been handled already and it can be discarded. > The global watchdog stores the limit at the time of oom and compares > it before killing something? Why it would need to check the limit? It cares only about the oom events. > How many memcgs do we have to store this value for without allocating > memory in a global oom condition? You expect us to run with all this > memory mlocked in memory forever? No, the only thing that the watchdog has to care about is memory.oom_control. > How does the global watchdog know that there was an uncharge and then a > charge in the memcg so it is still oom? It will get a new event for every new oom. > This would reset the timer in the kernel but not in userspace and may > result in unnecessary killing. User space watchdog would do the same thing. When it receives an event it will enqueue or requeue the corresponding timer. > If the timeout is 10s, the memcg is > oom for 9s, uncharges, recharges, and the global watchdog checks at > 10s that it is still oom, we don't want the kill because we do get > uncharge events. > > This idea for a global watchdog just doesn't work. I am not convinced about that. Maybe I am missing some aspect but all problems you have mentioned are _solvable_. [...] > > David, maybe I am still missing something but it seems to me that the > > in-kernel timeout is not necessary because the user-space variant is not > > that hard to implement and I really do not want to add new knobs for > > something that can live outside. > > It's vitally necessary and unless you answer the questions I've asked > about your proposed "global watchdog" that exists only in your theory then Which questions are still not answered? > we'll just continue to have this circular argument. You cannot implement > a userspace variant that works this way and insisting that you can is > ridiculous. These are problems that real users face and we insist on the > problem being addressed. Your use case pushes the interface to extreme, to be honest. You are proposing a new knob that would have to be supported for ever. I feel that we should be careful and prefer user space solution if there is any. To be clear, I am not nacking this patch and I will not ack it either. If Johannes, Kamezawa or Andrew are OK with it I will not block it. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 11-06-13 13:33:40, David Rientjes wrote: On Mon, 10 Jun 2013, Michal Hocko wrote: [...] Your only objection against userspace handling so far was that admin has no control over sub-hierarchies. But that is hardly a problem. A periodic tree scan would solve this easily (just to make sure we are on the same page - this is still the global watchdog living in a trusted root cgroup which is marked as unkillable for the global oom). There's no way a periodic tree scan would solve this easily, it's completely racy and leaves the possibility of missing oom events when they happen in a new memcg before the next tree scan. But the objective is to handle oom deadlocks gracefully and you cannot possibly miss those as they are, well, _deadlocks_. So while the watchdog might really miss some oom events which have been already handled that is absolutely not important here. Please note that the eventfd is notified also during registration if the group is under oom already. So you cannot miss a deadlocked group. Not only that, but what happens when the entire machine is oom? Then a memcg deadlock is irrelevant, isn't it? Deadlocked tasks are sitting in the KILLABLE queue so they can be killed. And you can protect the watchdog from being killed. The global watchdog can't allocate any memory to handle the event, yet it is supposed to always be doing entire memcg tree scans, registering oom handlers, and handling wakeups when notified? So what? The global OOM will find a victim and the life goes on. I do not think that the watchdog consumption would be negligible (simple queue for timers, eventfd stuff for each existing group, few threads). How does this global watchdog know when to stop the timer? In the kernel that would happen if the user expands the memcg limit or there's an uncharge to the memcg. Then the group won't be marked under_oom in memory.oom_control at the time when timeout triggers so the watchdog knows that the oom has been handled already and it can be discarded. The global watchdog stores the limit at the time of oom and compares it before killing something? Why it would need to check the limit? It cares only about the oom events. How many memcgs do we have to store this value for without allocating memory in a global oom condition? You expect us to run with all this memory mlocked in memory forever? No, the only thing that the watchdog has to care about is memory.oom_control. How does the global watchdog know that there was an uncharge and then a charge in the memcg so it is still oom? It will get a new event for every new oom. This would reset the timer in the kernel but not in userspace and may result in unnecessary killing. User space watchdog would do the same thing. When it receives an event it will enqueue or requeue the corresponding timer. If the timeout is 10s, the memcg is oom for 9s, uncharges, recharges, and the global watchdog checks at 10s that it is still oom, we don't want the kill because we do get uncharge events. This idea for a global watchdog just doesn't work. I am not convinced about that. Maybe I am missing some aspect but all problems you have mentioned are _solvable_. [...] David, maybe I am still missing something but it seems to me that the in-kernel timeout is not necessary because the user-space variant is not that hard to implement and I really do not want to add new knobs for something that can live outside. It's vitally necessary and unless you answer the questions I've asked about your proposed global watchdog that exists only in your theory then Which questions are still not answered? we'll just continue to have this circular argument. You cannot implement a userspace variant that works this way and insisting that you can is ridiculous. These are problems that real users face and we insist on the problem being addressed. Your use case pushes the interface to extreme, to be honest. You are proposing a new knob that would have to be supported for ever. I feel that we should be careful and prefer user space solution if there is any. To be clear, I am not nacking this patch and I will not ack it either. If Johannes, Kamezawa or Andrew are OK with it I will not block it. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Wed, 12 Jun 2013, Michal Hocko wrote: But the objective is to handle oom deadlocks gracefully and you cannot possibly miss those as they are, well, _deadlocks_. That's not at all the objective, the changelog quite explicitly states this is a deadlock as the result of userspace having disabled the oom killer so that its userspace oom handler can resolve the condition and it being unresponsive or unable to perform its job. When you allow users to create their own memcgs, which we do and is possible by chowning the user's root to be owned by it, and implement their own userspace oom notifier, you must then rely on their implementation to work 100% of the time, otherwise all those gigabytes of memory go unfreed forever. What you're insisting on is that this userspace is perfect and there is never any memory allocated (otherwise it may oom its own user root memcg where the notifier is hosted) and it is always responsive and able to handle the situation. This is not reality. This is why the kernel has its own oom killer and doesn't wait for a user to go to kill something. There's no option to disable the kernel oom killer. It's because we don't want to leave the system in a state where no progress can be made. The same intention is for memcgs to not be left in a state where no progress can be made even if userspace has the best intentions. Your solution of a global entity to prevent these situations doesn't work for the same reason we can't implement the kernel oom killer in userspace. It's the exact same reason. We also want to push patches that allow global oom conditions to trigger an eventfd notification on the root memcg with the exact same semantics of a memcg oom: allow it time to respond but step in and kill something if it fails to respond. Memcg happens to be the perfect place to implement such a userspace policy and we want to have a priority-based killing mechanism that is hierarchical and different from oom_score_adj. For that to work properly, it cannot possibly allocate memory even on page fault so it must be mlocked in memory and have enough buffers to store the priorities of top-level memcgs. Asking a global watchdog to sit there mlocked in memory to store thousands of memcgs, their priorities, their last oom, their timeouts, etc, is a non-starter. I don't buy your argument that we're pushing any interface to an extreme. Users having the ability to manipulate their own memcgs and subcontainers isn't extreme, it's explicitly allowed by cgroups! What we're asking for is that level of control for memcg is sane and that if userspace is unresponsive that we don't lose gigabytes of memory forever. And since we've supported this type of functionality even before memcg was created for cpusets and have used and supported it for six years, I have no problem supporting such a thing upstream. I do understand that we're the largest user of memcg and use it unlike you or others on this thread do, but that doesn't mean our usecase is any less important or that we should aim for the most robust behavior possible. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, 10 Jun 2013, Michal Hocko wrote: > > > > I don't think you yet understand the problem, which is probably my > > > > fault. > > > > I'm not insisting this be implemented in the kernel, I'm saying it's > > > > not > > > > possible to do it in userspace. > > > > > > Because you still insist on having this fallback mode running inside > > > untrusted environment AFAIU. > > > > > > > -ENOPARSE. > > Either the global watchdog is running as a trusted code and you _can_ > implement it without dealocks (this is what I claim and I haven't heard > strong arguments against that) or even the watchdog runs as an untrusted > code and then I would ask why. > As I already said, no global entity could possibly know and monitor the entire set of memcgs when users are given the power to create their own child memcgs within their tree. Not only would it be unnecessarily expensive to scan the entire memcg tree all the time, but it would also be racy and miss oom events when a child memcg is created and hits its limit before the global entity can scan for it. > Your only objection against userspace handling so far was that admin > has no control over sub-hierarchies. But that is hardly a problem. A > periodic tree scan would solve this easily (just to make sure we are on > the same page - this is still the global watchdog living in a trusted > root cgroup which is marked as unkillable for the global oom). > There's no way a "periodic tree scan" would solve this easily, it's completely racy and leaves the possibility of missing oom events when they happen in a new memcg before the next tree scan. Not only that, but what happens when the entire machine is oom? The global watchdog can't allocate any memory to handle the event, yet it is supposed to always be doing entire memcg tree scans, registering oom handlers, and handling wakeups when notified? How does this "global watchdog" know when to stop the timer? In the kernel that would happen if the user expands the memcg limit or there's an uncharge to the memcg. The global watchdog stores the limit at the time of oom and compares it before killing something? How many memcgs do we have to store this value for without allocating memory in a global oom condition? You expect us to run with all this memory mlocked in memory forever? How does the global watchdog know that there was an uncharge and then a charge in the memcg so it is still oom? This would reset the timer in the kernel but not in userspace and may result in unnecessary killing. If the timeout is 10s, the memcg is oom for 9s, uncharges, recharges, and the global watchdog checks at 10s that it is still oom, we don't want the kill because we do get uncharge events. This idea for a global watchdog just doesn't work. > > Meanwhile, the trusted resource has no knowledge whatsoever of these > > user subcontainers and it can't infinitely scan the memcg tree to find > > them because that requires memory that may not be available because of > > global oom or because of slab accounting. > > Global oom can be handled by oom_adj for the global watchdog and > slab accounting should be non issue as the limit cannot be set for the > root cgroup - or the watchdog can live in an unlimited group. > I'm referring to a global oom condition where physical RAM is exhausted, not anything to do with memcg. > Besides that. How much memory are we talking about here (per > memcg)? Have you measure that? Is it possible that your untrusted > users could cause a DoS by creating too many groups? I would be really > surprised and argument about global watchdog quality then. > We limit the number of css_id in children but not the top level of memcgs that are created by the trusted resource. > David, maybe I am still missing something but it seems to me that the > in-kernel timeout is not necessary because the user-space variant is not > that hard to implement and I really do not want to add new knobs for > something that can live outside. It's vitally necessary and unless you answer the questions I've asked about your proposed "global watchdog" that exists only in your theory then we'll just continue to have this circular argument. You cannot implement a userspace variant that works this way and insisting that you can is ridiculous. These are problems that real users face and we insist on the problem being addressed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, 10 Jun 2013, Michal Hocko wrote: I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Because you still insist on having this fallback mode running inside untrusted environment AFAIU. -ENOPARSE. Either the global watchdog is running as a trusted code and you _can_ implement it without dealocks (this is what I claim and I haven't heard strong arguments against that) or even the watchdog runs as an untrusted code and then I would ask why. As I already said, no global entity could possibly know and monitor the entire set of memcgs when users are given the power to create their own child memcgs within their tree. Not only would it be unnecessarily expensive to scan the entire memcg tree all the time, but it would also be racy and miss oom events when a child memcg is created and hits its limit before the global entity can scan for it. Your only objection against userspace handling so far was that admin has no control over sub-hierarchies. But that is hardly a problem. A periodic tree scan would solve this easily (just to make sure we are on the same page - this is still the global watchdog living in a trusted root cgroup which is marked as unkillable for the global oom). There's no way a periodic tree scan would solve this easily, it's completely racy and leaves the possibility of missing oom events when they happen in a new memcg before the next tree scan. Not only that, but what happens when the entire machine is oom? The global watchdog can't allocate any memory to handle the event, yet it is supposed to always be doing entire memcg tree scans, registering oom handlers, and handling wakeups when notified? How does this global watchdog know when to stop the timer? In the kernel that would happen if the user expands the memcg limit or there's an uncharge to the memcg. The global watchdog stores the limit at the time of oom and compares it before killing something? How many memcgs do we have to store this value for without allocating memory in a global oom condition? You expect us to run with all this memory mlocked in memory forever? How does the global watchdog know that there was an uncharge and then a charge in the memcg so it is still oom? This would reset the timer in the kernel but not in userspace and may result in unnecessary killing. If the timeout is 10s, the memcg is oom for 9s, uncharges, recharges, and the global watchdog checks at 10s that it is still oom, we don't want the kill because we do get uncharge events. This idea for a global watchdog just doesn't work. Meanwhile, the trusted resource has no knowledge whatsoever of these user subcontainers and it can't infinitely scan the memcg tree to find them because that requires memory that may not be available because of global oom or because of slab accounting. Global oom can be handled by oom_adj for the global watchdog and slab accounting should be non issue as the limit cannot be set for the root cgroup - or the watchdog can live in an unlimited group. I'm referring to a global oom condition where physical RAM is exhausted, not anything to do with memcg. Besides that. How much memory are we talking about here (per memcg)? Have you measure that? Is it possible that your untrusted users could cause a DoS by creating too many groups? I would be really surprised and argument about global watchdog quality then. We limit the number of css_id in children but not the top level of memcgs that are created by the trusted resource. David, maybe I am still missing something but it seems to me that the in-kernel timeout is not necessary because the user-space variant is not that hard to implement and I really do not want to add new knobs for something that can live outside. It's vitally necessary and unless you answer the questions I've asked about your proposed global watchdog that exists only in your theory then we'll just continue to have this circular argument. You cannot implement a userspace variant that works this way and insisting that you can is ridiculous. These are problems that real users face and we insist on the problem being addressed. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Wed 05-06-13 17:09:17, David Rientjes wrote: > On Wed, 5 Jun 2013, Michal Hocko wrote: > > > > For the aforementioned reason that we give users the ability to > > > manipulate > > > their own memcg trees and userspace is untrusted. Their oom notifiers > > > cannot be run as root, not only because of security but also because it > > > would not properly isolate their memory usage to their memcg tree. > > > > Yes, but nothing prevents an admin - I hope you trust at least this > > entity - to do the global watchdog for the fallback mode. So users can > > play as they like but if they are not able to cope with the oom > > situation for the defined timeout then the global (trusted and running > > in the root memcg) watchdog re-enables in-kernel oom handler. > > > > Users have the full ability to manipulate their own memcg hierarchy > under the root, the global entity that schedules these jobs is not aware > of any user subcontainers that are created beneath the user root. These > user subcontainers may be oom and our desire is for the user to be able to > have their own userspace handling implementation at a higher level (or > with memcg memory reserves). Userspace is untrusted, they can't be > expected to register an oom notifier for a child memcg with a global > resource, they may not care that they deadlock and leave behind gigabytes > of memory that can't be freed if they oom. And, if that userspace global > resource dies or becomes unresponsive itself for whatever reason, all oom > memcgs deadlock. > > > > I don't think you yet understand the problem, which is probably my fault. > > > > > > I'm not insisting this be implemented in the kernel, I'm saying it's not > > > possible to do it in userspace. > > > > Because you still insist on having this fallback mode running inside > > untrusted environment AFAIU. > > > > -ENOPARSE. Either the global watchdog is running as a trusted code and you _can_ implement it without dealocks (this is what I claim and I haven't heard strong arguments against that) or even the watchdog runs as an untrusted code and then I would ask why. > The failsafe is the kernel, it ensures that memcgs don't sit > completely deadlocked for days and weeks and take up resources that > can never be freed. Yes, I _understand_ what you are proposing. Your only objection against userspace handling so far was that admin has no control over sub-hierarchies. But that is hardly a problem. A periodic tree scan would solve this easily (just to make sure we are on the same page - this is still the global watchdog living in a trusted root cgroup which is marked as unkillable for the global oom). > The entire purpose of userspace oom notification is so > that users can implement their own policy, whatever is implemented in the > kernel may not apply (they may want to kill the largest process, the > newest, the youngest, one on a priority-based scale, etc). > > > > This is the result of memcg allowing users to disable the oom killer > > > entirely for a memcg, which is still ridiculous, because if the user > > > doesn't respond then you've wasted all that memory and cannot get it back > > > without admin intervention or a reboot. There are no other "features" in > > > the kernel that put such a responsibility on a userspace process such > > > that > > > if it doesn't work then the entire memcg deadlocks forever without admin > > > intervention. We need a failsafe in the kernel. > > > > But the memcg would deadlock within constrains assigned from somebody > > trusted. So while the memory is basically wasted the limit assignment > > already says that somebody (trusted) dedicated that much of memory. So I > > think disabling oom for ever is not that ridiculous. > > > > I don't understand what you're trying to say. Yes, a trusted resource > sets the user root's limits and that is their allotted use. To implement > a sane userspace oom handler, we need to give it time to respond; my > solution is memory.oom_delay_millisecs, your solution is disabling the oom > killer for that memcg. No! My solution is to let (trusted) userspace handle oom_delay_millisecs as a fallback if user space is not able to handle oom in time. > Anything else results in an instant oom kill from the kernel. If the > user has their own implementation, with today's kernel it is required > to disable the oom killer entirely and nothing in that untrusted > environment is ever guaranteed to re-enable the oom killer or even > have the memory to do so if it wanted. > Meanwhile, the trusted resource has no knowledge whatsoever of these > user subcontainers and it can't infinitely scan the memcg tree to find > them because that requires memory that may not be available because of > global oom or because of slab accounting. Global oom can be handled by oom_adj for the
Re: [patch] mm, memcg: add oom killer delay
On Wed 05-06-13 17:09:17, David Rientjes wrote: On Wed, 5 Jun 2013, Michal Hocko wrote: For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. Yes, but nothing prevents an admin - I hope you trust at least this entity - to do the global watchdog for the fallback mode. So users can play as they like but if they are not able to cope with the oom situation for the defined timeout then the global (trusted and running in the root memcg) watchdog re-enables in-kernel oom handler. Users have the full ability to manipulate their own memcg hierarchy under the root, the global entity that schedules these jobs is not aware of any user subcontainers that are created beneath the user root. These user subcontainers may be oom and our desire is for the user to be able to have their own userspace handling implementation at a higher level (or with memcg memory reserves). Userspace is untrusted, they can't be expected to register an oom notifier for a child memcg with a global resource, they may not care that they deadlock and leave behind gigabytes of memory that can't be freed if they oom. And, if that userspace global resource dies or becomes unresponsive itself for whatever reason, all oom memcgs deadlock. I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Because you still insist on having this fallback mode running inside untrusted environment AFAIU. -ENOPARSE. Either the global watchdog is running as a trusted code and you _can_ implement it without dealocks (this is what I claim and I haven't heard strong arguments against that) or even the watchdog runs as an untrusted code and then I would ask why. The failsafe is the kernel, it ensures that memcgs don't sit completely deadlocked for days and weeks and take up resources that can never be freed. Yes, I _understand_ what you are proposing. Your only objection against userspace handling so far was that admin has no control over sub-hierarchies. But that is hardly a problem. A periodic tree scan would solve this easily (just to make sure we are on the same page - this is still the global watchdog living in a trusted root cgroup which is marked as unkillable for the global oom). The entire purpose of userspace oom notification is so that users can implement their own policy, whatever is implemented in the kernel may not apply (they may want to kill the largest process, the newest, the youngest, one on a priority-based scale, etc). This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other features in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel. But the memcg would deadlock within constrains assigned from somebody trusted. So while the memory is basically wasted the limit assignment already says that somebody (trusted) dedicated that much of memory. So I think disabling oom for ever is not that ridiculous. I don't understand what you're trying to say. Yes, a trusted resource sets the user root's limits and that is their allotted use. To implement a sane userspace oom handler, we need to give it time to respond; my solution is memory.oom_delay_millisecs, your solution is disabling the oom killer for that memcg. No! My solution is to let (trusted) userspace handle oom_delay_millisecs as a fallback if user space is not able to handle oom in time. Anything else results in an instant oom kill from the kernel. If the user has their own implementation, with today's kernel it is required to disable the oom killer entirely and nothing in that untrusted environment is ever guaranteed to re-enable the oom killer or even have the memory to do so if it wanted. Meanwhile, the trusted resource has no knowledge whatsoever of these user subcontainers and it can't infinitely scan the memcg tree to find them because that requires memory that may not be available because of global oom or because of slab accounting. Global oom can be handled by oom_adj for the global watchdog and slab accounting should be non issue as the limit cannot be set for the root cgroup - or the watchdog can live
Re: [patch] mm, memcg: add oom killer delay
On Wed, 5 Jun 2013, Michal Hocko wrote: > > For the aforementioned reason that we give users the ability to manipulate > > their own memcg trees and userspace is untrusted. Their oom notifiers > > cannot be run as root, not only because of security but also because it > > would not properly isolate their memory usage to their memcg tree. > > Yes, but nothing prevents an admin - I hope you trust at least this > entity - to do the global watchdog for the fallback mode. So users can > play as they like but if they are not able to cope with the oom > situation for the defined timeout then the global (trusted and running > in the root memcg) watchdog re-enables in-kernel oom handler. > Users have the full ability to manipulate their own memcg hierarchy under the root, the global entity that schedules these jobs is not aware of any user subcontainers that are created beneath the user root. These user subcontainers may be oom and our desire is for the user to be able to have their own userspace handling implementation at a higher level (or with memcg memory reserves). Userspace is untrusted, they can't be expected to register an oom notifier for a child memcg with a global resource, they may not care that they deadlock and leave behind gigabytes of memory that can't be freed if they oom. And, if that userspace global resource dies or becomes unresponsive itself for whatever reason, all oom memcgs deadlock. > > I don't think you yet understand the problem, which is probably my fault. > > I'm not insisting this be implemented in the kernel, I'm saying it's not > > possible to do it in userspace. > > Because you still insist on having this fallback mode running inside > untrusted environment AFAIU. > -ENOPARSE. The failsafe is the kernel, it ensures that memcgs don't sit completely deadlocked for days and weeks and take up resources that can never be freed. The entire purpose of userspace oom notification is so that users can implement their own policy, whatever is implemented in the kernel may not apply (they may want to kill the largest process, the newest, the youngest, one on a priority-based scale, etc). > > This is the result of memcg allowing users to disable the oom killer > > entirely for a memcg, which is still ridiculous, because if the user > > doesn't respond then you've wasted all that memory and cannot get it back > > without admin intervention or a reboot. There are no other "features" in > > the kernel that put such a responsibility on a userspace process such that > > if it doesn't work then the entire memcg deadlocks forever without admin > > intervention. We need a failsafe in the kernel. > > But the memcg would deadlock within constrains assigned from somebody > trusted. So while the memory is basically wasted the limit assignment > already says that somebody (trusted) dedicated that much of memory. So I > think disabling oom for ever is not that ridiculous. > I don't understand what you're trying to say. Yes, a trusted resource sets the user root's limits and that is their allotted use. To implement a sane userspace oom handler, we need to give it time to respond; my solution is memory.oom_delay_millisecs, your solution is disabling the oom killer for that memcg. Anything else results in an instant oom kill from the kernel. If the user has their own implementation, with today's kernel it is required to disable the oom killer entirely and nothing in that untrusted environment is ever guaranteed to re-enable the oom killer or even have the memory to do so if it wanted. Meanwhile, the trusted resource has no knowledge whatsoever of these user subcontainers and it can't infinitely scan the memcg tree to find them because that requires memory that may not be available because of global oom or because of slab accounting. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 14:48:52, Johannes Weiner wrote: > On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: [...] > > Now that I am looking at this again I've realized that this > > is not correct. The task which triggers memcg OOM will not > > have memcg_oom.memcg set so it would trigger a global OOM in > > pagefault_out_of_memory. Either we should return CHARGE_RETRY (and > > propagate it via mem_cgroup_do_charge) for need_to_kill or set up > > current->memcg_oom also for need_to_kill. > > You are absolutely right! I fixed it with a separate bit, like so: > > + struct memcg_oom_info { > + unsigned int in_userfault:1; > + unsigned int in_memcg_oom:1; > + int wakeups; > + struct mem_cgroup *wait_on_memcg; > + } memcg_oom; > > in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg > OOM and that it should always return true to not invoke the global > killer. But wait_on_memcg might or might not be set, depending if the > task needs to sleep. > > So the task invoking the memcg OOM killer won't sleep but retry the > fault instead (everybody else might be sleeping and the first OOM > killer invocation might not have freed anything). OK, I was about to say that you should just return CHARGE_RETRY but that doesn't solve the locking issue because although the task which called mem_cgroup_out_of_memory still holds lock which might prevent the victim from terminating. Maybe worth a note in the changelog? [...] > > > @@ -3760,22 +3761,14 @@ unlock: > > > /* > > > * By the time we get here, we already hold the mm semaphore > > > */ > > > -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, > > > - unsigned long address, unsigned int flags) > > > +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct > > > *vma, > > > + unsigned long address, unsigned int flags) > > > > Is this reusable? Who would call this helper or is it just for the code > > readability? I would probably prefer a smaller patch but I do not have a > > strong opinion on this. > > I just figured I'd move all the task-specific setup into the outer > function and leave this inner one with the actual page table stuff. > > The function is not called from somewhere else, but I don't see a > problem with splitting it into logical parts and have the individual > parts more readable. The alternative would have been to make every > return in handle_mm_fault() a { ret = -EFOO; goto out }, and that > would have been harder to follow. OK [...] > From: Johannes Weiner > Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context > > The memcg OOM handling is incredibly fragile because once a memcg goes > OOM, one task (kernel or userspace) is responsible for resolving the > situation. Every other task that gets caught trying to charge memory > gets stuck in a waitqueue while potentially holding various filesystem > and mm locks on which the OOM handling task may now deadlock. > > Do two things: > > 1. When OOMing in a system call (buffered IO and friends), invoke the >OOM killer but just return -ENOMEM, never sleep. Userspace should >be able to handle this. > > 2. When OOMing in a page fault and somebody else is handling the >situation, do not sleep directly in the charging code. Instead, >remember the OOMing memcg in the task struct and then fully unwind >the page fault stack first before going to sleep. > > While reworking the OOM routine, also remove a needless OOM waitqueue > wakeup when invoking the killer. Only uncharges and limit increases, > things that actually change the memory situation, should do wakeups. > > Signed-off-by: Johannes Weiner OK this one looks good. Bitfields is not the nicests piece of art but this is just a minor thing. With the remaining archs moved to pagefault_out_of_memory you can add Reviewed-by: Michal Hocko Thanks! > --- > include/linux/memcontrol.h | 22 +++ > include/linux/mm.h | 1 + > include/linux/sched.h | 6 ++ > mm/ksm.c | 2 +- > mm/memcontrol.c| 146 > - > mm/memory.c| 40 + > mm/oom_kill.c | 7 ++- > 7 files changed, 154 insertions(+), 70 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index d6183f0..b8ec9d1 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -121,6 +121,15 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec > *lruvec, enum lru_list); > void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > +static inline void mem_cgroup_set_userfault(struct task_struct *p) > +{ > + p->memcg_oom.in_userfault = 1; > +} > +static
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 23:40:16, David Rientjes wrote: > On Tue, 4 Jun 2013, Michal Hocko wrote: > > > > I'm not sure a userspace oom notifier would want to keep a > > > preallocated buffer around that is mlocked in memory for all possible > > > lengths of this file. > > > > Well, an oom handler which allocates memory under the same restricted > > memory doesn't make much sense to me. Tracking all kmem allocations > > makes it almost impossible to implement a non-trivial handler. > > > > This isn't only about oom notifiers that reside in the same oom memcg, > they can be at a higher level or a different subtree entirely. The point > is that they can be unresponsive either because userspace is flaky, its > oom itself, or we do track all slab. The kernel is the only thing in the > position to fix the issue after a sensible user-defined amount of time has > elapsed, and that's what this patch is. > > > OK, maybe I just wasn't clear enough or I am missing your point. Your > > users _can_ implement and register their oom handlers. But as your > > requirements are rather benevolent for handlers implementation you would > > have a global watchdog which would sit on the oom_control of those > > groups (which are allowed to have own handlers - all of them in your > > case I guess) and trigger (user defined/global) timeout when it gets a > > notification. If the group was under oom always during the timeout then > > just disable oom_control until oom is settled (under_oom is 0). > > > > Why wouldn't something like this work for your use case? > > > > For the aforementioned reason that we give users the ability to manipulate > their own memcg trees and userspace is untrusted. Their oom notifiers > cannot be run as root, not only because of security but also because it > would not properly isolate their memory usage to their memcg tree. Yes, but nothing prevents an admin - I hope you trust at least this entity - to do the global watchdog for the fallback mode. So users can play as they like but if they are not able to cope with the oom situation for the defined timeout then the global (trusted and running in the root memcg) watchdog re-enables in-kernel oom handler. Maybe I just do not see why is this setup problem for your setup. > > Hohmm, so you are insisting on something that can be implemented in the > > userspace and put it into the kernel because it is more convenient for > > you and your use case. This doesn't sound like a way for accepting a > > feature. > > > > I don't think you yet understand the problem, which is probably my fault. > I'm not insisting this be implemented in the kernel, I'm saying it's not > possible to do it in userspace. Because you still insist on having this fallback mode running inside untrusted environment AFAIU. > Your idea of a timeout implemented in userspace doesn't work in > practice: userspace is both untrusted and cannot be guaranteed to be > perfect and always wakeup, get the information it does according to > its implementation, and issue a SIGKILL. > > This is the result of memcg allowing users to disable the oom killer > entirely for a memcg, which is still ridiculous, because if the user > doesn't respond then you've wasted all that memory and cannot get it back > without admin intervention or a reboot. There are no other "features" in > the kernel that put such a responsibility on a userspace process such that > if it doesn't work then the entire memcg deadlocks forever without admin > intervention. We need a failsafe in the kernel. But the memcg would deadlock within constrains assigned from somebody trusted. So while the memory is basically wasted the limit assignment already says that somebody (trusted) dedicated that much of memory. So I think disabling oom for ever is not that ridiculous. > Real users like this cannot run as root, and we cannot run in the root > memcg without charging that memory usage to the user's container for that > share of a global resource. Real users do have to tollerate buggy and > flaky userspace implementations that cannot be guaranteed to run or do > what they are supposed to do. It's too costly of a problem to not address > with a failsafe. I speak strictly from experience on this. > > > And yes we should make memcg oom handling less deadlock prone and > > Johannes' work in this thread is a good step forward. > > The long-term solution to that, which I already have patches for, is > something you would cringe even more at: memcg memory reserves that are > shared with per-zone memory reserves that get the global oom killer to > kill off that process without notification in the case the memcg memory > reserves cause a global oom. I would have to see those patches. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] mm, memcg: add oom killer delay
On Tue, 4 Jun 2013, Michal Hocko wrote: > > I'm not sure a userspace oom notifier would want to keep a > > preallocated buffer around that is mlocked in memory for all possible > > lengths of this file. > > Well, an oom handler which allocates memory under the same restricted > memory doesn't make much sense to me. Tracking all kmem allocations > makes it almost impossible to implement a non-trivial handler. > This isn't only about oom notifiers that reside in the same oom memcg, they can be at a higher level or a different subtree entirely. The point is that they can be unresponsive either because userspace is flaky, its oom itself, or we do track all slab. The kernel is the only thing in the position to fix the issue after a sensible user-defined amount of time has elapsed, and that's what this patch is. > OK, maybe I just wasn't clear enough or I am missing your point. Your > users _can_ implement and register their oom handlers. But as your > requirements are rather benevolent for handlers implementation you would > have a global watchdog which would sit on the oom_control of those > groups (which are allowed to have own handlers - all of them in your > case I guess) and trigger (user defined/global) timeout when it gets a > notification. If the group was under oom always during the timeout then > just disable oom_control until oom is settled (under_oom is 0). > > Why wouldn't something like this work for your use case? > For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. > Hohmm, so you are insisting on something that can be implemented in the > userspace and put it into the kernel because it is more convenient for > you and your use case. This doesn't sound like a way for accepting a > feature. > I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Your idea of a timeout implemented in userspace doesn't work in practice: userspace is both untrusted and cannot be guaranteed to be perfect and always wakeup, get the information it does according to its implementation, and issue a SIGKILL. This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other "features" in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel. Real users like this cannot run as root, and we cannot run in the root memcg without charging that memory usage to the user's container for that share of a global resource. Real users do have to tollerate buggy and flaky userspace implementations that cannot be guaranteed to run or do what they are supposed to do. It's too costly of a problem to not address with a failsafe. I speak strictly from experience on this. > And yes we should make memcg oom handling less deadlock prone and > Johannes' work in this thread is a good step forward. The long-term solution to that, which I already have patches for, is something you would cringe even more at: memcg memory reserves that are shared with per-zone memory reserves that get the global oom killer to kill off that process without notification in the case the memcg memory reserves cause a global oom. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue, 4 Jun 2013, Michal Hocko wrote: I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file. Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler. This isn't only about oom notifiers that reside in the same oom memcg, they can be at a higher level or a different subtree entirely. The point is that they can be unresponsive either because userspace is flaky, its oom itself, or we do track all slab. The kernel is the only thing in the position to fix the issue after a sensible user-defined amount of time has elapsed, and that's what this patch is. OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handlers implementation you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have own handlers - all of them in your case I guess) and trigger (user defined/global) timeout when it gets a notification. If the group was under oom always during the timeout then just disable oom_control until oom is settled (under_oom is 0). Why wouldn't something like this work for your use case? For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. Hohmm, so you are insisting on something that can be implemented in the userspace and put it into the kernel because it is more convenient for you and your use case. This doesn't sound like a way for accepting a feature. I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Your idea of a timeout implemented in userspace doesn't work in practice: userspace is both untrusted and cannot be guaranteed to be perfect and always wakeup, get the information it does according to its implementation, and issue a SIGKILL. This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other features in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel. Real users like this cannot run as root, and we cannot run in the root memcg without charging that memory usage to the user's container for that share of a global resource. Real users do have to tollerate buggy and flaky userspace implementations that cannot be guaranteed to run or do what they are supposed to do. It's too costly of a problem to not address with a failsafe. I speak strictly from experience on this. And yes we should make memcg oom handling less deadlock prone and Johannes' work in this thread is a good step forward. The long-term solution to that, which I already have patches for, is something you would cringe even more at: memcg memory reserves that are shared with per-zone memory reserves that get the global oom killer to kill off that process without notification in the case the memcg memory reserves cause a global oom. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 23:40:16, David Rientjes wrote: On Tue, 4 Jun 2013, Michal Hocko wrote: I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file. Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler. This isn't only about oom notifiers that reside in the same oom memcg, they can be at a higher level or a different subtree entirely. The point is that they can be unresponsive either because userspace is flaky, its oom itself, or we do track all slab. The kernel is the only thing in the position to fix the issue after a sensible user-defined amount of time has elapsed, and that's what this patch is. OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handlers implementation you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have own handlers - all of them in your case I guess) and trigger (user defined/global) timeout when it gets a notification. If the group was under oom always during the timeout then just disable oom_control until oom is settled (under_oom is 0). Why wouldn't something like this work for your use case? For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. Yes, but nothing prevents an admin - I hope you trust at least this entity - to do the global watchdog for the fallback mode. So users can play as they like but if they are not able to cope with the oom situation for the defined timeout then the global (trusted and running in the root memcg) watchdog re-enables in-kernel oom handler. Maybe I just do not see why is this setup problem for your setup. Hohmm, so you are insisting on something that can be implemented in the userspace and put it into the kernel because it is more convenient for you and your use case. This doesn't sound like a way for accepting a feature. I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Because you still insist on having this fallback mode running inside untrusted environment AFAIU. Your idea of a timeout implemented in userspace doesn't work in practice: userspace is both untrusted and cannot be guaranteed to be perfect and always wakeup, get the information it does according to its implementation, and issue a SIGKILL. This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other features in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel. But the memcg would deadlock within constrains assigned from somebody trusted. So while the memory is basically wasted the limit assignment already says that somebody (trusted) dedicated that much of memory. So I think disabling oom for ever is not that ridiculous. Real users like this cannot run as root, and we cannot run in the root memcg without charging that memory usage to the user's container for that share of a global resource. Real users do have to tollerate buggy and flaky userspace implementations that cannot be guaranteed to run or do what they are supposed to do. It's too costly of a problem to not address with a failsafe. I speak strictly from experience on this. And yes we should make memcg oom handling less deadlock prone and Johannes' work in this thread is a good step forward. The long-term solution to that, which I already have patches for, is something you would cringe even more at: memcg memory reserves that are shared with per-zone memory reserves that get the global oom killer to kill off that process without notification in the case the memcg memory reserves cause a global oom. I would have to see those patches. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 14:48:52, Johannes Weiner wrote: On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: [...] Now that I am looking at this again I've realized that this is not correct. The task which triggers memcg OOM will not have memcg_oom.memcg set so it would trigger a global OOM in pagefault_out_of_memory. Either we should return CHARGE_RETRY (and propagate it via mem_cgroup_do_charge) for need_to_kill or set up current-memcg_oom also for need_to_kill. You are absolutely right! I fixed it with a separate bit, like so: + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg OOM and that it should always return true to not invoke the global killer. But wait_on_memcg might or might not be set, depending if the task needs to sleep. So the task invoking the memcg OOM killer won't sleep but retry the fault instead (everybody else might be sleeping and the first OOM killer invocation might not have freed anything). OK, I was about to say that you should just return CHARGE_RETRY but that doesn't solve the locking issue because although the task which called mem_cgroup_out_of_memory still holds lock which might prevent the victim from terminating. Maybe worth a note in the changelog? [...] @@ -3760,22 +3761,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) Is this reusable? Who would call this helper or is it just for the code readability? I would probably prefer a smaller patch but I do not have a strong opinion on this. I just figured I'd move all the task-specific setup into the outer function and leave this inner one with the actual page table stuff. The function is not called from somewhere else, but I don't see a problem with splitting it into logical parts and have the individual parts more readable. The alternative would have been to make every return in handle_mm_fault() a { ret = -EFOO; goto out }, and that would have been harder to follow. OK [...] From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but just return -ENOMEM, never sleep. Userspace should be able to handle this. 2. When OOMing in a page fault and somebody else is handling the situation, do not sleep directly in the charging code. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack first before going to sleep. While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. Signed-off-by: Johannes Weiner han...@cmpxchg.org OK this one looks good. Bitfields is not the nicests piece of art but this is just a minor thing. With the remaining archs moved to pagefault_out_of_memory you can add Reviewed-by: Michal Hocko mho...@suse.cz Thanks! --- include/linux/memcontrol.h | 22 +++ include/linux/mm.h | 1 + include/linux/sched.h | 6 ++ mm/ksm.c | 2 +- mm/memcontrol.c| 146 - mm/memory.c| 40 + mm/oom_kill.c | 7 ++- 7 files changed, 154 insertions(+), 70 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d6183f0..b8ec9d1 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -121,6 +121,15 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ + p-memcg_oom.in_userfault = 1; +} +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ +
Re: [patch] mm, memcg: add oom killer delay
On Wed, 5 Jun 2013, Michal Hocko wrote: For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. Yes, but nothing prevents an admin - I hope you trust at least this entity - to do the global watchdog for the fallback mode. So users can play as they like but if they are not able to cope with the oom situation for the defined timeout then the global (trusted and running in the root memcg) watchdog re-enables in-kernel oom handler. Users have the full ability to manipulate their own memcg hierarchy under the root, the global entity that schedules these jobs is not aware of any user subcontainers that are created beneath the user root. These user subcontainers may be oom and our desire is for the user to be able to have their own userspace handling implementation at a higher level (or with memcg memory reserves). Userspace is untrusted, they can't be expected to register an oom notifier for a child memcg with a global resource, they may not care that they deadlock and leave behind gigabytes of memory that can't be freed if they oom. And, if that userspace global resource dies or becomes unresponsive itself for whatever reason, all oom memcgs deadlock. I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Because you still insist on having this fallback mode running inside untrusted environment AFAIU. -ENOPARSE. The failsafe is the kernel, it ensures that memcgs don't sit completely deadlocked for days and weeks and take up resources that can never be freed. The entire purpose of userspace oom notification is so that users can implement their own policy, whatever is implemented in the kernel may not apply (they may want to kill the largest process, the newest, the youngest, one on a priority-based scale, etc). This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other features in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel. But the memcg would deadlock within constrains assigned from somebody trusted. So while the memory is basically wasted the limit assignment already says that somebody (trusted) dedicated that much of memory. So I think disabling oom for ever is not that ridiculous. I don't understand what you're trying to say. Yes, a trusted resource sets the user root's limits and that is their allotted use. To implement a sane userspace oom handler, we need to give it time to respond; my solution is memory.oom_delay_millisecs, your solution is disabling the oom killer for that memcg. Anything else results in an instant oom kill from the kernel. If the user has their own implementation, with today's kernel it is required to disable the oom killer entirely and nothing in that untrusted environment is ever guaranteed to re-enable the oom killer or even have the memory to do so if it wanted. Meanwhile, the trusted resource has no knowledge whatsoever of these user subcontainers and it can't infinitely scan the memcg tree to find them because that requires memory that may not be available because of global oom or because of slab accounting. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 14:48:52, Johannes Weiner wrote: > On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: [...] > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 6dc1882..ff5e2d7 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, > > > struct mm_struct *mm, > > > while (!(page = follow_page_mask(vma, start, > > > foll_flags, _mask))) { > > > int ret; > > > - unsigned int fault_flags = 0; > > > + unsigned int fault_flags = FAULT_FLAG_KERNEL; > > > > > > /* For mlock, just skip the stack guard page. */ > > > if (foll_flags & FOLL_MLOCK) { > > > > This is also a bit tricky. Say there is an unlikely situation when a > > task fails to charge because of memcg OOM, it couldn't lock the oom > > so it ended up with current->memcg_oom set and __get_user_pages will > > turn VM_FAULT_OOM into ENOMEM but memcg_oom is still there. Then the > > following global OOM condition gets confused (well the oom will be > > triggered by somebody else so it shouldn't end up in the endless loop > > but still...), doesn't it? > > But current->memcg_oom is not set up unless current->in_userfault. > And get_user_pages does not set this flag. And my selective blindness strikes again :/ For some reason I have read those places as they enable the fault flag. Which would make some sense if there was a post handling... Anyway, I will get back to the updated patch tomorrow with a clean and fresh head. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: > On Mon 03-06-13 14:30:18, Johannes Weiner wrote: > > On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: > > > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: > [...] > > > > I am just afraid about all the other archs that do not support (from > > > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > > > > openrisc, score and tile). What would be an alternative for them? > > > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > > > > would be acceptable for me. > > > > > > blackfin is NOMMU but I guess the others should be converted to the > > > proper OOM protocol anyway and not just kill the faulting task. I can > > > update them in the next version of the patch (series). > > > > It's no longer necessary since I remove the arch-specific flag > > setting, but I converted them anyway while I was at it. Will send > > them as a separate patch. > > I am still not sure doing this unconditionally is the right way. Say > that a new arch will be added. How the poor implementer knows that memcg > oom handling requires an arch specific code to work properly? It doesn't. All that's required is that it follows the generic OOM killer protocol and calls pagefault_out_of_memory(). Killing the faulting task right there in the page fault handler is simply a bug and needs to be fixed so I don't think that any extra care from our side is necessary. > > @@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup > > *memcg) > > } > > > > /* > > - * try to call OOM killer. returns false if we should exit memory-reclaim > > loop. > > + * try to call OOM killer > > */ > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > > - int order) > > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > { > > - struct oom_wait_info owait; > > - bool locked, need_to_kill; > > - > > - owait.memcg = memcg; > > - owait.wait.flags = 0; > > - owait.wait.func = memcg_oom_wake_function; > > - owait.wait.private = current; > > - INIT_LIST_HEAD(_list); > > - need_to_kill = true; > > - mem_cgroup_mark_under_oom(memcg); > > + bool locked, need_to_kill = true; > > > > /* At first, try to OOM lock hierarchy under memcg.*/ > > spin_lock(_oom_lock); > > locked = mem_cgroup_oom_lock(memcg); > > - /* > > -* Even if signal_pending(), we can't quit charge() loop without > > -* accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL > > -* under OOM is always welcomed, use TASK_KILLABLE here. > > -*/ > > - prepare_to_wait(_oom_waitq, , TASK_KILLABLE); > > - if (!locked || memcg->oom_kill_disable) > > + if (!locked || memcg->oom_kill_disable) { > > need_to_kill = false; > > + if (current->in_userfault) { > > + /* > > +* We start sleeping on the OOM waitqueue only > > +* after unwinding the page fault stack, so > > +* make sure we detect wakeups that happen > > +* between now and then. > > +*/ > > + mem_cgroup_mark_under_oom(memcg); > > + current->memcg_oom.wakeups = > > + atomic_read(>oom_wakeups); > > + css_get(>css); > > + current->memcg_oom.memcg = memcg; > > + } > > + } > > if (locked) > > mem_cgroup_oom_notify(memcg); > > spin_unlock(_oom_lock); > > > > - if (need_to_kill) { > > - finish_wait(_oom_waitq, ); > > + if (need_to_kill) > > mem_cgroup_out_of_memory(memcg, mask, order); > > Now that I am looking at this again I've realized that this > is not correct. The task which triggers memcg OOM will not > have memcg_oom.memcg set so it would trigger a global OOM in > pagefault_out_of_memory. Either we should return CHARGE_RETRY (and > propagate it via mem_cgroup_do_charge) for need_to_kill or set up > current->memcg_oom also for need_to_kill. You are absolutely right! I fixed it with a separate bit, like so: + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg OOM and that it should always return true to not invoke the global killer. But wait_on_memcg might or might not be set, depending if the task needs to sleep. So the task invoking the memcg OOM killer won't sleep but retry the fault instead (everybody else might be sleeping and the first OOM killer invocation might not have freed anything). > > diff --git a/mm/memory.c b/mm/memory.c > > index 6dc1882..ff5e2d7 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 14:17:54, David Rientjes wrote: > On Mon, 3 Jun 2013, Michal Hocko wrote: > > > > What do you suggest when you read the "tasks" file and it returns -ENOMEM > > > because kmalloc() fails because the userspace oom handler's memcg is also > > > oom? > > > > That would require that you track kernel allocations which is currently > > done only for explicit caches. > > > > That will not always be the case, and I think this could be a prerequisite > patch for such support that we have internally. > I'm not sure a userspace oom notifier would want to keep a > preallocated buffer around that is mlocked in memory for all possible > lengths of this file. Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler. > > > Obviously it's not a situation we want to get into, but unless you > > > know that handler's exact memory usage across multiple versions, nothing > > > else is sharing that memcg, and it's a perfect implementation, you can't > > > guarantee it. We need to address real world problems that occur in > > > practice. > > > > If you really need to have such a guarantee then you can have a _global_ > > watchdog observing oom_control of all groups that provide such a vague > > requirements for oom user handlers. > > > > The whole point is to allow the user to implement their own oom policy. OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handlers implementation you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have own handlers - all of them in your case I guess) and trigger (user defined/global) timeout when it gets a notification. If the group was under oom always during the timeout then just disable oom_control until oom is settled (under_oom is 0). Why wouldn't something like this work for your use case? > If the policy was completely encapsulated in kernel code, we don't need to > ever disable the oom killer even with memory.oom_control. Users may > choose to kill the largest process, the newest process, the oldest > process, sacrifice children instead of parents, prevent forkbombs, > implement their own priority scoring (which is what we do), kill the > allocating task, etc. > > To not merge this patch, I'd ask that you show an alternative that allows > users to implement their own userspace oom handlers and not require admin > intervention when things go wrong. Hohmm, so you are insisting on something that can be implemented in the userspace and put it into the kernel because it is more convenient for you and your use case. This doesn't sound like a way for accepting a feature. To make this absolutely clear. I do understand your requirements but you haven't shown any _argument_ why the timeout you are proposing cannot be implemented in the userspace. I will not ack this without this reasoning. And yes we should make memcg oom handling less deadlock prone and Johannes' work in this thread is a good step forward. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 14:30:18, Johannes Weiner wrote: > On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: > > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: [...] > > > I am just afraid about all the other archs that do not support (from > > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > > > openrisc, score and tile). What would be an alternative for them? > > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > > > would be acceptable for me. > > > > blackfin is NOMMU but I guess the others should be converted to the > > proper OOM protocol anyway and not just kill the faulting task. I can > > update them in the next version of the patch (series). > > It's no longer necessary since I remove the arch-specific flag > setting, but I converted them anyway while I was at it. Will send > them as a separate patch. I am still not sure doing this unconditionally is the right way. Say that a new arch will be added. How the poor implementer knows that memcg oom handling requires an arch specific code to work properly? So while I obviously do not have anything agains your conversion of other archs that are in the tree currently I think we need something like CONFIG_OLD_VERSION_MEMCG_OOM which depends on ARCH_HAS_FAULT_OOM_RETRY. [...] > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > > index e692a02..cf60aef 100644 > > > > --- a/include/linux/sched.h > > > > +++ b/include/linux/sched.h > > > > @@ -1282,6 +1282,8 @@ struct task_struct { > > > > * execve */ > > > > unsigned in_iowait:1; > > > > > > > > + unsigned in_userfault:1; > > > > + > > > > > > [This is more a nit pick but before I forget while I am reading through > > > the rest of the patch.] > > > > > > OK there is a lot of room around those bit fields but as this is only > > > for memcg and you are enlarging the structure by the pointer then you > > > can reuse bottom bit of memcg pointer. > > > > I just didn't want to put anything in the arch code that looks too > > memcgish, even though it's the only user right now. But granted, it > > will also probably remain the only user for a while. > > I tried a couple of variants, including using the lowest memcg bit, > but it all turned into more ugliness. So that .in_userfault is still > there in v2, but it's now set in handle_mm_fault() in a generic manner > depending on a fault flag, please reconsider if you can live with it. Sure thing. [...] > From: Johannes Weiner > Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context > > The memcg OOM handling is incredibly fragile because once a memcg goes > OOM, one task (kernel or userspace) is responsible for resolving the > situation. Every other task that gets caught trying to charge memory > gets stuck in a waitqueue while potentially holding various filesystem > and mm locks on which the OOM handling task may now deadlock. > > Do two things: > > 1. When OOMing in a system call (buffered IO and friends), invoke the >OOM killer but do not trap other tasks and just return -ENOMEM for >everyone. Userspace should be able to handle this... right? > > 2. When OOMing in a page fault, invoke the OOM killer but do not trap >other chargers directly in the charging code. Instead, remember >the OOMing memcg in the task struct and then fully unwind the page >fault stack first. Then synchronize the memcg OOM from >pagefault_out_of_memory(). > > While reworking the OOM routine, also remove a needless OOM waitqueue > wakeup when invoking the killer. Only uncharges and limit increases, > things that actually change the memory situation, should do wakeups. > > Signed-off-by: Johannes Weiner > --- > include/linux/memcontrol.h | 6 +++ > include/linux/mm.h | 1 + > include/linux/sched.h | 6 +++ > mm/ksm.c | 2 +- > mm/memcontrol.c| 117 > +++-- > mm/memory.c| 40 +++- > mm/oom_kill.c | 7 ++- > 7 files changed, 108 insertions(+), 71 deletions(-) > [...] > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index de22292..97cf32b 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c [...] > @@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup > *memcg) > } > > /* > - * try to call OOM killer. returns false if we should exit memory-reclaim > loop. > + * try to call OOM killer > */ > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > - int order) > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > { > - struct oom_wait_info owait; > - bool locked, need_to_kill; > - > - owait.memcg = memcg; > - owait.wait.flags = 0; > - owait.wait.func = memcg_oom_wake_function; > - owait.wait.private = current; > -
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 14:30:18, Johannes Weiner wrote: On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: [...] I am just afraid about all the other archs that do not support (from quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me. blackfin is NOMMU but I guess the others should be converted to the proper OOM protocol anyway and not just kill the faulting task. I can update them in the next version of the patch (series). It's no longer necessary since I remove the arch-specific flag setting, but I converted them anyway while I was at it. Will send them as a separate patch. I am still not sure doing this unconditionally is the right way. Say that a new arch will be added. How the poor implementer knows that memcg oom handling requires an arch specific code to work properly? So while I obviously do not have anything agains your conversion of other archs that are in the tree currently I think we need something like CONFIG_OLD_VERSION_MEMCG_OOM which depends on ARCH_HAS_FAULT_OOM_RETRY. [...] diff --git a/include/linux/sched.h b/include/linux/sched.h index e692a02..cf60aef 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1282,6 +1282,8 @@ struct task_struct { * execve */ unsigned in_iowait:1; + unsigned in_userfault:1; + [This is more a nit pick but before I forget while I am reading through the rest of the patch.] OK there is a lot of room around those bit fields but as this is only for memcg and you are enlarging the structure by the pointer then you can reuse bottom bit of memcg pointer. I just didn't want to put anything in the arch code that looks too memcgish, even though it's the only user right now. But granted, it will also probably remain the only user for a while. I tried a couple of variants, including using the lowest memcg bit, but it all turned into more ugliness. So that .in_userfault is still there in v2, but it's now set in handle_mm_fault() in a generic manner depending on a fault flag, please reconsider if you can live with it. Sure thing. [...] From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but do not trap other tasks and just return -ENOMEM for everyone. Userspace should be able to handle this... right? 2. When OOMing in a page fault, invoke the OOM killer but do not trap other chargers directly in the charging code. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory(). While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. Signed-off-by: Johannes Weiner han...@cmpxchg.org --- include/linux/memcontrol.h | 6 +++ include/linux/mm.h | 1 + include/linux/sched.h | 6 +++ mm/ksm.c | 2 +- mm/memcontrol.c| 117 +++-- mm/memory.c| 40 +++- mm/oom_kill.c | 7 ++- 7 files changed, 108 insertions(+), 71 deletions(-) [...] diff --git a/mm/memcontrol.c b/mm/memcontrol.c index de22292..97cf32b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c [...] @@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. + * try to call OOM killer */ -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, - int order) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.memcg = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked,
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 14:17:54, David Rientjes wrote: On Mon, 3 Jun 2013, Michal Hocko wrote: What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? That would require that you track kernel allocations which is currently done only for explicit caches. That will not always be the case, and I think this could be a prerequisite patch for such support that we have internally. I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file. Well, an oom handler which allocates memory under the same restricted memory doesn't make much sense to me. Tracking all kmem allocations makes it almost impossible to implement a non-trivial handler. Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice. If you really need to have such a guarantee then you can have a _global_ watchdog observing oom_control of all groups that provide such a vague requirements for oom user handlers. The whole point is to allow the user to implement their own oom policy. OK, maybe I just wasn't clear enough or I am missing your point. Your users _can_ implement and register their oom handlers. But as your requirements are rather benevolent for handlers implementation you would have a global watchdog which would sit on the oom_control of those groups (which are allowed to have own handlers - all of them in your case I guess) and trigger (user defined/global) timeout when it gets a notification. If the group was under oom always during the timeout then just disable oom_control until oom is settled (under_oom is 0). Why wouldn't something like this work for your use case? If the policy was completely encapsulated in kernel code, we don't need to ever disable the oom killer even with memory.oom_control. Users may choose to kill the largest process, the newest process, the oldest process, sacrifice children instead of parents, prevent forkbombs, implement their own priority scoring (which is what we do), kill the allocating task, etc. To not merge this patch, I'd ask that you show an alternative that allows users to implement their own userspace oom handlers and not require admin intervention when things go wrong. Hohmm, so you are insisting on something that can be implemented in the userspace and put it into the kernel because it is more convenient for you and your use case. This doesn't sound like a way for accepting a feature. To make this absolutely clear. I do understand your requirements but you haven't shown any _argument_ why the timeout you are proposing cannot be implemented in the userspace. I will not ack this without this reasoning. And yes we should make memcg oom handling less deadlock prone and Johannes' work in this thread is a good step forward. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: On Mon 03-06-13 14:30:18, Johannes Weiner wrote: On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: [...] I am just afraid about all the other archs that do not support (from quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me. blackfin is NOMMU but I guess the others should be converted to the proper OOM protocol anyway and not just kill the faulting task. I can update them in the next version of the patch (series). It's no longer necessary since I remove the arch-specific flag setting, but I converted them anyway while I was at it. Will send them as a separate patch. I am still not sure doing this unconditionally is the right way. Say that a new arch will be added. How the poor implementer knows that memcg oom handling requires an arch specific code to work properly? It doesn't. All that's required is that it follows the generic OOM killer protocol and calls pagefault_out_of_memory(). Killing the faulting task right there in the page fault handler is simply a bug and needs to be fixed so I don't think that any extra care from our side is necessary. @@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. + * try to call OOM killer */ -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, - int order) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.memcg = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked, need_to_kill = true; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* -* Even if signal_pending(), we can't quit charge() loop without -* accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL -* under OOM is always welcomed, use TASK_KILLABLE here. -*/ - prepare_to_wait(memcg_oom_waitq, owait.wait, TASK_KILLABLE); - if (!locked || memcg-oom_kill_disable) + if (!locked || memcg-oom_kill_disable) { need_to_kill = false; + if (current-in_userfault) { + /* +* We start sleeping on the OOM waitqueue only +* after unwinding the page fault stack, so +* make sure we detect wakeups that happen +* between now and then. +*/ + mem_cgroup_mark_under_oom(memcg); + current-memcg_oom.wakeups = + atomic_read(memcg-oom_wakeups); + css_get(memcg-css); + current-memcg_oom.memcg = memcg; + } + } if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(memcg_oom_lock); - if (need_to_kill) { - finish_wait(memcg_oom_waitq, owait.wait); + if (need_to_kill) mem_cgroup_out_of_memory(memcg, mask, order); Now that I am looking at this again I've realized that this is not correct. The task which triggers memcg OOM will not have memcg_oom.memcg set so it would trigger a global OOM in pagefault_out_of_memory. Either we should return CHARGE_RETRY (and propagate it via mem_cgroup_do_charge) for need_to_kill or set up current-memcg_oom also for need_to_kill. You are absolutely right! I fixed it with a separate bit, like so: + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg OOM and that it should always return true to not invoke the global killer. But wait_on_memcg might or might not be set, depending if the task needs to sleep. So the task invoking the memcg OOM killer won't sleep but retry the fault instead (everybody else might be sleeping and the first OOM killer invocation might not have freed anything). diff --git a/mm/memory.c b/mm/memory.c index 6dc1882..ff5e2d7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, while (!(page =
Re: [patch] mm, memcg: add oom killer delay
On Tue 04-06-13 14:48:52, Johannes Weiner wrote: On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote: [...] diff --git a/mm/memory.c b/mm/memory.c index 6dc1882..ff5e2d7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, while (!(page = follow_page_mask(vma, start, foll_flags, page_mask))) { int ret; - unsigned int fault_flags = 0; + unsigned int fault_flags = FAULT_FLAG_KERNEL; /* For mlock, just skip the stack guard page. */ if (foll_flags FOLL_MLOCK) { This is also a bit tricky. Say there is an unlikely situation when a task fails to charge because of memcg OOM, it couldn't lock the oom so it ended up with current-memcg_oom set and __get_user_pages will turn VM_FAULT_OOM into ENOMEM but memcg_oom is still there. Then the following global OOM condition gets confused (well the oom will be triggered by somebody else so it shouldn't end up in the endless loop but still...), doesn't it? But current-memcg_oom is not set up unless current-in_userfault. And get_user_pages does not set this flag. And my selective blindness strikes again :/ For some reason I have read those places as they enable the fault flag. Which would make some sense if there was a post handling... Anyway, I will get back to the updated patch tomorrow with a clean and fresh head. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 12:09:22PM -0700, David Rientjes wrote: > On Mon, 3 Jun 2013, Johannes Weiner wrote: > > > > It's not necessarily harder if you assign the userspace oom handlers to > > > the root of your subtree with access to more memory than the children. > > > There is no "inability" to write a proper handler, but when you have > > > dozens of individual users implementing their own userspace handlers with > > > changing memcg limits over time, then you might find it hard to have > > > perfection every time. If we had perfection, we wouldn't have to worry > > > about oom in the first place. We can't just let these gazillion memcgs > > > sit spinning forever because they get stuck, either. That's why we've > > > used this solution for years as a failsafe. Disabling the oom killer > > > entirely, even for a memcg, is ridiculous, and if you don't have a grace > > > period then oom handlers themselves just don't work. > > > > It's only ridiculous if your OOM handler is subject to the OOM > > situation it's trying to handle. > > > > You're suggesting the oom handler can't be subject to its own memcg > limits, independent of the memcg it is handling? If we demand that such a > handler be attached to the root memcg, that breaks the memory isolation > that memcg provides. We constrain processes to a memcg for the purposes > of that isolation, so it cannot use more resources than allotted. I guess the traditional use case is that you have a job manager that you trust, that sets up the groups, configures their limits etc. and would also know more about the jobs than the kernel to act as the OOM killer as well. Since it's trusted and not expected to consume any significant amounts of memory by itself, the memcg code assumes that it does not run in a cgroup itself, it's just not thought of as being part of the application class. I'm not saying it should necessarily stay that way, but it's also not a completely ridiculous model. > > What we could do is allow one task in the group to be the dedicated > > OOM handler. If we catch this task in the charge path during an OOM > > situation, we fall back to the kernel OOM handler. > > > > I'm not sure it even makes sense to have more than one oom handler per > memcg and the synchronization that requires in userspace to get the right > result, so I didn't consider dedicating a single oom handler. That would > be an entirely new interface, though, since we may have multiple processes > waiting on memory.oom_control that aren't necessarily handlers; they grab > a snapshot of memory, do logging, etc. It will probably be hard to extend the existing oom_control interface. But we could add a separate one where you put in a pid or 0 (kernel) instead of a boolean value, which then enables or disables the userspace OOM handling task for the whole subtree. If that task enters the OOM path, the kernel handler is invoked. If the task dies, the kernel handler is permanently re-enabled. Everybody is free to poll that interface for OOM notifications, not only that one task. Combined with the "enter waitqueue after unwinding page fault stack" patch, would this fully cover your usecase? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
> From: Johannes Weiner > Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context > > The memcg OOM handling is incredibly fragile because once a memcg goes > OOM, one task (kernel or userspace) is responsible for resolving the > situation. Every other task that gets caught trying to charge memory > gets stuck in a waitqueue while potentially holding various filesystem > and mm locks on which the OOM handling task may now deadlock. > > Do two things: > > 1. When OOMing in a system call (buffered IO and friends), invoke the >OOM killer but do not trap other tasks and just return -ENOMEM for >everyone. Userspace should be able to handle this... right? > > 2. When OOMing in a page fault, invoke the OOM killer but do not trap >other chargers directly in the charging code. Instead, remember >the OOMing memcg in the task struct and then fully unwind the page >fault stack first. Then synchronize the memcg OOM from >pagefault_out_of_memory(). > > While reworking the OOM routine, also remove a needless OOM waitqueue > wakeup when invoking the killer. Only uncharges and limit increases, > things that actually change the memory situation, should do wakeups. > > Signed-off-by: Johannes Weiner >From point of the memcg oom notification view, it is NOT supported on the case that an oom handler process is subjected its own memcg limit. All memcg developers clearly agreed it since it began. Even though, anyway, people have a right to shoot their own foot. :) However, this patch fixes more than that. OK, I prefer it. Good fix! Acked-by: KOSAKI Motohiro -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, 3 Jun 2013, Michal Hocko wrote: > > What do you suggest when you read the "tasks" file and it returns -ENOMEM > > because kmalloc() fails because the userspace oom handler's memcg is also > > oom? > > That would require that you track kernel allocations which is currently > done only for explicit caches. > That will not always be the case, and I think this could be a prerequisite patch for such support that we have internally. I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file. > > Obviously it's not a situation we want to get into, but unless you > > know that handler's exact memory usage across multiple versions, nothing > > else is sharing that memcg, and it's a perfect implementation, you can't > > guarantee it. We need to address real world problems that occur in > > practice. > > If you really need to have such a guarantee then you can have a _global_ > watchdog observing oom_control of all groups that provide such a vague > requirements for oom user handlers. > The whole point is to allow the user to implement their own oom policy. If the policy was completely encapsulated in kernel code, we don't need to ever disable the oom killer even with memory.oom_control. Users may choose to kill the largest process, the newest process, the oldest process, sacrifice children instead of parents, prevent forkbombs, implement their own priority scoring (which is what we do), kill the allocating task, etc. To not merge this patch, I'd ask that you show an alternative that allows users to implement their own userspace oom handlers and not require admin intervention when things go wrong. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 11:18:09, David Rientjes wrote: > On Sat, 1 Jun 2013, Michal Hocko wrote: [...] > > I still do not see why you cannot simply read tasks file into a > > preallocated buffer. This would be few pages even for thousands of pids. > > You do not have to track processes as they come and go. > > > > What do you suggest when you read the "tasks" file and it returns -ENOMEM > because kmalloc() fails because the userspace oom handler's memcg is also > oom? That would require that you track kernel allocations which is currently done only for explicit caches. > Obviously it's not a situation we want to get into, but unless you > know that handler's exact memory usage across multiple versions, nothing > else is sharing that memcg, and it's a perfect implementation, you can't > guarantee it. We need to address real world problems that occur in > practice. If you really need to have such a guarantee then you can have a _global_ watchdog observing oom_control of all groups that provide such a vague requirements for oom user handlers. > > As I said before. oom_delay_millisecs is actually really easy to be done > > from userspace. If you really need a safety break then you can register > > such a handler as a fallback. I am not familiar with eventfd internals > > much but I guess that multiple handlers are possible. The fallback might > > be enforeced by the admin (when a new group is created) or by the > > container itself. Would something like this work for your use case? > > > > You're suggesting another userspace process that solely waits for a set > duration and then reenables the oom killer? Yes which kicks the oom killer. > It faces all the same problems as the true userspace oom handler: it's > own perfect implementation and it's own memcg constraints. But that solution might be implemented as a global policy living in a group with some reservations. [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, 3 Jun 2013, Johannes Weiner wrote: > > It's not necessarily harder if you assign the userspace oom handlers to > > the root of your subtree with access to more memory than the children. > > There is no "inability" to write a proper handler, but when you have > > dozens of individual users implementing their own userspace handlers with > > changing memcg limits over time, then you might find it hard to have > > perfection every time. If we had perfection, we wouldn't have to worry > > about oom in the first place. We can't just let these gazillion memcgs > > sit spinning forever because they get stuck, either. That's why we've > > used this solution for years as a failsafe. Disabling the oom killer > > entirely, even for a memcg, is ridiculous, and if you don't have a grace > > period then oom handlers themselves just don't work. > > It's only ridiculous if your OOM handler is subject to the OOM > situation it's trying to handle. > You're suggesting the oom handler can't be subject to its own memcg limits, independent of the memcg it is handling? If we demand that such a handler be attached to the root memcg, that breaks the memory isolation that memcg provides. We constrain processes to a memcg for the purposes of that isolation, so it cannot use more resources than allotted. > Don't act as if the oom disabling semantics were unreasonable or > inconsistent with the rest of the system, memcgs were not really meant > to be self-policed by the tasks running in them. That's why we have > the waitqueue, so that everybody sits there and waits until an outside > force resolves the situation. There is nothing wrong with that, you > just have a new requirement. > The waitqueue doesn't solve anything with regard to the memory, if the memcg sits there and deadlocks forever then it is using resources (memory, not cpu) that will never be freed. > > I'm talking about the memory the kernel allocates when reading the "tasks" > > file, not userspace. This can, and will, return -ENOMEM. > > Do you mean due to kmem limitations? > Yes. > What we could do is allow one task in the group to be the dedicated > OOM handler. If we catch this task in the charge path during an OOM > situation, we fall back to the kernel OOM handler. > I'm not sure it even makes sense to have more than one oom handler per memcg and the synchronization that requires in userspace to get the right result, so I didn't consider dedicating a single oom handler. That would be an entirely new interface, though, since we may have multiple processes waiting on memory.oom_control that aren't necessarily handlers; they grab a snapshot of memory, do logging, etc. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 11:18:09AM -0700, David Rientjes wrote: > On Sat, 1 Jun 2013, Michal Hocko wrote: > > OK, I assume those groups are generally untrusted, right? So you cannot > > let them register their oom handler even via an admin interface. This > > makes it a bit complicated because it makes much harder demands on the > > handler itself as it has to run under restricted environment. > > That's the point of the patch. We want to allow users to register their > own oom handler in a subtree (they may attach it to their own subtree root > and wait on memory.oom_control of a child memcg with a limit less than > that root) but not insist on an absolutely perfect implementation that can > never fail when you run on many, many servers. Userspace implementations > do fail sometimes, we just accept that. > > > I still do not see why you cannot simply read tasks file into a > > preallocated buffer. This would be few pages even for thousands of pids. > > You do not have to track processes as they come and go. > > What do you suggest when you read the "tasks" file and it returns -ENOMEM > because kmalloc() fails because the userspace oom handler's memcg is also > oom? Obviously it's not a situation we want to get into, but unless you > know that handler's exact memory usage across multiple versions, nothing > else is sharing that memcg, and it's a perfect implementation, you can't > guarantee it. We need to address real world problems that occur in > practice. > > > As I said before. oom_delay_millisecs is actually really easy to be done > > from userspace. If you really need a safety break then you can register > > such a handler as a fallback. I am not familiar with eventfd internals > > much but I guess that multiple handlers are possible. The fallback might > > be enforeced by the admin (when a new group is created) or by the > > container itself. Would something like this work for your use case? > > You're suggesting another userspace process that solely waits for a set > duration and then reenables the oom killer? It faces all the same > problems as the true userspace oom handler: it's own perfect > implementation and it's own memcg constraints. > > > > If that user is constrained to his or her own subtree, as previously > > > stated, there's also no way to login and rectify the situation at that > > > point and requires admin intervention or a reboot. > > > > Yes, insisting on the same subtree makes the life much harder for oom > > handlers. I totally agree with you on that. I just feel that introducing > > a new knob to workaround user "inability" to write a proper handler > > (what ever that means) is not justified. > > It's not necessarily harder if you assign the userspace oom handlers to > the root of your subtree with access to more memory than the children. > There is no "inability" to write a proper handler, but when you have > dozens of individual users implementing their own userspace handlers with > changing memcg limits over time, then you might find it hard to have > perfection every time. If we had perfection, we wouldn't have to worry > about oom in the first place. We can't just let these gazillion memcgs > sit spinning forever because they get stuck, either. That's why we've > used this solution for years as a failsafe. Disabling the oom killer > entirely, even for a memcg, is ridiculous, and if you don't have a grace > period then oom handlers themselves just don't work. It's only ridiculous if your OOM handler is subject to the OOM situation it's trying to handle. Don't act as if the oom disabling semantics were unreasonable or inconsistent with the rest of the system, memcgs were not really meant to be self-policed by the tasks running in them. That's why we have the waitqueue, so that everybody sits there and waits until an outside force resolves the situation. There is nothing wrong with that, you just have a new requirement. > > > Then why does "cat tasks" stall when my memcg is totally depleted of all > > > memory? > > > > if you run it like this then cat obviously needs some charged > > allocations. If you had a proper handler which mlocks its buffer for the > > read syscall then you shouldn't require any allocation at the oom time. > > This shouldn't be that hard to do without too much memory overhead. As I > > said we are talking about few (dozens) of pages per handler. > > > > I'm talking about the memory the kernel allocates when reading the "tasks" > file, not userspace. This can, and will, return -ENOMEM. Do you mean due to kmem limitations? What we could do is allow one task in the group to be the dedicated OOM handler. If we catch this task in the charge path during an OOM situation, we fall back to the kernel OOM handler. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: > > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > > [...] > > > From: Johannes Weiner > > > Subject: [PATCH] memcg: more robust oom handling > > > > > > The memcg OOM handling is incredibly fragile because once a memcg goes > > > OOM, one task (kernel or userspace) is responsible for resolving the > > > situation. Every other task that gets caught trying to charge memory > > > gets stuck in a waitqueue while potentially holding various filesystem > > > and mm locks on which the OOM handling task may now deadlock. > > > > > > Do two things to charge attempts under OOM: > > > > > > 1. Do not trap system calls (buffered IO and friends), just return > > >-ENOMEM. Userspace should be able to handle this... right? > > > > > > 2. Do not trap page faults directly in the charging context. Instead, > > >remember the OOMing memcg in the task struct and fully unwind the > > >page fault stack first. Then synchronize the memcg OOM from > > >pagefault_out_of_memory() > > > > I think this should work and I really like it! Nice work Johannes, I > > never dared to go that deep and my opposite approach was also much more > > fragile. > > > > I am just afraid about all the other archs that do not support (from > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > > openrisc, score and tile). What would be an alternative for them? > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > > would be acceptable for me. > > blackfin is NOMMU but I guess the others should be converted to the > proper OOM protocol anyway and not just kill the faulting task. I can > update them in the next version of the patch (series). It's no longer necessary since I remove the arch-specific flag setting, but I converted them anyway while I was at it. Will send them as a separate patch. > > > Not-quite-signed-off-by: Johannes Weiner > > > --- > > > arch/x86/mm/fault.c| 2 + > > > include/linux/memcontrol.h | 6 +++ > > > include/linux/sched.h | 6 +++ > > > mm/memcontrol.c| 104 > > > + > > > mm/oom_kill.c | 7 ++- > > > 5 files changed, 78 insertions(+), 47 deletions(-) > > > > > [...] > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > index e692a02..cf60aef 100644 > > > --- a/include/linux/sched.h > > > +++ b/include/linux/sched.h > > > @@ -1282,6 +1282,8 @@ struct task_struct { > > >* execve */ > > > unsigned in_iowait:1; > > > > > > + unsigned in_userfault:1; > > > + > > > > [This is more a nit pick but before I forget while I am reading through > > the rest of the patch.] > > > > OK there is a lot of room around those bit fields but as this is only > > for memcg and you are enlarging the structure by the pointer then you > > can reuse bottom bit of memcg pointer. > > I just didn't want to put anything in the arch code that looks too > memcgish, even though it's the only user right now. But granted, it > will also probably remain the only user for a while. I tried a couple of variants, including using the lowest memcg bit, but it all turned into more ugliness. So that .in_userfault is still there in v2, but it's now set in handle_mm_fault() in a generic manner depending on a fault flag, please reconsider if you can live with it. > > > @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup > > > *memcg) > > > } > > > > > > /* > > > - * try to call OOM killer. returns false if we should exit > > > memory-reclaim loop. > > > + * try to call OOM killer > > > */ > > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > > > - int order) > > > +static void mem_cgroup_oom(struct mem_cgroup *memcg, > > > +gfp_t mask, int order, > > > +bool in_userfault) > > > { > > > - struct oom_wait_info owait; > > > - bool locked, need_to_kill; > > > - > > > - owait.memcg = memcg; > > > - owait.wait.flags = 0; > > > - owait.wait.func = memcg_oom_wake_function; > > > - owait.wait.private = current; > > > - INIT_LIST_HEAD(_list); > > > - need_to_kill = true; > > > - mem_cgroup_mark_under_oom(memcg); > > > + bool locked, need_to_kill = true; > > > > > > /* At first, try to OOM lock hierarchy under memcg.*/ > > > spin_lock(_oom_lock); > > > locked = mem_cgroup_oom_lock(memcg); > > > - /* > > > - * Even if signal_pending(), we can't quit charge() loop without > > > - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL > > > - * under OOM is always welcomed, use TASK_KILLABLE here. > > > - */ > > > - prepare_to_wait(_oom_waitq, , TASK_KILLABLE); > > > - if (!locked || memcg->oom_kill_disable) > > > + if (!locked || memcg->oom_kill_disable) { > > > need_to_kill = false; > > > +
Re: [patch] mm, memcg: add oom killer delay
On Sat, 1 Jun 2013, Michal Hocko wrote: > > Users obviously don't have the ability to attach processes to the root > > memcg. They are constrained to their own subtree of memcgs. > > OK, I assume those groups are generally untrusted, right? So you cannot > let them register their oom handler even via an admin interface. This > makes it a bit complicated because it makes much harder demands on the > handler itself as it has to run under restricted environment. > That's the point of the patch. We want to allow users to register their own oom handler in a subtree (they may attach it to their own subtree root and wait on memory.oom_control of a child memcg with a limit less than that root) but not insist on an absolutely perfect implementation that can never fail when you run on many, many servers. Userspace implementations do fail sometimes, we just accept that. > I still do not see why you cannot simply read tasks file into a > preallocated buffer. This would be few pages even for thousands of pids. > You do not have to track processes as they come and go. > What do you suggest when you read the "tasks" file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice. > As I said before. oom_delay_millisecs is actually really easy to be done > from userspace. If you really need a safety break then you can register > such a handler as a fallback. I am not familiar with eventfd internals > much but I guess that multiple handlers are possible. The fallback might > be enforeced by the admin (when a new group is created) or by the > container itself. Would something like this work for your use case? > You're suggesting another userspace process that solely waits for a set duration and then reenables the oom killer? It faces all the same problems as the true userspace oom handler: it's own perfect implementation and it's own memcg constraints. > > If that user is constrained to his or her own subtree, as previously > > stated, there's also no way to login and rectify the situation at that > > point and requires admin intervention or a reboot. > > Yes, insisting on the same subtree makes the life much harder for oom > handlers. I totally agree with you on that. I just feel that introducing > a new knob to workaround user "inability" to write a proper handler > (what ever that means) is not justified. > It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no "inability" to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work. > > Then why does "cat tasks" stall when my memcg is totally depleted of all > > memory? > > if you run it like this then cat obviously needs some charged > allocations. If you had a proper handler which mlocks its buffer for the > read syscall then you shouldn't require any allocation at the oom time. > This shouldn't be that hard to do without too much memory overhead. As I > said we are talking about few (dozens) of pages per handler. > I'm talking about the memory the kernel allocates when reading the "tasks" file, not userspace. This can, and will, return -ENOMEM. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 12:48:39, Johannes Weiner wrote: > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: > > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > > [...] > > > From: Johannes Weiner > > > Subject: [PATCH] memcg: more robust oom handling > > > > > > The memcg OOM handling is incredibly fragile because once a memcg goes > > > OOM, one task (kernel or userspace) is responsible for resolving the > > > situation. Every other task that gets caught trying to charge memory > > > gets stuck in a waitqueue while potentially holding various filesystem > > > and mm locks on which the OOM handling task may now deadlock. > > > > > > Do two things to charge attempts under OOM: > > > > > > 1. Do not trap system calls (buffered IO and friends), just return > > >-ENOMEM. Userspace should be able to handle this... right? > > > > > > 2. Do not trap page faults directly in the charging context. Instead, > > >remember the OOMing memcg in the task struct and fully unwind the > > >page fault stack first. Then synchronize the memcg OOM from > > >pagefault_out_of_memory() > > > > I think this should work and I really like it! Nice work Johannes, I > > never dared to go that deep and my opposite approach was also much more > > fragile. > > > > I am just afraid about all the other archs that do not support (from > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > > openrisc, score and tile). What would be an alternative for them? > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > > would be acceptable for me. > > blackfin is NOMMU but I guess the others should be converted to the > proper OOM protocol anyway and not just kill the faulting task. I can > update them in the next version of the patch (series). OK, if you are willing to convert them all then even better. > > > Not-quite-signed-off-by: Johannes Weiner > > > --- > > > arch/x86/mm/fault.c| 2 + > > > include/linux/memcontrol.h | 6 +++ > > > include/linux/sched.h | 6 +++ > > > mm/memcontrol.c| 104 > > > + > > > mm/oom_kill.c | 7 ++- > > > 5 files changed, 78 insertions(+), 47 deletions(-) > > > > > [...] > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > index e692a02..cf60aef 100644 > > > --- a/include/linux/sched.h > > > +++ b/include/linux/sched.h > > > @@ -1282,6 +1282,8 @@ struct task_struct { > > >* execve */ > > > unsigned in_iowait:1; > > > > > > + unsigned in_userfault:1; > > > + > > > > [This is more a nit pick but before I forget while I am reading through > > the rest of the patch.] > > > > OK there is a lot of room around those bit fields but as this is only > > for memcg and you are enlarging the structure by the pointer then you > > can reuse bottom bit of memcg pointer. > > I just didn't want to put anything in the arch code that looks too > memcgish, even though it's the only user right now. But granted, it > will also probably remain the only user for a while. OK, no objection. I would just use a more generic name. Something like memcg_oom_can_block. [...`] > > > if (locked) > > > mem_cgroup_oom_notify(memcg); > > > spin_unlock(_oom_lock); > > > > > > if (need_to_kill) { > > > - finish_wait(_oom_waitq, ); > > > mem_cgroup_out_of_memory(memcg, mask, order); > > > - } else { > > > - schedule(); > > > - finish_wait(_oom_waitq, ); > > > + memcg_oom_recover(memcg); > > > > Why do we need to call memcg_oom_recover here? We do not know that any > > charges have been released. Say mem_cgroup_out_of_memory selected a task > > which migrated to our group (without its charges) so we would kill the > > poor guy and free no memory from this group. > > Now you wake up oom waiters to refault but they will end up in the same > > situation. I think it should be sufficient to wait for memcg_oom_recover > > until the memory is uncharged (which we do already). > > It's a leftover from how it was before (see the memcg_wakeup_oom > below), but you are right, we can get rid of it. right, I have missed that. Then it would deserve a note in the changelog. Something like " memcg_wakeup_oom should be removed because there is no guarantee that mem_cgroup_out_of_memory was able to free any memory. It could have killed a task which doesn't have any charges from the group. Waiters should be woken up by memcg_oom_recover during uncharge or a limit change. " [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Andrew Morton wrote: > > Admins may set the oom killer delay using the new interface: > > > > # echo 6 > memory.oom_delay_millisecs > > > > This will defer oom killing to the kernel only after 60 seconds has > > elapsed by putting the task to sleep for 60 seconds. > > How often is that delay actually useful, in the real world? > > IOW, in what proportion of cases does the system just remain stuck for > 60 seconds and then get an oom-killing? > It wouldn't be the system, it would just be the oom memcg that would be stuck. We actually use 10s by default, but it's adjustable for users in their own memcg hierarchies. It gives just enough time for userspace to deal with the situation and then defer to the kernel if it's unresponsive, this tends to happen quite regularly when you have many, many servers. Same situation if the userspace oom handler has died and isn't running, perhaps because of its own memory constraints (everything on our systems is memory constrained). Obviously it isn't going to reenable the oom killer before it dies from SIGSEGV. I'd argue that the current functionality that allows users to disable the oom killer for a memcg indefinitely is a bit ridiculous. It requires admin intervention to fix such a state and it would be pointless to have an oom memcg for a week, a month, a year, just completely deadlocked on making forward progress and consuming resources. memory.oom_delay_millisecs in my patch is limited to MAX_SCHEDULE_TIMEOUT just as a sanity check since we currently allow indefinite oom killer disabling. I think if we were to rethink disabling the oom killer entirely via memory.oom_control and realize such a condition over a prolonged period is insane then this memory.oom_delay_millisecs ceiling would be better defined as something in minutes. At the same time, we really like userspace oom notifications so users can implement their own handlers. So where's the compromise between instantly oom killing something and waiting forever for userspace to respond? My suggestion is memory.oom_delay_millisecs. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 06:31:34PM +0200, Michal Hocko wrote: > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > > @@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg) > > { > > /* for filtering, pass "memcg" as argument. */ > > __wake_up(_oom_waitq, TASK_NORMAL, 0, memcg); > > + atomic_inc(>oom_wakeups); > > } > > > > static void memcg_oom_recover(struct mem_cgroup *memcg) > [...] > > + prepare_to_wait(_oom_waitq, , TASK_KILLABLE); > > + /* Only sleep if we didn't miss any wakeups since OOM */ > > + if (atomic_read(>oom_wakeups) == current->memcg_oom.wakeups) > > + schedule(); > > On the way home it occured to me that the ordering might be wrong here. > The wake up can be lost here. > __wake_up(memcg_oom_waitq) > > prepare_to_wait > atomic_read(>oom_wakeups) > atomic_inc(oom_wakeups) > > I guess we want atomic_inc before __wake_up, right? I think you are right, thanks for spotting this. Will be fixed in version 2. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > [...] > > From: Johannes Weiner > > Subject: [PATCH] memcg: more robust oom handling > > > > The memcg OOM handling is incredibly fragile because once a memcg goes > > OOM, one task (kernel or userspace) is responsible for resolving the > > situation. Every other task that gets caught trying to charge memory > > gets stuck in a waitqueue while potentially holding various filesystem > > and mm locks on which the OOM handling task may now deadlock. > > > > Do two things to charge attempts under OOM: > > > > 1. Do not trap system calls (buffered IO and friends), just return > >-ENOMEM. Userspace should be able to handle this... right? > > > > 2. Do not trap page faults directly in the charging context. Instead, > >remember the OOMing memcg in the task struct and fully unwind the > >page fault stack first. Then synchronize the memcg OOM from > >pagefault_out_of_memory() > > I think this should work and I really like it! Nice work Johannes, I > never dared to go that deep and my opposite approach was also much more > fragile. > > I am just afraid about all the other archs that do not support (from > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, > openrisc, score and tile). What would be an alternative for them? > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This > would be acceptable for me. blackfin is NOMMU but I guess the others should be converted to the proper OOM protocol anyway and not just kill the faulting task. I can update them in the next version of the patch (series). > > Not-quite-signed-off-by: Johannes Weiner > > --- > > arch/x86/mm/fault.c| 2 + > > include/linux/memcontrol.h | 6 +++ > > include/linux/sched.h | 6 +++ > > mm/memcontrol.c| 104 > > + > > mm/oom_kill.c | 7 ++- > > 5 files changed, 78 insertions(+), 47 deletions(-) > > > [...] > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > index e692a02..cf60aef 100644 > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -1282,6 +1282,8 @@ struct task_struct { > > * execve */ > > unsigned in_iowait:1; > > > > + unsigned in_userfault:1; > > + > > [This is more a nit pick but before I forget while I am reading through > the rest of the patch.] > > OK there is a lot of room around those bit fields but as this is only > for memcg and you are enlarging the structure by the pointer then you > can reuse bottom bit of memcg pointer. I just didn't want to put anything in the arch code that looks too memcgish, even though it's the only user right now. But granted, it will also probably remain the only user for a while. > > @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup > > *memcg) > > } > > > > /* > > - * try to call OOM killer. returns false if we should exit memory-reclaim > > loop. > > + * try to call OOM killer > > */ > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > > - int order) > > +static void mem_cgroup_oom(struct mem_cgroup *memcg, > > + gfp_t mask, int order, > > + bool in_userfault) > > { > > - struct oom_wait_info owait; > > - bool locked, need_to_kill; > > - > > - owait.memcg = memcg; > > - owait.wait.flags = 0; > > - owait.wait.func = memcg_oom_wake_function; > > - owait.wait.private = current; > > - INIT_LIST_HEAD(_list); > > - need_to_kill = true; > > - mem_cgroup_mark_under_oom(memcg); > > + bool locked, need_to_kill = true; > > > > /* At first, try to OOM lock hierarchy under memcg.*/ > > spin_lock(_oom_lock); > > locked = mem_cgroup_oom_lock(memcg); > > - /* > > -* Even if signal_pending(), we can't quit charge() loop without > > -* accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL > > -* under OOM is always welcomed, use TASK_KILLABLE here. > > -*/ > > - prepare_to_wait(_oom_waitq, , TASK_KILLABLE); > > - if (!locked || memcg->oom_kill_disable) > > + if (!locked || memcg->oom_kill_disable) { > > need_to_kill = false; > > + if (in_userfault) { > > + /* > > +* We start sleeping on the OOM waitqueue only > > +* after unwinding the page fault stack, so > > +* make sure we detect wakeups that happen > > +* between now and then. > > +*/ > > + mem_cgroup_mark_under_oom(memcg); > > + current->memcg_oom.wakeups = > > + atomic_read(>oom_wakeups); > > + css_get(>css); > > + current->memcg_oom.memcg = memcg; > > + } > > + } > > if (locked) > >
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > @@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg) > { > /* for filtering, pass "memcg" as argument. */ > __wake_up(_oom_waitq, TASK_NORMAL, 0, memcg); > + atomic_inc(>oom_wakeups); > } > > static void memcg_oom_recover(struct mem_cgroup *memcg) [...] > + prepare_to_wait(_oom_waitq, , TASK_KILLABLE); > + /* Only sleep if we didn't miss any wakeups since OOM */ > + if (atomic_read(>oom_wakeups) == current->memcg_oom.wakeups) > + schedule(); On the way home it occured to me that the ordering might be wrong here. The wake up can be lost here. __wake_up(memcg_oom_waitq) prepare_to_wait atomic_read(>oom_wakeups) atomic_inc(oom_wakeups) I guess we want atomic_inc before __wake_up, right? -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] > From: Johannes Weiner > Subject: [PATCH] memcg: more robust oom handling > > The memcg OOM handling is incredibly fragile because once a memcg goes > OOM, one task (kernel or userspace) is responsible for resolving the > situation. Every other task that gets caught trying to charge memory > gets stuck in a waitqueue while potentially holding various filesystem > and mm locks on which the OOM handling task may now deadlock. > > Do two things to charge attempts under OOM: > > 1. Do not trap system calls (buffered IO and friends), just return >-ENOMEM. Userspace should be able to handle this... right? > > 2. Do not trap page faults directly in the charging context. Instead, >remember the OOMing memcg in the task struct and fully unwind the >page fault stack first. Then synchronize the memcg OOM from >pagefault_out_of_memory() I think this should work and I really like it! Nice work Johannes, I never dared to go that deep and my opposite approach was also much more fragile. I am just afraid about all the other archs that do not support (from quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me. Few comments bellow. > Not-quite-signed-off-by: Johannes Weiner > --- > arch/x86/mm/fault.c| 2 + > include/linux/memcontrol.h | 6 +++ > include/linux/sched.h | 6 +++ > mm/memcontrol.c| 104 > + > mm/oom_kill.c | 7 ++- > 5 files changed, 78 insertions(+), 47 deletions(-) > [...] > diff --git a/include/linux/sched.h b/include/linux/sched.h > index e692a02..cf60aef 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1282,6 +1282,8 @@ struct task_struct { >* execve */ > unsigned in_iowait:1; > > + unsigned in_userfault:1; > + [This is more a nit pick but before I forget while I am reading through the rest of the patch.] OK there is a lot of room around those bit fields but as this is only for memcg and you are enlarging the structure by the pointer then you can reuse bottom bit of memcg pointer. > /* task may not gain privileges */ > unsigned no_new_privs:1; > [...] > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index cc3026a..6e13ebe 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c [...] > @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup > *memcg) > } > > /* > - * try to call OOM killer. returns false if we should exit memory-reclaim > loop. > + * try to call OOM killer > */ > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > - int order) > +static void mem_cgroup_oom(struct mem_cgroup *memcg, > +gfp_t mask, int order, > +bool in_userfault) > { > - struct oom_wait_info owait; > - bool locked, need_to_kill; > - > - owait.memcg = memcg; > - owait.wait.flags = 0; > - owait.wait.func = memcg_oom_wake_function; > - owait.wait.private = current; > - INIT_LIST_HEAD(_list); > - need_to_kill = true; > - mem_cgroup_mark_under_oom(memcg); > + bool locked, need_to_kill = true; > > /* At first, try to OOM lock hierarchy under memcg.*/ > spin_lock(_oom_lock); > locked = mem_cgroup_oom_lock(memcg); > - /* > - * Even if signal_pending(), we can't quit charge() loop without > - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL > - * under OOM is always welcomed, use TASK_KILLABLE here. > - */ > - prepare_to_wait(_oom_waitq, , TASK_KILLABLE); > - if (!locked || memcg->oom_kill_disable) > + if (!locked || memcg->oom_kill_disable) { > need_to_kill = false; > + if (in_userfault) { > + /* > + * We start sleeping on the OOM waitqueue only > + * after unwinding the page fault stack, so > + * make sure we detect wakeups that happen > + * between now and then. > + */ > + mem_cgroup_mark_under_oom(memcg); > + current->memcg_oom.wakeups = > + atomic_read(>oom_wakeups); > + css_get(>css); > + current->memcg_oom.memcg = memcg; > + } > + } > if (locked) > mem_cgroup_oom_notify(memcg); > spin_unlock(_oom_lock); > > if (need_to_kill) { > - finish_wait(_oom_waitq, ); > mem_cgroup_out_of_memory(memcg, mask, order); > - } else { > - schedule(); > - finish_wait(_oom_waitq, ); > + memcg_oom_recover(memcg);
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: more robust oom handling The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things to charge attempts under OOM: 1. Do not trap system calls (buffered IO and friends), just return -ENOMEM. Userspace should be able to handle this... right? 2. Do not trap page faults directly in the charging context. Instead, remember the OOMing memcg in the task struct and fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory() I think this should work and I really like it! Nice work Johannes, I never dared to go that deep and my opposite approach was also much more fragile. I am just afraid about all the other archs that do not support (from quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me. Few comments bellow. Not-quite-signed-off-by: Johannes Weiner han...@cmpxchg.org --- arch/x86/mm/fault.c| 2 + include/linux/memcontrol.h | 6 +++ include/linux/sched.h | 6 +++ mm/memcontrol.c| 104 + mm/oom_kill.c | 7 ++- 5 files changed, 78 insertions(+), 47 deletions(-) [...] diff --git a/include/linux/sched.h b/include/linux/sched.h index e692a02..cf60aef 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1282,6 +1282,8 @@ struct task_struct { * execve */ unsigned in_iowait:1; + unsigned in_userfault:1; + [This is more a nit pick but before I forget while I am reading through the rest of the patch.] OK there is a lot of room around those bit fields but as this is only for memcg and you are enlarging the structure by the pointer then you can reuse bottom bit of memcg pointer. /* task may not gain privileges */ unsigned no_new_privs:1; [...] diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cc3026a..6e13ebe 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c [...] @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. + * try to call OOM killer */ -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, - int order) +static void mem_cgroup_oom(struct mem_cgroup *memcg, +gfp_t mask, int order, +bool in_userfault) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.memcg = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked, need_to_kill = true; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(memcg_oom_waitq, owait.wait, TASK_KILLABLE); - if (!locked || memcg-oom_kill_disable) + if (!locked || memcg-oom_kill_disable) { need_to_kill = false; + if (in_userfault) { + /* + * We start sleeping on the OOM waitqueue only + * after unwinding the page fault stack, so + * make sure we detect wakeups that happen + * between now and then. + */ + mem_cgroup_mark_under_oom(memcg); + current-memcg_oom.wakeups = + atomic_read(memcg-oom_wakeups); + css_get(memcg-css); + current-memcg_oom.memcg = memcg; + } + } if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(memcg_oom_lock); if (need_to_kill) { - finish_wait(memcg_oom_waitq, owait.wait); mem_cgroup_out_of_memory(memcg, mask, order); - } else { - schedule(); - finish_wait(memcg_oom_waitq, owait.wait); +
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote: @@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg) { /* for filtering, pass memcg as argument. */ __wake_up(memcg_oom_waitq, TASK_NORMAL, 0, memcg); + atomic_inc(memcg-oom_wakeups); } static void memcg_oom_recover(struct mem_cgroup *memcg) [...] + prepare_to_wait(memcg_oom_waitq, owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(memcg-oom_wakeups) == current-memcg_oom.wakeups) + schedule(); On the way home it occured to me that the ordering might be wrong here. The wake up can be lost here. __wake_up(memcg_oom_waitq) preempted prepare_to_wait atomic_read(memcg-oom_wakeups) atomic_inc(oom_wakeups) I guess we want atomic_inc before __wake_up, right? -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: more robust oom handling The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things to charge attempts under OOM: 1. Do not trap system calls (buffered IO and friends), just return -ENOMEM. Userspace should be able to handle this... right? 2. Do not trap page faults directly in the charging context. Instead, remember the OOMing memcg in the task struct and fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory() I think this should work and I really like it! Nice work Johannes, I never dared to go that deep and my opposite approach was also much more fragile. I am just afraid about all the other archs that do not support (from quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me. blackfin is NOMMU but I guess the others should be converted to the proper OOM protocol anyway and not just kill the faulting task. I can update them in the next version of the patch (series). Not-quite-signed-off-by: Johannes Weiner han...@cmpxchg.org --- arch/x86/mm/fault.c| 2 + include/linux/memcontrol.h | 6 +++ include/linux/sched.h | 6 +++ mm/memcontrol.c| 104 + mm/oom_kill.c | 7 ++- 5 files changed, 78 insertions(+), 47 deletions(-) [...] diff --git a/include/linux/sched.h b/include/linux/sched.h index e692a02..cf60aef 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1282,6 +1282,8 @@ struct task_struct { * execve */ unsigned in_iowait:1; + unsigned in_userfault:1; + [This is more a nit pick but before I forget while I am reading through the rest of the patch.] OK there is a lot of room around those bit fields but as this is only for memcg and you are enlarging the structure by the pointer then you can reuse bottom bit of memcg pointer. I just didn't want to put anything in the arch code that looks too memcgish, even though it's the only user right now. But granted, it will also probably remain the only user for a while. @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. + * try to call OOM killer */ -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, - int order) +static void mem_cgroup_oom(struct mem_cgroup *memcg, + gfp_t mask, int order, + bool in_userfault) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.memcg = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked, need_to_kill = true; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* -* Even if signal_pending(), we can't quit charge() loop without -* accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL -* under OOM is always welcomed, use TASK_KILLABLE here. -*/ - prepare_to_wait(memcg_oom_waitq, owait.wait, TASK_KILLABLE); - if (!locked || memcg-oom_kill_disable) + if (!locked || memcg-oom_kill_disable) { need_to_kill = false; + if (in_userfault) { + /* +* We start sleeping on the OOM waitqueue only +* after unwinding the page fault stack, so +* make sure we detect wakeups that happen +* between now and then. +*/ + mem_cgroup_mark_under_oom(memcg); + current-memcg_oom.wakeups = + atomic_read(memcg-oom_wakeups); + css_get(memcg-css); + current-memcg_oom.memcg = memcg; + } + } if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(memcg_oom_lock); if (need_to_kill) { -
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 06:31:34PM +0200, Michal Hocko wrote: On Sat 01-06-13 02:11:51, Johannes Weiner wrote: @@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg) { /* for filtering, pass memcg as argument. */ __wake_up(memcg_oom_waitq, TASK_NORMAL, 0, memcg); + atomic_inc(memcg-oom_wakeups); } static void memcg_oom_recover(struct mem_cgroup *memcg) [...] + prepare_to_wait(memcg_oom_waitq, owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(memcg-oom_wakeups) == current-memcg_oom.wakeups) + schedule(); On the way home it occured to me that the ordering might be wrong here. The wake up can be lost here. __wake_up(memcg_oom_waitq) preempted prepare_to_wait atomic_read(memcg-oom_wakeups) atomic_inc(oom_wakeups) I guess we want atomic_inc before __wake_up, right? I think you are right, thanks for spotting this. Will be fixed in version 2. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Andrew Morton wrote: Admins may set the oom killer delay using the new interface: # echo 6 memory.oom_delay_millisecs This will defer oom killing to the kernel only after 60 seconds has elapsed by putting the task to sleep for 60 seconds. How often is that delay actually useful, in the real world? IOW, in what proportion of cases does the system just remain stuck for 60 seconds and then get an oom-killing? It wouldn't be the system, it would just be the oom memcg that would be stuck. We actually use 10s by default, but it's adjustable for users in their own memcg hierarchies. It gives just enough time for userspace to deal with the situation and then defer to the kernel if it's unresponsive, this tends to happen quite regularly when you have many, many servers. Same situation if the userspace oom handler has died and isn't running, perhaps because of its own memory constraints (everything on our systems is memory constrained). Obviously it isn't going to reenable the oom killer before it dies from SIGSEGV. I'd argue that the current functionality that allows users to disable the oom killer for a memcg indefinitely is a bit ridiculous. It requires admin intervention to fix such a state and it would be pointless to have an oom memcg for a week, a month, a year, just completely deadlocked on making forward progress and consuming resources. memory.oom_delay_millisecs in my patch is limited to MAX_SCHEDULE_TIMEOUT just as a sanity check since we currently allow indefinite oom killer disabling. I think if we were to rethink disabling the oom killer entirely via memory.oom_control and realize such a condition over a prolonged period is insane then this memory.oom_delay_millisecs ceiling would be better defined as something in minutes. At the same time, we really like userspace oom notifications so users can implement their own handlers. So where's the compromise between instantly oom killing something and waiting forever for userspace to respond? My suggestion is memory.oom_delay_millisecs. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 12:48:39, Johannes Weiner wrote: On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: more robust oom handling The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things to charge attempts under OOM: 1. Do not trap system calls (buffered IO and friends), just return -ENOMEM. Userspace should be able to handle this... right? 2. Do not trap page faults directly in the charging context. Instead, remember the OOMing memcg in the task struct and fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory() I think this should work and I really like it! Nice work Johannes, I never dared to go that deep and my opposite approach was also much more fragile. I am just afraid about all the other archs that do not support (from quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me. blackfin is NOMMU but I guess the others should be converted to the proper OOM protocol anyway and not just kill the faulting task. I can update them in the next version of the patch (series). OK, if you are willing to convert them all then even better. Not-quite-signed-off-by: Johannes Weiner han...@cmpxchg.org --- arch/x86/mm/fault.c| 2 + include/linux/memcontrol.h | 6 +++ include/linux/sched.h | 6 +++ mm/memcontrol.c| 104 + mm/oom_kill.c | 7 ++- 5 files changed, 78 insertions(+), 47 deletions(-) [...] diff --git a/include/linux/sched.h b/include/linux/sched.h index e692a02..cf60aef 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1282,6 +1282,8 @@ struct task_struct { * execve */ unsigned in_iowait:1; + unsigned in_userfault:1; + [This is more a nit pick but before I forget while I am reading through the rest of the patch.] OK there is a lot of room around those bit fields but as this is only for memcg and you are enlarging the structure by the pointer then you can reuse bottom bit of memcg pointer. I just didn't want to put anything in the arch code that looks too memcgish, even though it's the only user right now. But granted, it will also probably remain the only user for a while. OK, no objection. I would just use a more generic name. Something like memcg_oom_can_block. [...`] if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(memcg_oom_lock); if (need_to_kill) { - finish_wait(memcg_oom_waitq, owait.wait); mem_cgroup_out_of_memory(memcg, mask, order); - } else { - schedule(); - finish_wait(memcg_oom_waitq, owait.wait); + memcg_oom_recover(memcg); Why do we need to call memcg_oom_recover here? We do not know that any charges have been released. Say mem_cgroup_out_of_memory selected a task which migrated to our group (without its charges) so we would kill the poor guy and free no memory from this group. Now you wake up oom waiters to refault but they will end up in the same situation. I think it should be sufficient to wait for memcg_oom_recover until the memory is uncharged (which we do already). It's a leftover from how it was before (see the memcg_wakeup_oom below), but you are right, we can get rid of it. right, I have missed that. Then it would deserve a note in the changelog. Something like memcg_wakeup_oom should be removed because there is no guarantee that mem_cgroup_out_of_memory was able to free any memory. It could have killed a task which doesn't have any charges from the group. Waiters should be woken up by memcg_oom_recover during uncharge or a limit change. [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Sat, 1 Jun 2013, Michal Hocko wrote: Users obviously don't have the ability to attach processes to the root memcg. They are constrained to their own subtree of memcgs. OK, I assume those groups are generally untrusted, right? So you cannot let them register their oom handler even via an admin interface. This makes it a bit complicated because it makes much harder demands on the handler itself as it has to run under restricted environment. That's the point of the patch. We want to allow users to register their own oom handler in a subtree (they may attach it to their own subtree root and wait on memory.oom_control of a child memcg with a limit less than that root) but not insist on an absolutely perfect implementation that can never fail when you run on many, many servers. Userspace implementations do fail sometimes, we just accept that. I still do not see why you cannot simply read tasks file into a preallocated buffer. This would be few pages even for thousands of pids. You do not have to track processes as they come and go. What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice. As I said before. oom_delay_millisecs is actually really easy to be done from userspace. If you really need a safety break then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforeced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case? You're suggesting another userspace process that solely waits for a set duration and then reenables the oom killer? It faces all the same problems as the true userspace oom handler: it's own perfect implementation and it's own memcg constraints. If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point and requires admin intervention or a reboot. Yes, insisting on the same subtree makes the life much harder for oom handlers. I totally agree with you on that. I just feel that introducing a new knob to workaround user inability to write a proper handler (what ever that means) is not justified. It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no inability to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work. Then why does cat tasks stall when my memcg is totally depleted of all memory? if you run it like this then cat obviously needs some charged allocations. If you had a proper handler which mlocks its buffer for the read syscall then you shouldn't require any allocation at the oom time. This shouldn't be that hard to do without too much memory overhead. As I said we are talking about few (dozens) of pages per handler. I'm talking about the memory the kernel allocates when reading the tasks file, not userspace. This can, and will, return -ENOMEM. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote: On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote: On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: more robust oom handling The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things to charge attempts under OOM: 1. Do not trap system calls (buffered IO and friends), just return -ENOMEM. Userspace should be able to handle this... right? 2. Do not trap page faults directly in the charging context. Instead, remember the OOMing memcg in the task struct and fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory() I think this should work and I really like it! Nice work Johannes, I never dared to go that deep and my opposite approach was also much more fragile. I am just afraid about all the other archs that do not support (from quick grep it looks like: blackfin, c6x, h8300, metag, mn10300, openrisc, score and tile). What would be an alternative for them? #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This would be acceptable for me. blackfin is NOMMU but I guess the others should be converted to the proper OOM protocol anyway and not just kill the faulting task. I can update them in the next version of the patch (series). It's no longer necessary since I remove the arch-specific flag setting, but I converted them anyway while I was at it. Will send them as a separate patch. Not-quite-signed-off-by: Johannes Weiner han...@cmpxchg.org --- arch/x86/mm/fault.c| 2 + include/linux/memcontrol.h | 6 +++ include/linux/sched.h | 6 +++ mm/memcontrol.c| 104 + mm/oom_kill.c | 7 ++- 5 files changed, 78 insertions(+), 47 deletions(-) [...] diff --git a/include/linux/sched.h b/include/linux/sched.h index e692a02..cf60aef 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1282,6 +1282,8 @@ struct task_struct { * execve */ unsigned in_iowait:1; + unsigned in_userfault:1; + [This is more a nit pick but before I forget while I am reading through the rest of the patch.] OK there is a lot of room around those bit fields but as this is only for memcg and you are enlarging the structure by the pointer then you can reuse bottom bit of memcg pointer. I just didn't want to put anything in the arch code that looks too memcgish, even though it's the only user right now. But granted, it will also probably remain the only user for a while. I tried a couple of variants, including using the lowest memcg bit, but it all turned into more ugliness. So that .in_userfault is still there in v2, but it's now set in handle_mm_fault() in a generic manner depending on a fault flag, please reconsider if you can live with it. @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. + * try to call OOM killer */ -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, - int order) +static void mem_cgroup_oom(struct mem_cgroup *memcg, +gfp_t mask, int order, +bool in_userfault) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.memcg = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked, need_to_kill = true; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(memcg_oom_waitq, owait.wait, TASK_KILLABLE); - if (!locked || memcg-oom_kill_disable) + if (!locked || memcg-oom_kill_disable) { need_to_kill = false; + if (in_userfault) { + /* + * We start sleeping on the OOM waitqueue only + * after unwinding the page fault stack, so + * make
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 11:18:09AM -0700, David Rientjes wrote: On Sat, 1 Jun 2013, Michal Hocko wrote: OK, I assume those groups are generally untrusted, right? So you cannot let them register their oom handler even via an admin interface. This makes it a bit complicated because it makes much harder demands on the handler itself as it has to run under restricted environment. That's the point of the patch. We want to allow users to register their own oom handler in a subtree (they may attach it to their own subtree root and wait on memory.oom_control of a child memcg with a limit less than that root) but not insist on an absolutely perfect implementation that can never fail when you run on many, many servers. Userspace implementations do fail sometimes, we just accept that. I still do not see why you cannot simply read tasks file into a preallocated buffer. This would be few pages even for thousands of pids. You do not have to track processes as they come and go. What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice. As I said before. oom_delay_millisecs is actually really easy to be done from userspace. If you really need a safety break then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforeced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case? You're suggesting another userspace process that solely waits for a set duration and then reenables the oom killer? It faces all the same problems as the true userspace oom handler: it's own perfect implementation and it's own memcg constraints. If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point and requires admin intervention or a reboot. Yes, insisting on the same subtree makes the life much harder for oom handlers. I totally agree with you on that. I just feel that introducing a new knob to workaround user inability to write a proper handler (what ever that means) is not justified. It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no inability to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work. It's only ridiculous if your OOM handler is subject to the OOM situation it's trying to handle. Don't act as if the oom disabling semantics were unreasonable or inconsistent with the rest of the system, memcgs were not really meant to be self-policed by the tasks running in them. That's why we have the waitqueue, so that everybody sits there and waits until an outside force resolves the situation. There is nothing wrong with that, you just have a new requirement. Then why does cat tasks stall when my memcg is totally depleted of all memory? if you run it like this then cat obviously needs some charged allocations. If you had a proper handler which mlocks its buffer for the read syscall then you shouldn't require any allocation at the oom time. This shouldn't be that hard to do without too much memory overhead. As I said we are talking about few (dozens) of pages per handler. I'm talking about the memory the kernel allocates when reading the tasks file, not userspace. This can, and will, return -ENOMEM. Do you mean due to kmem limitations? What we could do is allow one task in the group to be the dedicated OOM handler. If we catch this task in the charge path during an OOM situation, we fall back to the kernel OOM handler. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, 3 Jun 2013, Johannes Weiner wrote: It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no inability to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work. It's only ridiculous if your OOM handler is subject to the OOM situation it's trying to handle. You're suggesting the oom handler can't be subject to its own memcg limits, independent of the memcg it is handling? If we demand that such a handler be attached to the root memcg, that breaks the memory isolation that memcg provides. We constrain processes to a memcg for the purposes of that isolation, so it cannot use more resources than allotted. Don't act as if the oom disabling semantics were unreasonable or inconsistent with the rest of the system, memcgs were not really meant to be self-policed by the tasks running in them. That's why we have the waitqueue, so that everybody sits there and waits until an outside force resolves the situation. There is nothing wrong with that, you just have a new requirement. The waitqueue doesn't solve anything with regard to the memory, if the memcg sits there and deadlocks forever then it is using resources (memory, not cpu) that will never be freed. I'm talking about the memory the kernel allocates when reading the tasks file, not userspace. This can, and will, return -ENOMEM. Do you mean due to kmem limitations? Yes. What we could do is allow one task in the group to be the dedicated OOM handler. If we catch this task in the charge path during an OOM situation, we fall back to the kernel OOM handler. I'm not sure it even makes sense to have more than one oom handler per memcg and the synchronization that requires in userspace to get the right result, so I didn't consider dedicating a single oom handler. That would be an entirely new interface, though, since we may have multiple processes waiting on memory.oom_control that aren't necessarily handlers; they grab a snapshot of memory, do logging, etc. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon 03-06-13 11:18:09, David Rientjes wrote: On Sat, 1 Jun 2013, Michal Hocko wrote: [...] I still do not see why you cannot simply read tasks file into a preallocated buffer. This would be few pages even for thousands of pids. You do not have to track processes as they come and go. What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? That would require that you track kernel allocations which is currently done only for explicit caches. Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice. If you really need to have such a guarantee then you can have a _global_ watchdog observing oom_control of all groups that provide such a vague requirements for oom user handlers. As I said before. oom_delay_millisecs is actually really easy to be done from userspace. If you really need a safety break then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforeced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case? You're suggesting another userspace process that solely waits for a set duration and then reenables the oom killer? Yes which kicks the oom killer. It faces all the same problems as the true userspace oom handler: it's own perfect implementation and it's own memcg constraints. But that solution might be implemented as a global policy living in a group with some reservations. [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, 3 Jun 2013, Michal Hocko wrote: What do you suggest when you read the tasks file and it returns -ENOMEM because kmalloc() fails because the userspace oom handler's memcg is also oom? That would require that you track kernel allocations which is currently done only for explicit caches. That will not always be the case, and I think this could be a prerequisite patch for such support that we have internally. I'm not sure a userspace oom notifier would want to keep a preallocated buffer around that is mlocked in memory for all possible lengths of this file. Obviously it's not a situation we want to get into, but unless you know that handler's exact memory usage across multiple versions, nothing else is sharing that memcg, and it's a perfect implementation, you can't guarantee it. We need to address real world problems that occur in practice. If you really need to have such a guarantee then you can have a _global_ watchdog observing oom_control of all groups that provide such a vague requirements for oom user handlers. The whole point is to allow the user to implement their own oom policy. If the policy was completely encapsulated in kernel code, we don't need to ever disable the oom killer even with memory.oom_control. Users may choose to kill the largest process, the newest process, the oldest process, sacrifice children instead of parents, prevent forkbombs, implement their own priority scoring (which is what we do), kill the allocating task, etc. To not merge this patch, I'd ask that you show an alternative that allows users to implement their own userspace oom handlers and not require admin intervention when things go wrong. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but do not trap other tasks and just return -ENOMEM for everyone. Userspace should be able to handle this... right? 2. When OOMing in a page fault, invoke the OOM killer but do not trap other chargers directly in the charging code. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory(). While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. Signed-off-by: Johannes Weiner han...@cmpxchg.org From point of the memcg oom notification view, it is NOT supported on the case that an oom handler process is subjected its own memcg limit. All memcg developers clearly agreed it since it began. Even though, anyway, people have a right to shoot their own foot. :) However, this patch fixes more than that. OK, I prefer it. Good fix! Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Mon, Jun 03, 2013 at 12:09:22PM -0700, David Rientjes wrote: On Mon, 3 Jun 2013, Johannes Weiner wrote: It's not necessarily harder if you assign the userspace oom handlers to the root of your subtree with access to more memory than the children. There is no inability to write a proper handler, but when you have dozens of individual users implementing their own userspace handlers with changing memcg limits over time, then you might find it hard to have perfection every time. If we had perfection, we wouldn't have to worry about oom in the first place. We can't just let these gazillion memcgs sit spinning forever because they get stuck, either. That's why we've used this solution for years as a failsafe. Disabling the oom killer entirely, even for a memcg, is ridiculous, and if you don't have a grace period then oom handlers themselves just don't work. It's only ridiculous if your OOM handler is subject to the OOM situation it's trying to handle. You're suggesting the oom handler can't be subject to its own memcg limits, independent of the memcg it is handling? If we demand that such a handler be attached to the root memcg, that breaks the memory isolation that memcg provides. We constrain processes to a memcg for the purposes of that isolation, so it cannot use more resources than allotted. I guess the traditional use case is that you have a job manager that you trust, that sets up the groups, configures their limits etc. and would also know more about the jobs than the kernel to act as the OOM killer as well. Since it's trusted and not expected to consume any significant amounts of memory by itself, the memcg code assumes that it does not run in a cgroup itself, it's just not thought of as being part of the application class. I'm not saying it should necessarily stay that way, but it's also not a completely ridiculous model. What we could do is allow one task in the group to be the dedicated OOM handler. If we catch this task in the charge path during an OOM situation, we fall back to the kernel OOM handler. I'm not sure it even makes sense to have more than one oom handler per memcg and the synchronization that requires in userspace to get the right result, so I didn't consider dedicating a single oom handler. That would be an entirely new interface, though, since we may have multiple processes waiting on memory.oom_control that aren't necessarily handlers; they grab a snapshot of memory, do logging, etc. It will probably be hard to extend the existing oom_control interface. But we could add a separate one where you put in a pid or 0 (kernel) instead of a boolean value, which then enables or disables the userspace OOM handling task for the whole subtree. If that task enters the OOM path, the kernel handler is invoked. If the task dies, the kernel handler is permanently re-enabled. Everybody is free to poll that interface for OOM notifications, not only that one task. Combined with the enter waitqueue after unwinding page fault stack patch, would this fully cover your usecase? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Sat, Jun 01, 2013 at 12:29:05PM +0200, Michal Hocko wrote: > On Sat 01-06-13 02:11:51, Johannes Weiner wrote: > [...] > > I'm currently messing around with the below patch. When a task faults > > and charges under OOM, the memcg is remembered in the task struct and > > then made to sleep on the memcg's OOM waitqueue only after unwinding > > the page fault stack. With the kernel OOM killer disabled, all tasks > > in the OOMing group sit nicely in > > > > mem_cgroup_oom_synchronize > > pagefault_out_of_memory > > mm_fault_error > > __do_page_fault > > page_fault > > 0x > > > > regardless of whether they were faulting anon or file. They do not > > even hold the mmap_sem anymore at this point. > > > > [ I kept syscalls really simple for now and just have them return > > -ENOMEM, never trap them at all (just like the global OOM case). > > It would be more work to have them wait from a flatter stack too, > > but it should be doable if necessary. ] > > > > I suggested this at the MM summit and people were essentially asking > > if I was feeling well, so maybe I'm still missing a gaping hole in > > this idea. > > I didn't get to look at the patch (will do on Monday) but it doesn't > sounds entirely crazy. Well, we would have to drop mmap_sem so things > have to be rechecked but we are doing that already with VM_FAULT_RETRY > in some archs so it should work. The global OOM case has been doing this for a long time (1c0fe6e mm: invoke oom-killer from page fault), way before VM_FAULT_RETRY. The fault is aborted with VM_FAULT_OOM and the oom killer is invoked. Either the task gets killed or it'll just retrigger the fault. The only difference is that a memcg OOM kill may take longer because of userspace handling, so memcg needs a waitqueue where the global case simply does a trylock (try_set_zonelist_oom) and restarts the fault immediately if somebody else is handling the situation. In fact, when Nick added the page fault OOM invocation, KAME merged something similar to my patch, which tried to catch memcg OOM kills from pagefault_out_of_memory() (a636b32 memcg: avoid unnecessary system-wide-oom-killer). Only when he reworked the whole memcg OOM synchronization, added the ability to disable OOM and the waitqueues etc, the OOMs were trapped right there in the charge context (867578c memcg: fix oom kill behavior). But I see no reason why we shouldn't be able to keep the waitqueues and still go back to synchronizing from the bottom of the page fault stack. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] > I'm currently messing around with the below patch. When a task faults > and charges under OOM, the memcg is remembered in the task struct and > then made to sleep on the memcg's OOM waitqueue only after unwinding > the page fault stack. With the kernel OOM killer disabled, all tasks > in the OOMing group sit nicely in > > mem_cgroup_oom_synchronize > pagefault_out_of_memory > mm_fault_error > __do_page_fault > page_fault > 0x > > regardless of whether they were faulting anon or file. They do not > even hold the mmap_sem anymore at this point. > > [ I kept syscalls really simple for now and just have them return > -ENOMEM, never trap them at all (just like the global OOM case). > It would be more work to have them wait from a flatter stack too, > but it should be doable if necessary. ] > > I suggested this at the MM summit and people were essentially asking > if I was feeling well, so maybe I'm still missing a gaping hole in > this idea. I didn't get to look at the patch (will do on Monday) but it doesn't sounds entirely crazy. Well, we would have to drop mmap_sem so things have to be rechecked but we are doing that already with VM_FAULT_RETRY in some archs so it should work. > Patch only works on x86 as of now, on other architectures memcg OOM > will invoke the global OOM killer. [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 12:29:17, David Rientjes wrote: > On Fri, 31 May 2013, Michal Hocko wrote: > > > > We allow users to control their own memcgs by chowning them, so they must > > > be run in the same hierarchy if they want to run their own userspace oom > > > handler. There's nothing in the kernel that prevents that and the user > > > has no other option but to run in a parent cgroup. > > > > If the access to the oom_control file is controlled by the file > > permissions then the oom handler can live inside root cgroup. Why do you > > need "must be in the same hierarchy" requirement? > > > > Users obviously don't have the ability to attach processes to the root > memcg. They are constrained to their own subtree of memcgs. OK, I assume those groups are generally untrusted, right? So you cannot let them register their oom handler even via an admin interface. This makes it a bit complicated because it makes much harder demands on the handler itself as it has to run under restricted environment. > > > It's too easy to simply do even a "ps ax" in an oom memcg and make that > > > thread completely unresponsive because it allocates memory. > > > > Yes, but we are talking about oom handler and that one has to be really > > careful about what it does. So doing something that simply allocates is > > dangerous. > > > > Show me a userspace oom handler that doesn't get notified of every fork() > in a memcg, causing a performance degradation of its own for a complete > and utter slowpath, that will know the entire process tree of its own > memcg or a child memcg. I still do not see why you cannot simply read tasks file into a preallocated buffer. This would be few pages even for thousands of pids. You do not have to track processes as they come and go. > This discussion is all fine and good from a theoretical point of view > until you actually have to implement one of these yourself. Multiple > users are going to be running their own oom notifiers and without some > sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too > easily deadlock looking for memory. As I said before. oom_delay_millisecs is actually really easy to be done from userspace. If you really need a safety break then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforeced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case? > If that user is constrained to his or her own subtree, as previously > stated, there's also no way to login and rectify the situation at that > point and requires admin intervention or a reboot. Yes, insisting on the same subtree makes the life much harder for oom handlers. I totally agree with you on that. I just feel that introducing a new knob to workaround user "inability" to write a proper handler (what ever that means) is not justified. > > > Then perhaps I'm raising constraints that you've never worked with, I > > > don't know. We choose to have a priority-based approach that is > > > inherited > > > by children; this priority is kept in userspace and and the oom handler > > > would naturally need to know the set of tasks in the oom memcg at the > > > time > > > of oom and their parent-child relationship. These priorities are > > > completely independent of memory usage. > > > > OK, but both reading tasks file and readdir should be doable without > > userspace (aka charged) allocations. Moreover if you run those oom > > handlers under the root cgroup then it should be even easier. > > Then why does "cat tasks" stall when my memcg is totally depleted of all > memory? if you run it like this then cat obviously needs some charged allocations. If you had a proper handler which mlocks its buffer for the read syscall then you shouldn't require any allocation at the oom time. This shouldn't be that hard to do without too much memory overhead. As I said we are talking about few (dozens) of pages per handler. > This isn't even the argument because memory.oom_delay_millisecs isn't > going to help that situation. I'm talking about a failsafe that ensures a > memcg can't deadlock. The global oom killer will always have to exist in > the kernel, at least in the most simplistic of forms, solely for this > reason; you can't move all of the logic to userspace and expect it to > react 100% of the time. The global case is even more complicated because every allocation matters - not just those that are charged as for memcg case. Btw. at last LSF there was a discussion about enabling oom_control for the root cgroup but this is off-topic here and should be discussed separately when somebody actually tries to implement this. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info
Re: [patch] mm, memcg: add oom killer delay
On Fri, May 31, 2013 at 12:29:17PM -0700, David Rientjes wrote: > On Fri, 31 May 2013, Michal Hocko wrote: > > > It's too easy to simply do even a "ps ax" in an oom memcg and make that > > > thread completely unresponsive because it allocates memory. > > > > Yes, but we are talking about oom handler and that one has to be really > > careful about what it does. So doing something that simply allocates is > > dangerous. > > > > Show me a userspace oom handler that doesn't get notified of every fork() > in a memcg, causing a performance degradation of its own for a complete > and utter slowpath, that will know the entire process tree of its own > memcg or a child memcg. > > This discussion is all fine and good from a theoretical point of view > until you actually have to implement one of these yourself. Multiple > users are going to be running their own oom notifiers and without some > sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too > easily deadlock looking for memory. If that user is constrained to his or > her own subtree, as previously stated, there's also no way to login and > rectify the situation at that point and requires admin intervention or a > reboot. Memcg OOM killing is fragile, doing it in userspace does not help. But a userspace OOM handler that is subject to the OOM situation it is supposed to resolve? That's kind of... cruel. However, I agree with your other point: as long as chargers can get stuck during OOM while sitting on an unpredictable bunch of locks, we can not claim that there is a right way to write a reliable userspace OOM handler. Userspace has no way of knowing which operations are safe, and it might even vary between kernel versions. But I would prefer a better solution than the timeout. Michal in the past already tried to disable the OOM killer for page cache charges in general due to deadlock scenarios. Could we invoke the killer just the same but make sure we never trap tasks in the charge context? I'm currently messing around with the below patch. When a task faults and charges under OOM, the memcg is remembered in the task struct and then made to sleep on the memcg's OOM waitqueue only after unwinding the page fault stack. With the kernel OOM killer disabled, all tasks in the OOMing group sit nicely in mem_cgroup_oom_synchronize pagefault_out_of_memory mm_fault_error __do_page_fault page_fault 0x regardless of whether they were faulting anon or file. They do not even hold the mmap_sem anymore at this point. [ I kept syscalls really simple for now and just have them return -ENOMEM, never trap them at all (just like the global OOM case). It would be more work to have them wait from a flatter stack too, but it should be doable if necessary. ] I suggested this at the MM summit and people were essentially asking if I was feeling well, so maybe I'm still missing a gaping hole in this idea. Patch only works on x86 as of now, on other architectures memcg OOM will invoke the global OOM killer. --- From: Johannes Weiner Subject: [PATCH] memcg: more robust oom handling The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things to charge attempts under OOM: 1. Do not trap system calls (buffered IO and friends), just return -ENOMEM. Userspace should be able to handle this... right? 2. Do not trap page faults directly in the charging context. Instead, remember the OOMing memcg in the task struct and fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory() Not-quite-signed-off-by: Johannes Weiner --- arch/x86/mm/fault.c| 2 + include/linux/memcontrol.h | 6 +++ include/linux/sched.h | 6 +++ mm/memcontrol.c| 104 + mm/oom_kill.c | 7 ++- 5 files changed, 78 insertions(+), 47 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 0e88336..df043a0 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1185,7 +1185,9 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault: */ + current->in_userfault = 1; fault = handle_mm_fault(mm, vma, address, flags); + current->in_userfault = 0; if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { if (mm_fault_error(regs, error_code, address, fault)) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d6183f0..e1b84ea 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -121,6 +121,7 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec
Re: [patch] mm, memcg: add oom killer delay
On Fri, May 31, 2013 at 12:29:17PM -0700, David Rientjes wrote: On Fri, 31 May 2013, Michal Hocko wrote: It's too easy to simply do even a ps ax in an oom memcg and make that thread completely unresponsive because it allocates memory. Yes, but we are talking about oom handler and that one has to be really careful about what it does. So doing something that simply allocates is dangerous. Show me a userspace oom handler that doesn't get notified of every fork() in a memcg, causing a performance degradation of its own for a complete and utter slowpath, that will know the entire process tree of its own memcg or a child memcg. This discussion is all fine and good from a theoretical point of view until you actually have to implement one of these yourself. Multiple users are going to be running their own oom notifiers and without some sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too easily deadlock looking for memory. If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point and requires admin intervention or a reboot. Memcg OOM killing is fragile, doing it in userspace does not help. But a userspace OOM handler that is subject to the OOM situation it is supposed to resolve? That's kind of... cruel. However, I agree with your other point: as long as chargers can get stuck during OOM while sitting on an unpredictable bunch of locks, we can not claim that there is a right way to write a reliable userspace OOM handler. Userspace has no way of knowing which operations are safe, and it might even vary between kernel versions. But I would prefer a better solution than the timeout. Michal in the past already tried to disable the OOM killer for page cache charges in general due to deadlock scenarios. Could we invoke the killer just the same but make sure we never trap tasks in the charge context? I'm currently messing around with the below patch. When a task faults and charges under OOM, the memcg is remembered in the task struct and then made to sleep on the memcg's OOM waitqueue only after unwinding the page fault stack. With the kernel OOM killer disabled, all tasks in the OOMing group sit nicely in mem_cgroup_oom_synchronize pagefault_out_of_memory mm_fault_error __do_page_fault page_fault 0x regardless of whether they were faulting anon or file. They do not even hold the mmap_sem anymore at this point. [ I kept syscalls really simple for now and just have them return -ENOMEM, never trap them at all (just like the global OOM case). It would be more work to have them wait from a flatter stack too, but it should be doable if necessary. ] I suggested this at the MM summit and people were essentially asking if I was feeling well, so maybe I'm still missing a gaping hole in this idea. Patch only works on x86 as of now, on other architectures memcg OOM will invoke the global OOM killer. --- From: Johannes Weiner han...@cmpxchg.org Subject: [PATCH] memcg: more robust oom handling The memcg OOM handling is incredibly fragile because once a memcg goes OOM, one task (kernel or userspace) is responsible for resolving the situation. Every other task that gets caught trying to charge memory gets stuck in a waitqueue while potentially holding various filesystem and mm locks on which the OOM handling task may now deadlock. Do two things to charge attempts under OOM: 1. Do not trap system calls (buffered IO and friends), just return -ENOMEM. Userspace should be able to handle this... right? 2. Do not trap page faults directly in the charging context. Instead, remember the OOMing memcg in the task struct and fully unwind the page fault stack first. Then synchronize the memcg OOM from pagefault_out_of_memory() Not-quite-signed-off-by: Johannes Weiner han...@cmpxchg.org --- arch/x86/mm/fault.c| 2 + include/linux/memcontrol.h | 6 +++ include/linux/sched.h | 6 +++ mm/memcontrol.c| 104 + mm/oom_kill.c | 7 ++- 5 files changed, 78 insertions(+), 47 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 0e88336..df043a0 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1185,7 +1185,9 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault: */ + current-in_userfault = 1; fault = handle_mm_fault(mm, vma, address, flags); + current-in_userfault = 0; if (unlikely(fault (VM_FAULT_RETRY|VM_FAULT_ERROR))) { if (mm_fault_error(regs, error_code, address, fault)) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d6183f0..e1b84ea 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -121,6 +121,7 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 12:29:17, David Rientjes wrote: On Fri, 31 May 2013, Michal Hocko wrote: We allow users to control their own memcgs by chowning them, so they must be run in the same hierarchy if they want to run their own userspace oom handler. There's nothing in the kernel that prevents that and the user has no other option but to run in a parent cgroup. If the access to the oom_control file is controlled by the file permissions then the oom handler can live inside root cgroup. Why do you need must be in the same hierarchy requirement? Users obviously don't have the ability to attach processes to the root memcg. They are constrained to their own subtree of memcgs. OK, I assume those groups are generally untrusted, right? So you cannot let them register their oom handler even via an admin interface. This makes it a bit complicated because it makes much harder demands on the handler itself as it has to run under restricted environment. It's too easy to simply do even a ps ax in an oom memcg and make that thread completely unresponsive because it allocates memory. Yes, but we are talking about oom handler and that one has to be really careful about what it does. So doing something that simply allocates is dangerous. Show me a userspace oom handler that doesn't get notified of every fork() in a memcg, causing a performance degradation of its own for a complete and utter slowpath, that will know the entire process tree of its own memcg or a child memcg. I still do not see why you cannot simply read tasks file into a preallocated buffer. This would be few pages even for thousands of pids. You do not have to track processes as they come and go. This discussion is all fine and good from a theoretical point of view until you actually have to implement one of these yourself. Multiple users are going to be running their own oom notifiers and without some sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too easily deadlock looking for memory. As I said before. oom_delay_millisecs is actually really easy to be done from userspace. If you really need a safety break then you can register such a handler as a fallback. I am not familiar with eventfd internals much but I guess that multiple handlers are possible. The fallback might be enforeced by the admin (when a new group is created) or by the container itself. Would something like this work for your use case? If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point and requires admin intervention or a reboot. Yes, insisting on the same subtree makes the life much harder for oom handlers. I totally agree with you on that. I just feel that introducing a new knob to workaround user inability to write a proper handler (what ever that means) is not justified. Then perhaps I'm raising constraints that you've never worked with, I don't know. We choose to have a priority-based approach that is inherited by children; this priority is kept in userspace and and the oom handler would naturally need to know the set of tasks in the oom memcg at the time of oom and their parent-child relationship. These priorities are completely independent of memory usage. OK, but both reading tasks file and readdir should be doable without userspace (aka charged) allocations. Moreover if you run those oom handlers under the root cgroup then it should be even easier. Then why does cat tasks stall when my memcg is totally depleted of all memory? if you run it like this then cat obviously needs some charged allocations. If you had a proper handler which mlocks its buffer for the read syscall then you shouldn't require any allocation at the oom time. This shouldn't be that hard to do without too much memory overhead. As I said we are talking about few (dozens) of pages per handler. This isn't even the argument because memory.oom_delay_millisecs isn't going to help that situation. I'm talking about a failsafe that ensures a memcg can't deadlock. The global oom killer will always have to exist in the kernel, at least in the most simplistic of forms, solely for this reason; you can't move all of the logic to userspace and expect it to react 100% of the time. The global case is even more complicated because every allocation matters - not just those that are charged as for memcg case. Btw. at last LSF there was a discussion about enabling oom_control for the root cgroup but this is off-topic here and should be discussed separately when somebody actually tries to implement this. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] I'm currently messing around with the below patch. When a task faults and charges under OOM, the memcg is remembered in the task struct and then made to sleep on the memcg's OOM waitqueue only after unwinding the page fault stack. With the kernel OOM killer disabled, all tasks in the OOMing group sit nicely in mem_cgroup_oom_synchronize pagefault_out_of_memory mm_fault_error __do_page_fault page_fault 0x regardless of whether they were faulting anon or file. They do not even hold the mmap_sem anymore at this point. [ I kept syscalls really simple for now and just have them return -ENOMEM, never trap them at all (just like the global OOM case). It would be more work to have them wait from a flatter stack too, but it should be doable if necessary. ] I suggested this at the MM summit and people were essentially asking if I was feeling well, so maybe I'm still missing a gaping hole in this idea. I didn't get to look at the patch (will do on Monday) but it doesn't sounds entirely crazy. Well, we would have to drop mmap_sem so things have to be rechecked but we are doing that already with VM_FAULT_RETRY in some archs so it should work. Patch only works on x86 as of now, on other architectures memcg OOM will invoke the global OOM killer. [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Sat, Jun 01, 2013 at 12:29:05PM +0200, Michal Hocko wrote: On Sat 01-06-13 02:11:51, Johannes Weiner wrote: [...] I'm currently messing around with the below patch. When a task faults and charges under OOM, the memcg is remembered in the task struct and then made to sleep on the memcg's OOM waitqueue only after unwinding the page fault stack. With the kernel OOM killer disabled, all tasks in the OOMing group sit nicely in mem_cgroup_oom_synchronize pagefault_out_of_memory mm_fault_error __do_page_fault page_fault 0x regardless of whether they were faulting anon or file. They do not even hold the mmap_sem anymore at this point. [ I kept syscalls really simple for now and just have them return -ENOMEM, never trap them at all (just like the global OOM case). It would be more work to have them wait from a flatter stack too, but it should be doable if necessary. ] I suggested this at the MM summit and people were essentially asking if I was feeling well, so maybe I'm still missing a gaping hole in this idea. I didn't get to look at the patch (will do on Monday) but it doesn't sounds entirely crazy. Well, we would have to drop mmap_sem so things have to be rechecked but we are doing that already with VM_FAULT_RETRY in some archs so it should work. The global OOM case has been doing this for a long time (1c0fe6e mm: invoke oom-killer from page fault), way before VM_FAULT_RETRY. The fault is aborted with VM_FAULT_OOM and the oom killer is invoked. Either the task gets killed or it'll just retrigger the fault. The only difference is that a memcg OOM kill may take longer because of userspace handling, so memcg needs a waitqueue where the global case simply does a trylock (try_set_zonelist_oom) and restarts the fault immediately if somebody else is handling the situation. In fact, when Nick added the page fault OOM invocation, KAME merged something similar to my patch, which tried to catch memcg OOM kills from pagefault_out_of_memory() (a636b32 memcg: avoid unnecessary system-wide-oom-killer). Only when he reworked the whole memcg OOM synchronization, added the ability to disable OOM and the waitqueues etc, the OOMs were trapped right there in the charge context (867578c memcg: fix oom kill behavior). But I see no reason why we shouldn't be able to keep the waitqueues and still go back to synchronizing from the bottom of the page fault stack. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Wed, 29 May 2013 18:18:10 -0700 (PDT) David Rientjes wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because it > is unresponsive. This scenario creates a memcg deadlock: tasks are > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > exit or move, or the oom killer reenabled and userspace is unable to do > so. > > An additional possible use case is to defer oom killing within a memcg > for a set period of time, probably to prevent unnecessary kills due to > temporary memory spikes, before allowing the kernel to handle the > condition. > > This patch adds an oom killer delay so that a memcg may be configured to > wait at least a pre-defined number of milliseconds before calling the oom > killer. If the oom condition persists for this number of milliseconds, > the oom killer will be called the next time the memory controller > attempts to charge a page (and memory.oom_control is set to 0). This > allows userspace to have a short period of time to respond to the > condition before deferring to the kernel to kill a task. > > Admins may set the oom killer delay using the new interface: > > # echo 6 > memory.oom_delay_millisecs > > This will defer oom killing to the kernel only after 60 seconds has > elapsed by putting the task to sleep for 60 seconds. How often is that delay actually useful, in the real world? IOW, in what proportion of cases does the system just remain stuck for 60 seconds and then get an oom-killing? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Michal Hocko wrote: > > We allow users to control their own memcgs by chowning them, so they must > > be run in the same hierarchy if they want to run their own userspace oom > > handler. There's nothing in the kernel that prevents that and the user > > has no other option but to run in a parent cgroup. > > If the access to the oom_control file is controlled by the file > permissions then the oom handler can live inside root cgroup. Why do you > need "must be in the same hierarchy" requirement? > Users obviously don't have the ability to attach processes to the root memcg. They are constrained to their own subtree of memcgs. > > It's too easy to simply do even a "ps ax" in an oom memcg and make that > > thread completely unresponsive because it allocates memory. > > Yes, but we are talking about oom handler and that one has to be really > careful about what it does. So doing something that simply allocates is > dangerous. > Show me a userspace oom handler that doesn't get notified of every fork() in a memcg, causing a performance degradation of its own for a complete and utter slowpath, that will know the entire process tree of its own memcg or a child memcg. This discussion is all fine and good from a theoretical point of view until you actually have to implement one of these yourself. Multiple users are going to be running their own oom notifiers and without some sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too easily deadlock looking for memory. If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point and requires admin intervention or a reboot. > > Then perhaps I'm raising constraints that you've never worked with, I > > don't know. We choose to have a priority-based approach that is inherited > > by children; this priority is kept in userspace and and the oom handler > > would naturally need to know the set of tasks in the oom memcg at the time > > of oom and their parent-child relationship. These priorities are > > completely independent of memory usage. > > OK, but both reading tasks file and readdir should be doable without > userspace (aka charged) allocations. Moreover if you run those oom > handlers under the root cgroup then it should be even easier. Then why does "cat tasks" stall when my memcg is totally depleted of all memory? This isn't even the argument because memory.oom_delay_millisecs isn't going to help that situation. I'm talking about a failsafe that ensures a memcg can't deadlock. The global oom killer will always have to exist in the kernel, at least in the most simplistic of forms, solely for this reason; you can't move all of the logic to userspace and expect it to react 100% of the time. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 03:22:59, David Rientjes wrote: > On Fri, 31 May 2013, Michal Hocko wrote: > > > I have always discouraged people from running oom handler in the same > > memcg (or even in the same hierarchy). > > > > We allow users to control their own memcgs by chowning them, so they must > be run in the same hierarchy if they want to run their own userspace oom > handler. There's nothing in the kernel that prevents that and the user > has no other option but to run in a parent cgroup. If the access to the oom_control file is controlled by the file permissions then the oom handler can live inside root cgroup. Why do you need "must be in the same hierarchy" requirement? > > Yes, mmap_sem is tricky. Nothing in the proc code should take it for > > writing and charges are done with mmap_sem held for reading but that > > doesn't prevent from non-oom thread to try to get it for writing which > > would block also all future readers. We have also seen i_mutex being > > held during charge so you have to be careful about that one as well but > > I am not aware of other locks that could be a problem. > > > > The question is, do you really need to open any /proc// files which > > depend on mmap_sem (e.g. maps, smaps). /proc//status should tell you > > about used memory. Or put it another way around. What kind of data you > > need for your OOM handler? > > > > It's too easy to simply do even a "ps ax" in an oom memcg and make that > thread completely unresponsive because it allocates memory. Yes, but we are talking about oom handler and that one has to be really careful about what it does. So doing something that simply allocates is dangerous. > > I might be thinking about different use cases but user space OOM > > handlers I have seen so far had quite a good idea what is going on > > in the group and who to kill. > > Then perhaps I'm raising constraints that you've never worked with, I > don't know. We choose to have a priority-based approach that is inherited > by children; this priority is kept in userspace and and the oom handler > would naturally need to know the set of tasks in the oom memcg at the time > of oom and their parent-child relationship. These priorities are > completely independent of memory usage. OK, but both reading tasks file and readdir should be doable without userspace (aka charged) allocations. Moreover if you run those oom handlers under the root cgroup then it should be even easier. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 03:22:59, David Rientjes wrote: > On Fri, 31 May 2013, Michal Hocko wrote: [...] > > > If the oom notifier is in the oom cgroup, it may not be able to > > > successfully read the memcg "tasks" file to even determine the set of > > > eligible processes. > > > > It would have to use preallocated buffer and have mlocked all the memory > > that will be used during oom event. > > > > Wrong, the kernel itself allocates memory when reading this information > and that would fail in an oom memcg. But that memory is not charged to a memcg, is it? So unless you are heading towards global OOM you should be safe. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Michal Hocko wrote: > I have always discouraged people from running oom handler in the same > memcg (or even in the same hierarchy). > We allow users to control their own memcgs by chowning them, so they must be run in the same hierarchy if they want to run their own userspace oom handler. There's nothing in the kernel that prevents that and the user has no other option but to run in a parent cgroup. > Yes, mmap_sem is tricky. Nothing in the proc code should take it for > writing and charges are done with mmap_sem held for reading but that > doesn't prevent from non-oom thread to try to get it for writing which > would block also all future readers. We have also seen i_mutex being > held during charge so you have to be careful about that one as well but > I am not aware of other locks that could be a problem. > > The question is, do you really need to open any /proc// files which > depend on mmap_sem (e.g. maps, smaps). /proc//status should tell you > about used memory. Or put it another way around. What kind of data you > need for your OOM handler? > It's too easy to simply do even a "ps ax" in an oom memcg and make that thread completely unresponsive because it allocates memory. > I might be thinking about different use cases but user space OOM > handlers I have seen so far had quite a good idea what is going on > in the group and who to kill. Then perhaps I'm raising constraints that you've never worked with, I don't know. We choose to have a priority-based approach that is inherited by children; this priority is kept in userspace and and the oom handler would naturally need to know the set of tasks in the oom memcg at the time of oom and their parent-child relationship. These priorities are completely independent of memory usage. > > If the oom notifier is in the oom cgroup, it may not be able to > > successfully read the memcg "tasks" file to even determine the set of > > eligible processes. > > It would have to use preallocated buffer and have mlocked all the memory > that will be used during oom event. > Wrong, the kernel itself allocates memory when reading this information and that would fail in an oom memcg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Thu 30-05-13 13:47:30, David Rientjes wrote: > On Thu, 30 May 2013, Michal Hocko wrote: > > > > Completely disabling the oom killer for a memcg is problematic if > > > userspace is unable to address the condition itself, usually because it > > > is unresponsive. > > > > Isn't this a bug in the userspace oom handler? Why is it unresponsive? It > > shouldn't allocated any memory so nothing should prevent it from running (if > > other tasks are preempting it permanently then the priority of the handler > > should be increased). > > > > Unresponsiveness isn't necessarily only because of memory constraints, you > may have your oom notifier in a parent cgroup that isn't oom. I have always discouraged people from running oom handler in the same memcg (or even in the same hierarchy). > If a process is stuck on mm->mmap_sem in the oom cgroup, though, the > oom notifier may not be able to scrape /proc/pid and attain necessary > information in making an oom kill decision. Yes, mmap_sem is tricky. Nothing in the proc code should take it for writing and charges are done with mmap_sem held for reading but that doesn't prevent from non-oom thread to try to get it for writing which would block also all future readers. We have also seen i_mutex being held during charge so you have to be careful about that one as well but I am not aware of other locks that could be a problem. The question is, do you really need to open any /proc// files which depend on mmap_sem (e.g. maps, smaps). /proc//status should tell you about used memory. Or put it another way around. What kind of data you need for your OOM handler? I might be thinking about different use cases but user space OOM handlers I have seen so far had quite a good idea what is going on in the group and who to kill. So it was more a decision based on the workload and its semantic rather than based on the used memory (which is done quite sensibly with the in kernel handler already) or something that would trip over mmap_sem when trying to get information. > If the oom notifier is in the oom cgroup, it may not be able to > successfully read the memcg "tasks" file to even determine the set of > eligible processes. It would have to use preallocated buffer and have mlocked all the memory that will be used during oom event. > There is also no guarantee that the userspace oom handler will have > the necessary memory to even re-enable the oom killer in the memcg > under oom which would allow the kernel to make forward progress. Why it wouldn't have enough memory to write to the file? Assuming that the oom handler has the file handle (for limit_in_bytes) open, it has mlocked all the necessary memory (code + buffers) then I do not see what would prevent it from writing it to limit_in_bytes. > We've used this for a few years as a complement to oom notifiers so that a > process would have a grace period to deal with the oom condition itself > before allowing the kernel to terminate a process and free memory. We've > simply had no alternative in the presence of kernel constraints that > prevent it from being done in any other way. We _want_ userspace to deal > with the issue but when it cannot collect the necessary information (and > we're not tracing every fork() that every process in a potentially oom > memcg does) to deal with the condition, we want the kernel to step in > instead of relying on an admin to login or a global oom condition. > > If you'd like to debate this issue, I'd be more than happy to do so and > show why this patch is absolutely necessary for inclusion, but I'd ask > that you'd present the code from your userspace oom handler so I can > understand how it works without needing such backup support. I usually do not write those things myself as I am supporting others so I do not have any code handy right now. But I can come up with a simple handler which implements your timeout based killer for peak workloads you have mentioned earlier. That one should be quite easy to do. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Thu 30-05-13 13:47:30, David Rientjes wrote: On Thu, 30 May 2013, Michal Hocko wrote: Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because it is unresponsive. Isn't this a bug in the userspace oom handler? Why is it unresponsive? It shouldn't allocated any memory so nothing should prevent it from running (if other tasks are preempting it permanently then the priority of the handler should be increased). Unresponsiveness isn't necessarily only because of memory constraints, you may have your oom notifier in a parent cgroup that isn't oom. I have always discouraged people from running oom handler in the same memcg (or even in the same hierarchy). If a process is stuck on mm-mmap_sem in the oom cgroup, though, the oom notifier may not be able to scrape /proc/pid and attain necessary information in making an oom kill decision. Yes, mmap_sem is tricky. Nothing in the proc code should take it for writing and charges are done with mmap_sem held for reading but that doesn't prevent from non-oom thread to try to get it for writing which would block also all future readers. We have also seen i_mutex being held during charge so you have to be careful about that one as well but I am not aware of other locks that could be a problem. The question is, do you really need to open any /proc/pid/ files which depend on mmap_sem (e.g. maps, smaps). /proc/pid/status should tell you about used memory. Or put it another way around. What kind of data you need for your OOM handler? I might be thinking about different use cases but user space OOM handlers I have seen so far had quite a good idea what is going on in the group and who to kill. So it was more a decision based on the workload and its semantic rather than based on the used memory (which is done quite sensibly with the in kernel handler already) or something that would trip over mmap_sem when trying to get information. If the oom notifier is in the oom cgroup, it may not be able to successfully read the memcg tasks file to even determine the set of eligible processes. It would have to use preallocated buffer and have mlocked all the memory that will be used during oom event. There is also no guarantee that the userspace oom handler will have the necessary memory to even re-enable the oom killer in the memcg under oom which would allow the kernel to make forward progress. Why it wouldn't have enough memory to write to the file? Assuming that the oom handler has the file handle (for limit_in_bytes) open, it has mlocked all the necessary memory (code + buffers) then I do not see what would prevent it from writing it to limit_in_bytes. We've used this for a few years as a complement to oom notifiers so that a process would have a grace period to deal with the oom condition itself before allowing the kernel to terminate a process and free memory. We've simply had no alternative in the presence of kernel constraints that prevent it from being done in any other way. We _want_ userspace to deal with the issue but when it cannot collect the necessary information (and we're not tracing every fork() that every process in a potentially oom memcg does) to deal with the condition, we want the kernel to step in instead of relying on an admin to login or a global oom condition. If you'd like to debate this issue, I'd be more than happy to do so and show why this patch is absolutely necessary for inclusion, but I'd ask that you'd present the code from your userspace oom handler so I can understand how it works without needing such backup support. I usually do not write those things myself as I am supporting others so I do not have any code handy right now. But I can come up with a simple handler which implements your timeout based killer for peak workloads you have mentioned earlier. That one should be quite easy to do. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Michal Hocko wrote: I have always discouraged people from running oom handler in the same memcg (or even in the same hierarchy). We allow users to control their own memcgs by chowning them, so they must be run in the same hierarchy if they want to run their own userspace oom handler. There's nothing in the kernel that prevents that and the user has no other option but to run in a parent cgroup. Yes, mmap_sem is tricky. Nothing in the proc code should take it for writing and charges are done with mmap_sem held for reading but that doesn't prevent from non-oom thread to try to get it for writing which would block also all future readers. We have also seen i_mutex being held during charge so you have to be careful about that one as well but I am not aware of other locks that could be a problem. The question is, do you really need to open any /proc/pid/ files which depend on mmap_sem (e.g. maps, smaps). /proc/pid/status should tell you about used memory. Or put it another way around. What kind of data you need for your OOM handler? It's too easy to simply do even a ps ax in an oom memcg and make that thread completely unresponsive because it allocates memory. I might be thinking about different use cases but user space OOM handlers I have seen so far had quite a good idea what is going on in the group and who to kill. Then perhaps I'm raising constraints that you've never worked with, I don't know. We choose to have a priority-based approach that is inherited by children; this priority is kept in userspace and and the oom handler would naturally need to know the set of tasks in the oom memcg at the time of oom and their parent-child relationship. These priorities are completely independent of memory usage. If the oom notifier is in the oom cgroup, it may not be able to successfully read the memcg tasks file to even determine the set of eligible processes. It would have to use preallocated buffer and have mlocked all the memory that will be used during oom event. Wrong, the kernel itself allocates memory when reading this information and that would fail in an oom memcg. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 03:22:59, David Rientjes wrote: On Fri, 31 May 2013, Michal Hocko wrote: [...] If the oom notifier is in the oom cgroup, it may not be able to successfully read the memcg tasks file to even determine the set of eligible processes. It would have to use preallocated buffer and have mlocked all the memory that will be used during oom event. Wrong, the kernel itself allocates memory when reading this information and that would fail in an oom memcg. But that memory is not charged to a memcg, is it? So unless you are heading towards global OOM you should be safe. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri 31-05-13 03:22:59, David Rientjes wrote: On Fri, 31 May 2013, Michal Hocko wrote: I have always discouraged people from running oom handler in the same memcg (or even in the same hierarchy). We allow users to control their own memcgs by chowning them, so they must be run in the same hierarchy if they want to run their own userspace oom handler. There's nothing in the kernel that prevents that and the user has no other option but to run in a parent cgroup. If the access to the oom_control file is controlled by the file permissions then the oom handler can live inside root cgroup. Why do you need must be in the same hierarchy requirement? Yes, mmap_sem is tricky. Nothing in the proc code should take it for writing and charges are done with mmap_sem held for reading but that doesn't prevent from non-oom thread to try to get it for writing which would block also all future readers. We have also seen i_mutex being held during charge so you have to be careful about that one as well but I am not aware of other locks that could be a problem. The question is, do you really need to open any /proc/pid/ files which depend on mmap_sem (e.g. maps, smaps). /proc/pid/status should tell you about used memory. Or put it another way around. What kind of data you need for your OOM handler? It's too easy to simply do even a ps ax in an oom memcg and make that thread completely unresponsive because it allocates memory. Yes, but we are talking about oom handler and that one has to be really careful about what it does. So doing something that simply allocates is dangerous. I might be thinking about different use cases but user space OOM handlers I have seen so far had quite a good idea what is going on in the group and who to kill. Then perhaps I'm raising constraints that you've never worked with, I don't know. We choose to have a priority-based approach that is inherited by children; this priority is kept in userspace and and the oom handler would naturally need to know the set of tasks in the oom memcg at the time of oom and their parent-child relationship. These priorities are completely independent of memory usage. OK, but both reading tasks file and readdir should be doable without userspace (aka charged) allocations. Moreover if you run those oom handlers under the root cgroup then it should be even easier. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Fri, 31 May 2013, Michal Hocko wrote: We allow users to control their own memcgs by chowning them, so they must be run in the same hierarchy if they want to run their own userspace oom handler. There's nothing in the kernel that prevents that and the user has no other option but to run in a parent cgroup. If the access to the oom_control file is controlled by the file permissions then the oom handler can live inside root cgroup. Why do you need must be in the same hierarchy requirement? Users obviously don't have the ability to attach processes to the root memcg. They are constrained to their own subtree of memcgs. It's too easy to simply do even a ps ax in an oom memcg and make that thread completely unresponsive because it allocates memory. Yes, but we are talking about oom handler and that one has to be really careful about what it does. So doing something that simply allocates is dangerous. Show me a userspace oom handler that doesn't get notified of every fork() in a memcg, causing a performance degradation of its own for a complete and utter slowpath, that will know the entire process tree of its own memcg or a child memcg. This discussion is all fine and good from a theoretical point of view until you actually have to implement one of these yourself. Multiple users are going to be running their own oom notifiers and without some sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too easily deadlock looking for memory. If that user is constrained to his or her own subtree, as previously stated, there's also no way to login and rectify the situation at that point and requires admin intervention or a reboot. Then perhaps I'm raising constraints that you've never worked with, I don't know. We choose to have a priority-based approach that is inherited by children; this priority is kept in userspace and and the oom handler would naturally need to know the set of tasks in the oom memcg at the time of oom and their parent-child relationship. These priorities are completely independent of memory usage. OK, but both reading tasks file and readdir should be doable without userspace (aka charged) allocations. Moreover if you run those oom handlers under the root cgroup then it should be even easier. Then why does cat tasks stall when my memcg is totally depleted of all memory? This isn't even the argument because memory.oom_delay_millisecs isn't going to help that situation. I'm talking about a failsafe that ensures a memcg can't deadlock. The global oom killer will always have to exist in the kernel, at least in the most simplistic of forms, solely for this reason; you can't move all of the logic to userspace and expect it to react 100% of the time. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Wed, 29 May 2013 18:18:10 -0700 (PDT) David Rientjes rient...@google.com wrote: Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because it is unresponsive. This scenario creates a memcg deadlock: tasks are sitting in TASK_KILLABLE waiting for the limit to be increased, a task to exit or move, or the oom killer reenabled and userspace is unable to do so. An additional possible use case is to defer oom killing within a memcg for a set period of time, probably to prevent unnecessary kills due to temporary memory spikes, before allowing the kernel to handle the condition. This patch adds an oom killer delay so that a memcg may be configured to wait at least a pre-defined number of milliseconds before calling the oom killer. If the oom condition persists for this number of milliseconds, the oom killer will be called the next time the memory controller attempts to charge a page (and memory.oom_control is set to 0). This allows userspace to have a short period of time to respond to the condition before deferring to the kernel to kill a task. Admins may set the oom killer delay using the new interface: # echo 6 memory.oom_delay_millisecs This will defer oom killing to the kernel only after 60 seconds has elapsed by putting the task to sleep for 60 seconds. How often is that delay actually useful, in the real world? IOW, in what proportion of cases does the system just remain stuck for 60 seconds and then get an oom-killing? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Thu, 30 May 2013, Michal Hocko wrote: > > Completely disabling the oom killer for a memcg is problematic if > > userspace is unable to address the condition itself, usually because it > > is unresponsive. > > Isn't this a bug in the userspace oom handler? Why is it unresponsive? It > shouldn't allocated any memory so nothing should prevent it from running (if > other tasks are preempting it permanently then the priority of the handler > should be increased). > Unresponsiveness isn't necessarily only because of memory constraints, you may have your oom notifier in a parent cgroup that isn't oom. If a process is stuck on mm->mmap_sem in the oom cgroup, though, the oom notifier may not be able to scrape /proc/pid and attain necessary information in making an oom kill decision. If the oom notifier is in the oom cgroup, it may not be able to successfully read the memcg "tasks" file to even determine the set of eligible processes. There is also no guarantee that the userspace oom handler will have the necessary memory to even re-enable the oom killer in the memcg under oom which would allow the kernel to make forward progress. We've used this for a few years as a complement to oom notifiers so that a process would have a grace period to deal with the oom condition itself before allowing the kernel to terminate a process and free memory. We've simply had no alternative in the presence of kernel constraints that prevent it from being done in any other way. We _want_ userspace to deal with the issue but when it cannot collect the necessary information (and we're not tracing every fork() that every process in a potentially oom memcg does) to deal with the condition, we want the kernel to step in instead of relying on an admin to login or a global oom condition. If you'd like to debate this issue, I'd be more than happy to do so and show why this patch is absolutely necessary for inclusion, but I'd ask that you'd present the code from your userspace oom handler so I can understand how it works without needing such backup support. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Wed 29-05-13 18:18:10, David Rientjes wrote: > Completely disabling the oom killer for a memcg is problematic if > userspace is unable to address the condition itself, usually because it > is unresponsive. Isn't this a bug in the userspace oom handler? Why is it unresponsive? It shouldn't allocated any memory so nothing should prevent it from running (if other tasks are preempting it permanently then the priority of the handler should be increased). > This scenario creates a memcg deadlock: tasks are > sitting in TASK_KILLABLE waiting for the limit to be increased, a task to > exit or move, or the oom killer reenabled and userspace is unable to do > so. > > An additional possible use case is to defer oom killing within a memcg > for a set period of time, probably to prevent unnecessary kills due to > temporary memory spikes, before allowing the kernel to handle the > condition. I am not sure I like the idea. How does an admin decide what is the right value of the timeout? And why he cannot use userspace oom handler to do the same thing? [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Wed 29-05-13 18:18:10, David Rientjes wrote: Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because it is unresponsive. Isn't this a bug in the userspace oom handler? Why is it unresponsive? It shouldn't allocated any memory so nothing should prevent it from running (if other tasks are preempting it permanently then the priority of the handler should be increased). This scenario creates a memcg deadlock: tasks are sitting in TASK_KILLABLE waiting for the limit to be increased, a task to exit or move, or the oom killer reenabled and userspace is unable to do so. An additional possible use case is to defer oom killing within a memcg for a set period of time, probably to prevent unnecessary kills due to temporary memory spikes, before allowing the kernel to handle the condition. I am not sure I like the idea. How does an admin decide what is the right value of the timeout? And why he cannot use userspace oom handler to do the same thing? [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] mm, memcg: add oom killer delay
On Thu, 30 May 2013, Michal Hocko wrote: Completely disabling the oom killer for a memcg is problematic if userspace is unable to address the condition itself, usually because it is unresponsive. Isn't this a bug in the userspace oom handler? Why is it unresponsive? It shouldn't allocated any memory so nothing should prevent it from running (if other tasks are preempting it permanently then the priority of the handler should be increased). Unresponsiveness isn't necessarily only because of memory constraints, you may have your oom notifier in a parent cgroup that isn't oom. If a process is stuck on mm-mmap_sem in the oom cgroup, though, the oom notifier may not be able to scrape /proc/pid and attain necessary information in making an oom kill decision. If the oom notifier is in the oom cgroup, it may not be able to successfully read the memcg tasks file to even determine the set of eligible processes. There is also no guarantee that the userspace oom handler will have the necessary memory to even re-enable the oom killer in the memcg under oom which would allow the kernel to make forward progress. We've used this for a few years as a complement to oom notifiers so that a process would have a grace period to deal with the oom condition itself before allowing the kernel to terminate a process and free memory. We've simply had no alternative in the presence of kernel constraints that prevent it from being done in any other way. We _want_ userspace to deal with the issue but when it cannot collect the necessary information (and we're not tracing every fork() that every process in a potentially oom memcg does) to deal with the condition, we want the kernel to step in instead of relying on an admin to login or a global oom condition. If you'd like to debate this issue, I'd be more than happy to do so and show why this patch is absolutely necessary for inclusion, but I'd ask that you'd present the code from your userspace oom handler so I can understand how it works without needing such backup support. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/