Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Mon 20-01-14 22:13:21, David Rientjes wrote: > On Thu, 16 Jan 2014, Michal Hocko wrote: > > > > The heuristic may have existed for ages, but the proposed memcg > > > configuration for preserving memory such that userspace oom handlers may > > > run such as > > > > > >_root__ > > > / \ > > > user oom > > > /\ / \ > > > AB a b > > > > > > where user/memory.limit_in_bytes == [amount of present RAM] + > > > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be > > > problematic, including Johannes' buggy bypass for charges in memcgs with > > > pending memcgs that has since been fixed after I identified it. This > > > bypass is included. Processes attached to "a" and "b" are userspace oom > > > handlers for processes attached to "A" and "B", respectively. > > > > > > The amount of memory you're talking about is proportional to the number > > > of > > > processes that have pending SIGKILLs (and now those with PF_EXITING set), > > > the former of which is obviously more concerning since they could be > > > charging memory at any point in the kernel that would succeed. > > > > I understand your concerns. Yes, excessive charges might be dangerous. I > > haven't dismissed that when you mentioned it earlier. I am just > > repeatedly asking how much memory are we talking about, how real is the > > issue and what are all the other conseqeunces. And for some reason you > > are not providing that information (or maybe I am just not seeing that > > in your responses) and that is why we are stuck in circle. > > > > Wtf are you talking about? You're adding a bypass in this patch and then > you're asking me to go and see how much memory it could potentially bypass > and take away from oom handlers under the above memcg configuration? No. You are mixing two things. One of them is adding PF_EXITING bypass while the other is removing fatal_signal_pending bypass. The first one is a subset of the later and it doesn't add an excessive amount of charges because there are no direct allocations after exit_signals. You haven't shown that this is not true and your only concern was that this might change in future. Besides that my argument was that even if such an allocation led to the global OOM the task would be given TIF_MEMDIE and nothing would be killed. The other part is fatal_signal_pending which we have there for ages and you want to remove it. In order to do that I am asking you for some data backing up that removal. You keep repeating your arguments but they lack data or at least show code paths which would wildly allocate after task has been killed which wouldn't be fixable by fatal_signal_pending check in the caller to show that the issue is real. Besides that you are completely ignoring other concerns I have mentioned, e.g. possible performance regressions when a pointless reclaim slows existing tags. Please try to understand that this is not Black thing. > This seems like something you should provide before throwing out > patches that nobody has tested if you want to make the argument that > the above memcg configuration is valid for handling userspace oom > notifications. > > And you certainly have dismissed what I've mentioned earlier when I said > that anybody can add memory allocation to the exit path later on and > nobody is going to think about how much memory this is going to bypass to > the root memcg and potentially take away from userspace oom handlers. If this happens then it has to be fixed and if not fixable then reconsider this heuristic. > There's two possible ways to forward this: > > - avoid bypass to the root memcg in every possible case such that the >above memcg configuration actually makes a guarantee to userspace oom >handlers attached to it, or > > - provide per-memcg memory reserves such that userspace oom handlers can >allocate and charge memory without the above memcg configuration so >there is a guarantee. David, you are aware that there are memory allocations that are out of memcg/kmem scope, aren't you? This means that whether you add memcg charge-reserves or access to memory reserves to memcg OOM killers then you still can never rule out the global OOM killer. > What's not acceptable, now or ever, is suggesting a solution to a problem > that is supposed to guarantee some resource and then allow under some > circumstances that resource to be completely depleted such that the > solution never works. And yet you still haven't shown that such depletion is real. E.g. g-u-p backs off when it sees fatal signal pending other callers that allocate charged memory should do the same. > > Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that > > should terminate in a short time. I think that having a setup with a > > guarantee of never triggering the
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Mon 20-01-14 22:13:21, David Rientjes wrote: On Thu, 16 Jan 2014, Michal Hocko wrote: The heuristic may have existed for ages, but the proposed memcg configuration for preserving memory such that userspace oom handlers may run such as _root__ / \ user oom /\ / \ AB a b where user/memory.limit_in_bytes == [amount of present RAM] + oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be problematic, including Johannes' buggy bypass for charges in memcgs with pending memcgs that has since been fixed after I identified it. This bypass is included. Processes attached to a and b are userspace oom handlers for processes attached to A and B, respectively. The amount of memory you're talking about is proportional to the number of processes that have pending SIGKILLs (and now those with PF_EXITING set), the former of which is obviously more concerning since they could be charging memory at any point in the kernel that would succeed. I understand your concerns. Yes, excessive charges might be dangerous. I haven't dismissed that when you mentioned it earlier. I am just repeatedly asking how much memory are we talking about, how real is the issue and what are all the other conseqeunces. And for some reason you are not providing that information (or maybe I am just not seeing that in your responses) and that is why we are stuck in circle. Wtf are you talking about? You're adding a bypass in this patch and then you're asking me to go and see how much memory it could potentially bypass and take away from oom handlers under the above memcg configuration? No. You are mixing two things. One of them is adding PF_EXITING bypass while the other is removing fatal_signal_pending bypass. The first one is a subset of the later and it doesn't add an excessive amount of charges because there are no direct allocations after exit_signals. You haven't shown that this is not true and your only concern was that this might change in future. Besides that my argument was that even if such an allocation led to the global OOM the task would be given TIF_MEMDIE and nothing would be killed. The other part is fatal_signal_pending which we have there for ages and you want to remove it. In order to do that I am asking you for some data backing up that removal. You keep repeating your arguments but they lack data or at least show code paths which would wildly allocatecharge after task has been killed which wouldn't be fixable by fatal_signal_pending check in the caller to show that the issue is real. Besides that you are completely ignoring other concerns I have mentioned, e.g. possible performance regressions when a pointless reclaim slows existing tags. Please try to understand that this is not BlackWhite thing. This seems like something you should provide before throwing out patches that nobody has tested if you want to make the argument that the above memcg configuration is valid for handling userspace oom notifications. And you certainly have dismissed what I've mentioned earlier when I said that anybody can add memory allocation to the exit path later on and nobody is going to think about how much memory this is going to bypass to the root memcg and potentially take away from userspace oom handlers. If this happens then it has to be fixed and if not fixable then reconsider this heuristic. There's two possible ways to forward this: - avoid bypass to the root memcg in every possible case such that the above memcg configuration actually makes a guarantee to userspace oom handlers attached to it, or - provide per-memcg memory reserves such that userspace oom handlers can allocate and charge memory without the above memcg configuration so there is a guarantee. David, you are aware that there are memory allocations that are out of memcg/kmem scope, aren't you? This means that whether you add memcg charge-reserves or access to memory reserves to memcg OOM killers then you still can never rule out the global OOM killer. What's not acceptable, now or ever, is suggesting a solution to a problem that is supposed to guarantee some resource and then allow under some circumstances that resource to be completely depleted such that the solution never works. And yet you still haven't shown that such depletion is real. E.g. g-u-p backs off when it sees fatal signal pending other callers that allocate charged memory should do the same. Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that should terminate in a short time. I think that having a setup with a guarantee of never triggering the global OOM is too ambitious and I am even skeptical it would be achievable. Short time is meaningless if the
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Thu, 16 Jan 2014, Michal Hocko wrote: > > The heuristic may have existed for ages, but the proposed memcg > > configuration for preserving memory such that userspace oom handlers may > > run such as > > > > _root__ > > / \ > > user oom > >/\ / \ > >AB a b > > > > where user/memory.limit_in_bytes == [amount of present RAM] + > > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be > > problematic, including Johannes' buggy bypass for charges in memcgs with > > pending memcgs that has since been fixed after I identified it. This > > bypass is included. Processes attached to "a" and "b" are userspace oom > > handlers for processes attached to "A" and "B", respectively. > > > > The amount of memory you're talking about is proportional to the number of > > processes that have pending SIGKILLs (and now those with PF_EXITING set), > > the former of which is obviously more concerning since they could be > > charging memory at any point in the kernel that would succeed. > > I understand your concerns. Yes, excessive charges might be dangerous. I > haven't dismissed that when you mentioned it earlier. I am just > repeatedly asking how much memory are we talking about, how real is the > issue and what are all the other conseqeunces. And for some reason you > are not providing that information (or maybe I am just not seeing that > in your responses) and that is why we are stuck in circle. > Wtf are you talking about? You're adding a bypass in this patch and then you're asking me to go and see how much memory it could potentially bypass and take away from oom handlers under the above memcg configuration? This seems like something you should provide before throwing out patches that nobody has tested if you want to make the argument that the above memcg configuration is valid for handling userspace oom notifications. And you certainly have dismissed what I've mentioned earlier when I said that anybody can add memory allocation to the exit path later on and nobody is going to think about how much memory this is going to bypass to the root memcg and potentially take away from userspace oom handlers. There's two possible ways to forward this: - avoid bypass to the root memcg in every possible case such that the above memcg configuration actually makes a guarantee to userspace oom handlers attached to it, or - provide per-memcg memory reserves such that userspace oom handlers can allocate and charge memory without the above memcg configuration so there is a guarantee. What's not acceptable, now or ever, is suggesting a solution to a problem that is supposed to guarantee some resource and then allow under some circumstances that resource to be completely depleted such that the solution never works. > Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that > should terminate in a short time. I think that having a setup with a > guarantee of never triggering the global OOM is too ambitious and I am > even skeptical it would be achievable. > "Short time" is meaningless if the memory allocation causes memory to not be available to userspace oom handlers. If allocations are allowed to be charged because you're in the exit() path or because you have SIGKILL, that can result in a system oom condition that would prevent userspace from being able to handle them. > > I'm debating both fatal_signal_pending() and PF_EXITING here since they > > are now both bypasses, we need to remove fatal_signal_pending(). My > > simple question with your patch: how do you guarantee memory to processes > > attached to "a" and "b"? > > The only way you can get that _guarantee_ is to account all the memory > allocations. And that is not implemented and I would even question > whether it is worthwhile. So we still have to live with a possibility > of triggering the global OOM killer. That's why I believe we need to be > able to tell the kernel what is the user policy for oom killer (that is > a different discussion though). > So you're saying that Tejun's suggested userspace oom handler configuration is pointless, correct? We can certainly provide a guarantee if memory is reserved specifically for userspace oom handling like I proposed, the same way that memory reserves are guaranteed for oom killed processes. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Thu, 16 Jan 2014, Michal Hocko wrote: The heuristic may have existed for ages, but the proposed memcg configuration for preserving memory such that userspace oom handlers may run such as _root__ / \ user oom /\ / \ AB a b where user/memory.limit_in_bytes == [amount of present RAM] + oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be problematic, including Johannes' buggy bypass for charges in memcgs with pending memcgs that has since been fixed after I identified it. This bypass is included. Processes attached to a and b are userspace oom handlers for processes attached to A and B, respectively. The amount of memory you're talking about is proportional to the number of processes that have pending SIGKILLs (and now those with PF_EXITING set), the former of which is obviously more concerning since they could be charging memory at any point in the kernel that would succeed. I understand your concerns. Yes, excessive charges might be dangerous. I haven't dismissed that when you mentioned it earlier. I am just repeatedly asking how much memory are we talking about, how real is the issue and what are all the other conseqeunces. And for some reason you are not providing that information (or maybe I am just not seeing that in your responses) and that is why we are stuck in circle. Wtf are you talking about? You're adding a bypass in this patch and then you're asking me to go and see how much memory it could potentially bypass and take away from oom handlers under the above memcg configuration? This seems like something you should provide before throwing out patches that nobody has tested if you want to make the argument that the above memcg configuration is valid for handling userspace oom notifications. And you certainly have dismissed what I've mentioned earlier when I said that anybody can add memory allocation to the exit path later on and nobody is going to think about how much memory this is going to bypass to the root memcg and potentially take away from userspace oom handlers. There's two possible ways to forward this: - avoid bypass to the root memcg in every possible case such that the above memcg configuration actually makes a guarantee to userspace oom handlers attached to it, or - provide per-memcg memory reserves such that userspace oom handlers can allocate and charge memory without the above memcg configuration so there is a guarantee. What's not acceptable, now or ever, is suggesting a solution to a problem that is supposed to guarantee some resource and then allow under some circumstances that resource to be completely depleted such that the solution never works. Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that should terminate in a short time. I think that having a setup with a guarantee of never triggering the global OOM is too ambitious and I am even skeptical it would be achievable. Short time is meaningless if the memory allocation causes memory to not be available to userspace oom handlers. If allocations are allowed to be charged because you're in the exit() path or because you have SIGKILL, that can result in a system oom condition that would prevent userspace from being able to handle them. I'm debating both fatal_signal_pending() and PF_EXITING here since they are now both bypasses, we need to remove fatal_signal_pending(). My simple question with your patch: how do you guarantee memory to processes attached to a and b? The only way you can get that _guarantee_ is to account all the memory allocations. And that is not implemented and I would even question whether it is worthwhile. So we still have to live with a possibility of triggering the global OOM killer. That's why I believe we need to be able to tell the kernel what is the user policy for oom killer (that is a different discussion though). So you're saying that Tejun's suggested userspace oom handler configuration is pointless, correct? We can certainly provide a guarantee if memory is reserved specifically for userspace oom handling like I proposed, the same way that memory reserves are guaranteed for oom killed processes. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Wed 15-01-14 13:19:21, David Rientjes wrote: > On Wed, 15 Jan 2014, Michal Hocko wrote: > > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > > > index b8dfed1b9d87..b86fbb04b7c6 100644 > > > > > > --- a/mm/memcontrol.c > > > > > > +++ b/mm/memcontrol.c > > > > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct > > > > > > mm_struct *mm, > > > > > > * MEMDIE process. > > > > > > */ > > > > > > if (unlikely(test_thread_flag(TIF_MEMDIE) > > > > > > -|| fatal_signal_pending(current))) > > > > > > +|| fatal_signal_pending(current)) > > > > > > +|| current->flags & PF_EXITING) > > > > > > goto bypass; > > > > > > > > > > > > if (unlikely(task_in_memcg_oom(current))) > > > > > > > > > > This would become problematic if significant amount of memory is > > > > > charged > > > > > in the exit() path. > > > > > > > > But this would hurt also for fatal_signal_pending tasks, wouldn't it? > > > > > > Yes, and as I've said twice now, that should be removed. > > > > And you failed to provide any relevant data to back your suggestions. I > > have told you that we have these heuristics for ages and we need a > > strong justification to drop them. So if you really think that they are > > not appropriate then back your statements with real data. > > > > E.g. measure how much memory are we talking about. > > The heuristic may have existed for ages, but the proposed memcg > configuration for preserving memory such that userspace oom handlers may > run such as > >_root__ > / \ > user oom > /\ / \ > AB a b > > where user/memory.limit_in_bytes == [amount of present RAM] + > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be > problematic, including Johannes' buggy bypass for charges in memcgs with > pending memcgs that has since been fixed after I identified it. This > bypass is included. Processes attached to "a" and "b" are userspace oom > handlers for processes attached to "A" and "B", respectively. > > The amount of memory you're talking about is proportional to the number of > processes that have pending SIGKILLs (and now those with PF_EXITING set), > the former of which is obviously more concerning since they could be > charging memory at any point in the kernel that would succeed. I understand your concerns. Yes, excessive charges might be dangerous. I haven't dismissed that when you mentioned it earlier. I am just repeatedly asking how much memory are we talking about, how real is the issue and what are all the other conseqeunces. And for some reason you are not providing that information (or maybe I am just not seeing that in your responses) and that is why we are stuck in circle. For example your some_fudge needs to consider memory which is not accounted for. That is rather hard to predict because it depends on drivers (or whoever calls page allocator directly) you have and the current load. Also all tasks living in the root memcg are not accounted. So you would need to have some pillow to be safe. Do fatal_signal_pending tasks allocate way much more than your pillow? I am also not sure whether fatal_signal_pending failing charge (without going via OOM which would set TIF_MEMDIE) could prevent exiting task. Finally there is still a risk for regressions when a killed task causes a pointless reclaim just to free few pages for a task which will free memory few seconds later. > The latter > is concerning only if future memory allocation post-PF_EXITING would be > become significant and nobody is going to think about oom memcg bypasses > in such a case. > > To use the configuration suggested above, we need to prevent as many > bypasses as possible to the root memcg. Otherwise, the memory protected > for the "oom" memcg from processes constrained by the limit of "user" is > no longer protected. This isn't only a problem with the bypasses here in > the charging path, but also unaccounted kernel memory, for example. Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that should terminate in a short time. I think that having a setup with a guarantee of never triggering the global OOM is too ambitious and I am even skeptical it would be achievable. > For this to be usable, we need to ensure that the limit of the "oom" memcg > is protected for the userspace oom handlers that are attached. With a > charge bypassed to the root memcg greater than or equal to the limit of > the "oom" memcg OR cumulative charges bypassed to the root memcg greater > than or equal to the limit of the "oom" memcg by processes with pending > SIGKILLs, userspace oom handlers cannot respond. That's particuarly > dangerous without a memcg oom kill delay, as proposed before, since >
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Wed 15-01-14 13:19:21, David Rientjes wrote: On Wed, 15 Jan 2014, Michal Hocko wrote: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8dfed1b9d87..b86fbb04b7c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, * MEMDIE process. */ if (unlikely(test_thread_flag(TIF_MEMDIE) -|| fatal_signal_pending(current))) +|| fatal_signal_pending(current)) +|| current-flags PF_EXITING) goto bypass; if (unlikely(task_in_memcg_oom(current))) This would become problematic if significant amount of memory is charged in the exit() path. But this would hurt also for fatal_signal_pending tasks, wouldn't it? Yes, and as I've said twice now, that should be removed. And you failed to provide any relevant data to back your suggestions. I have told you that we have these heuristics for ages and we need a strong justification to drop them. So if you really think that they are not appropriate then back your statements with real data. E.g. measure how much memory are we talking about. The heuristic may have existed for ages, but the proposed memcg configuration for preserving memory such that userspace oom handlers may run such as _root__ / \ user oom /\ / \ AB a b where user/memory.limit_in_bytes == [amount of present RAM] + oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be problematic, including Johannes' buggy bypass for charges in memcgs with pending memcgs that has since been fixed after I identified it. This bypass is included. Processes attached to a and b are userspace oom handlers for processes attached to A and B, respectively. The amount of memory you're talking about is proportional to the number of processes that have pending SIGKILLs (and now those with PF_EXITING set), the former of which is obviously more concerning since they could be charging memory at any point in the kernel that would succeed. I understand your concerns. Yes, excessive charges might be dangerous. I haven't dismissed that when you mentioned it earlier. I am just repeatedly asking how much memory are we talking about, how real is the issue and what are all the other conseqeunces. And for some reason you are not providing that information (or maybe I am just not seeing that in your responses) and that is why we are stuck in circle. For example your some_fudge needs to consider memory which is not accounted for. That is rather hard to predict because it depends on drivers (or whoever calls page allocator directly) you have and the current load. Also all tasks living in the root memcg are not accounted. So you would need to have some pillow to be safe. Do fatal_signal_pending tasks allocate way much more than your pillow? I am also not sure whether fatal_signal_pending failing charge (without going via OOM which would set TIF_MEMDIE) could prevent exiting task. Finally there is still a risk for regressions when a killed task causes a pointless reclaim just to free few pages for a task which will free memory few seconds later. The latter is concerning only if future memory allocation post-PF_EXITING would be become significant and nobody is going to think about oom memcg bypasses in such a case. To use the configuration suggested above, we need to prevent as many bypasses as possible to the root memcg. Otherwise, the memory protected for the oom memcg from processes constrained by the limit of user is no longer protected. This isn't only a problem with the bypasses here in the charging path, but also unaccounted kernel memory, for example. Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that should terminate in a short time. I think that having a setup with a guarantee of never triggering the global OOM is too ambitious and I am even skeptical it would be achievable. For this to be usable, we need to ensure that the limit of the oom memcg is protected for the userspace oom handlers that are attached. With a charge bypassed to the root memcg greater than or equal to the limit of the oom memcg OR cumulative charges bypassed to the root memcg greater than or equal to the limit of the oom memcg by processes with pending SIGKILLs, userspace oom handlers cannot respond. That's particuarly dangerous without a memcg oom kill delay, as proposed before, since userspace must disable oom killing entirely for both A and B for userspace notification to be meaningful, since all processes are now livelocked. These bypasses should be given to one thread and
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Wed, 15 Jan 2014, Michal Hocko wrote: > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > > index b8dfed1b9d87..b86fbb04b7c6 100644 > > > > > --- a/mm/memcontrol.c > > > > > +++ b/mm/memcontrol.c > > > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct > > > > > mm_struct *mm, > > > > >* MEMDIE process. > > > > >*/ > > > > > if (unlikely(test_thread_flag(TIF_MEMDIE) > > > > > - || fatal_signal_pending(current))) > > > > > + || fatal_signal_pending(current)) > > > > > + || current->flags & PF_EXITING) > > > > > goto bypass; > > > > > > > > > > if (unlikely(task_in_memcg_oom(current))) > > > > > > > > This would become problematic if significant amount of memory is > > > > charged > > > > in the exit() path. > > > > > > But this would hurt also for fatal_signal_pending tasks, wouldn't it? > > > > Yes, and as I've said twice now, that should be removed. > > And you failed to provide any relevant data to back your suggestions. I > have told you that we have these heuristics for ages and we need a > strong justification to drop them. So if you really think that they are > not appropriate then back your statements with real data. > > E.g. measure how much memory are we talking about. > The heuristic may have existed for ages, but the proposed memcg configuration for preserving memory such that userspace oom handlers may run such as _root__ / \ user oom /\ / \ AB a b where user/memory.limit_in_bytes == [amount of present RAM] + oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be problematic, including Johannes' buggy bypass for charges in memcgs with pending memcgs that has since been fixed after I identified it. This bypass is included. Processes attached to "a" and "b" are userspace oom handlers for processes attached to "A" and "B", respectively. The amount of memory you're talking about is proportional to the number of processes that have pending SIGKILLs (and now those with PF_EXITING set), the former of which is obviously more concerning since they could be charging memory at any point in the kernel that would succeed. The latter is concerning only if future memory allocation post-PF_EXITING would be become significant and nobody is going to think about oom memcg bypasses in such a case. To use the configuration suggested above, we need to prevent as many bypasses as possible to the root memcg. Otherwise, the memory protected for the "oom" memcg from processes constrained by the limit of "user" is no longer protected. This isn't only a problem with the bypasses here in the charging path, but also unaccounted kernel memory, for example. For this to be usable, we need to ensure that the limit of the "oom" memcg is protected for the userspace oom handlers that are attached. With a charge bypassed to the root memcg greater than or equal to the limit of the "oom" memcg OR cumulative charges bypassed to the root memcg greater than or equal to the limit of the "oom" memcg by processes with pending SIGKILLs, userspace oom handlers cannot respond. That's particuarly dangerous without a memcg oom kill delay, as proposed before, since userspace must disable oom killing entirely for both "A" and "B" for userspace notification to be meaningful, since all processes are now livelocked. > > These bypasses should be given to one thread and one thread only, > > which would be the oom killed thread if it needs access to memory > > reserves to either allocate memory or charge memory. > > There is no way to determine whether a task has been killed due to user > space OOM killer or by a regular kill. > I'm referring to only granting TIF_MEMDIE to a single process in any memcg hierarchy at or below the memcg that has encountered its limit to avoid granting it to many processes and bypassing their charges to the root memcg; the same variation of the above code, but going through the memcg oom killer to get TIF_MEMDIE first. We must be vigilant and only grant TIF_MEMDIE for the process that shall exit. > > If you are suggesting we use the "user" and "oom" top-level memcg > > hierarchy for allowing memory to be available for userspace system oom > > handlers, then this has become important when in the past it may have been > > a minor point. > > I am not sure it would be _that_ important and if that really becomes to > be the case then we should deal with it. So far I haven't see any > evidence there is a lot of memory charged on the exit path. > I'm debating both fatal_signal_pending() and PF_EXITING here since they are now both bypasses, we need to remove fatal_signal_pending(). My simple question with your patch: how do you guarantee
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Fri 10-01-14 13:33:01, David Rientjes wrote: > On Fri, 10 Jan 2014, Michal Hocko wrote: > > > > > --- > > > > mm/memcontrol.c | 3 ++- > > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > index b8dfed1b9d87..b86fbb04b7c6 100644 > > > > --- a/mm/memcontrol.c > > > > +++ b/mm/memcontrol.c > > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct > > > > mm_struct *mm, > > > > * MEMDIE process. > > > > */ > > > > if (unlikely(test_thread_flag(TIF_MEMDIE) > > > > -|| fatal_signal_pending(current))) > > > > +|| fatal_signal_pending(current)) > > > > +|| current->flags & PF_EXITING) > > > > goto bypass; > > > > > > > > if (unlikely(task_in_memcg_oom(current))) > > > > > > This would become problematic if significant amount of memory is charged > > > in the exit() path. > > > > But this would hurt also for fatal_signal_pending tasks, wouldn't it? > > Yes, and as I've said twice now, that should be removed. And you failed to provide any relevant data to back your suggestions. I have told you that we have these heuristics for ages and we need a strong justification to drop them. So if you really think that they are not appropriate then back your statements with real data. E.g. measure how much memory are we talking about. > These bypasses should be given to one thread and one thread only, > which would be the oom killed thread if it needs access to memory > reserves to either allocate memory or charge memory. There is no way to determine whether a task has been killed due to user space OOM killer or by a regular kill. > If you are suggesting we use the "user" and "oom" top-level memcg > hierarchy for allowing memory to be available for userspace system oom > handlers, then this has become important when in the past it may have been > a minor point. I am not sure it would be _that_ important and if that really becomes to be the case then we should deal with it. So far I haven't see any evidence there is a lot of memory charged on the exit path. > > Besides that I do not see any source of allocation after exit_signals. > > > > That's fine for today but may not be in the future. If memory allocation > is done after PF_EXITING in the future, are people going to check memcg > bypasses? No. And now we have additional memory bypass to root that will > cause our userspace system oom hanlders to be oom themselves with the > suggested configuration. > > Using the "user" and "oom" top-level memcg hierarchy is a double edged > sword, we must attempt to prevent all of these bypasses as much as > possible. The only relevant bypass here is for TIF_MEMDIE which would be > set if necessary for the one thread that needs it. TIF_MEMDIE doesn't work for userspace OOM killers. So we cannot rely on this flag currently. > > > I don't know of an egregious amount of memory being > > > allocated and charged after PF_EXITING is set, but if it happens in the > > > future then this could potentially cause system oom conditions even in > > > memcg configurations > > > > Even if that happens then the global OOM killer would give the exiting > > task access to memory reserves and wouldn't kill anything else. > > > > So I am not sure what problem do you see exactly. > > > > Userspace system oom handlers being able to handle memcg oom conditions in > the top-level "user" memcg as proposed by Tejun. If the global oom killer > becomes a part of that discussion at all, then the userspace system oom > handler never got a chance to handle the "user" oom. > > > Besides that allocating egregious amount of memory after exit_signals > > sounds fundamentally broken to me. > > > > Egregious could be defined as allocating a few bytes multiplied by > thousands of threads in PF_EXITING. Does this happen in the real life. Look, I have no objections to make the OOM handling better but it would help a lot to build new heuristics based on some data in hands. I tried to repeat that again and again but it seems to not help. I do not want to end up with new sets of heuristics that break other stuff jut because they made sense in the context of the specific usecase. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Fri 10-01-14 13:33:01, David Rientjes wrote: On Fri, 10 Jan 2014, Michal Hocko wrote: --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8dfed1b9d87..b86fbb04b7c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, * MEMDIE process. */ if (unlikely(test_thread_flag(TIF_MEMDIE) -|| fatal_signal_pending(current))) +|| fatal_signal_pending(current)) +|| current-flags PF_EXITING) goto bypass; if (unlikely(task_in_memcg_oom(current))) This would become problematic if significant amount of memory is charged in the exit() path. But this would hurt also for fatal_signal_pending tasks, wouldn't it? Yes, and as I've said twice now, that should be removed. And you failed to provide any relevant data to back your suggestions. I have told you that we have these heuristics for ages and we need a strong justification to drop them. So if you really think that they are not appropriate then back your statements with real data. E.g. measure how much memory are we talking about. These bypasses should be given to one thread and one thread only, which would be the oom killed thread if it needs access to memory reserves to either allocate memory or charge memory. There is no way to determine whether a task has been killed due to user space OOM killer or by a regular kill. If you are suggesting we use the user and oom top-level memcg hierarchy for allowing memory to be available for userspace system oom handlers, then this has become important when in the past it may have been a minor point. I am not sure it would be _that_ important and if that really becomes to be the case then we should deal with it. So far I haven't see any evidence there is a lot of memory charged on the exit path. Besides that I do not see any source of allocation after exit_signals. That's fine for today but may not be in the future. If memory allocation is done after PF_EXITING in the future, are people going to check memcg bypasses? No. And now we have additional memory bypass to root that will cause our userspace system oom hanlders to be oom themselves with the suggested configuration. Using the user and oom top-level memcg hierarchy is a double edged sword, we must attempt to prevent all of these bypasses as much as possible. The only relevant bypass here is for TIF_MEMDIE which would be set if necessary for the one thread that needs it. TIF_MEMDIE doesn't work for userspace OOM killers. So we cannot rely on this flag currently. I don't know of an egregious amount of memory being allocated and charged after PF_EXITING is set, but if it happens in the future then this could potentially cause system oom conditions even in memcg configurations Even if that happens then the global OOM killer would give the exiting task access to memory reserves and wouldn't kill anything else. So I am not sure what problem do you see exactly. Userspace system oom handlers being able to handle memcg oom conditions in the top-level user memcg as proposed by Tejun. If the global oom killer becomes a part of that discussion at all, then the userspace system oom handler never got a chance to handle the user oom. Besides that allocating egregious amount of memory after exit_signals sounds fundamentally broken to me. Egregious could be defined as allocating a few bytes multiplied by thousands of threads in PF_EXITING. Does this happen in the real life. Look, I have no objections to make the OOM handling better but it would help a lot to build new heuristics based on some data in hands. I tried to repeat that again and again but it seems to not help. I do not want to end up with new sets of heuristics that break other stuff jut because they made sense in the context of the specific usecase. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Wed, 15 Jan 2014, Michal Hocko wrote: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8dfed1b9d87..b86fbb04b7c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, * MEMDIE process. */ if (unlikely(test_thread_flag(TIF_MEMDIE) - || fatal_signal_pending(current))) + || fatal_signal_pending(current)) + || current-flags PF_EXITING) goto bypass; if (unlikely(task_in_memcg_oom(current))) This would become problematic if significant amount of memory is charged in the exit() path. But this would hurt also for fatal_signal_pending tasks, wouldn't it? Yes, and as I've said twice now, that should be removed. And you failed to provide any relevant data to back your suggestions. I have told you that we have these heuristics for ages and we need a strong justification to drop them. So if you really think that they are not appropriate then back your statements with real data. E.g. measure how much memory are we talking about. The heuristic may have existed for ages, but the proposed memcg configuration for preserving memory such that userspace oom handlers may run such as _root__ / \ user oom /\ / \ AB a b where user/memory.limit_in_bytes == [amount of present RAM] + oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be problematic, including Johannes' buggy bypass for charges in memcgs with pending memcgs that has since been fixed after I identified it. This bypass is included. Processes attached to a and b are userspace oom handlers for processes attached to A and B, respectively. The amount of memory you're talking about is proportional to the number of processes that have pending SIGKILLs (and now those with PF_EXITING set), the former of which is obviously more concerning since they could be charging memory at any point in the kernel that would succeed. The latter is concerning only if future memory allocation post-PF_EXITING would be become significant and nobody is going to think about oom memcg bypasses in such a case. To use the configuration suggested above, we need to prevent as many bypasses as possible to the root memcg. Otherwise, the memory protected for the oom memcg from processes constrained by the limit of user is no longer protected. This isn't only a problem with the bypasses here in the charging path, but also unaccounted kernel memory, for example. For this to be usable, we need to ensure that the limit of the oom memcg is protected for the userspace oom handlers that are attached. With a charge bypassed to the root memcg greater than or equal to the limit of the oom memcg OR cumulative charges bypassed to the root memcg greater than or equal to the limit of the oom memcg by processes with pending SIGKILLs, userspace oom handlers cannot respond. That's particuarly dangerous without a memcg oom kill delay, as proposed before, since userspace must disable oom killing entirely for both A and B for userspace notification to be meaningful, since all processes are now livelocked. These bypasses should be given to one thread and one thread only, which would be the oom killed thread if it needs access to memory reserves to either allocate memory or charge memory. There is no way to determine whether a task has been killed due to user space OOM killer or by a regular kill. I'm referring to only granting TIF_MEMDIE to a single process in any memcg hierarchy at or below the memcg that has encountered its limit to avoid granting it to many processes and bypassing their charges to the root memcg; the same variation of the above code, but going through the memcg oom killer to get TIF_MEMDIE first. We must be vigilant and only grant TIF_MEMDIE for the process that shall exit. If you are suggesting we use the user and oom top-level memcg hierarchy for allowing memory to be available for userspace system oom handlers, then this has become important when in the past it may have been a minor point. I am not sure it would be _that_ important and if that really becomes to be the case then we should deal with it. So far I haven't see any evidence there is a lot of memory charged on the exit path. I'm debating both fatal_signal_pending() and PF_EXITING here since they are now both bypasses, we need to remove fatal_signal_pending(). My simple question with your patch: how do you guarantee memory to processes attached to a and b? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Fri, 10 Jan 2014, Michal Hocko wrote: > > > --- > > > mm/memcontrol.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index b8dfed1b9d87..b86fbb04b7c6 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct > > > *mm, > > >* MEMDIE process. > > >*/ > > > if (unlikely(test_thread_flag(TIF_MEMDIE) > > > - || fatal_signal_pending(current))) > > > + || fatal_signal_pending(current)) > > > + || current->flags & PF_EXITING) > > > goto bypass; > > > > > > if (unlikely(task_in_memcg_oom(current))) > > > > This would become problematic if significant amount of memory is charged > > in the exit() path. > > But this would hurt also for fatal_signal_pending tasks, wouldn't it? Yes, and as I've said twice now, that should be removed. These bypasses should be given to one thread and one thread only, which would be the oom killed thread if it needs access to memory reserves to either allocate memory or charge memory. If you are suggesting we use the "user" and "oom" top-level memcg hierarchy for allowing memory to be available for userspace system oom handlers, then this has become important when in the past it may have been a minor point. > Besides that I do not see any source of allocation after exit_signals. > That's fine for today but may not be in the future. If memory allocation is done after PF_EXITING in the future, are people going to check memcg bypasses? No. And now we have additional memory bypass to root that will cause our userspace system oom hanlders to be oom themselves with the suggested configuration. Using the "user" and "oom" top-level memcg hierarchy is a double edged sword, we must attempt to prevent all of these bypasses as much as possible. The only relevant bypass here is for TIF_MEMDIE which would be set if necessary for the one thread that needs it. > > I don't know of an egregious amount of memory being > > allocated and charged after PF_EXITING is set, but if it happens in the > > future then this could potentially cause system oom conditions even in > > memcg configurations > > Even if that happens then the global OOM killer would give the exiting > task access to memory reserves and wouldn't kill anything else. > > So I am not sure what problem do you see exactly. > Userspace system oom handlers being able to handle memcg oom conditions in the top-level "user" memcg as proposed by Tejun. If the global oom killer becomes a part of that discussion at all, then the userspace system oom handler never got a chance to handle the "user" oom. > Besides that allocating egregious amount of memory after exit_signals > sounds fundamentally broken to me. > Egregious could be defined as allocating a few bytes multiplied by thousands of threads in PF_EXITING. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Thu 09-01-14 13:40:10, David Rientjes wrote: > On Thu, 9 Jan 2014, Michal Hocko wrote: > > > Eric has reported that he can see task(s) stuck in memcg OOM handler > > regularly. The only way out is to > > echo 0 > $GROUP/memory.oom_controll > > His usecase is: > > - Setup a hierarchy with memory and the freezer > > (disable kernel oom and have a process watch for oom). > > - In that memory cgroup add a process with one thread per cpu. > > - In one thread slowly allocate once per second I think it is 16M of ram > > and mlock and dirty it (just to force the pages into ram and stay there). > > - When oom is achieved loop: > > * attempt to freeze all of the tasks. > > * if frozen send every task SIGKILL, unfreeze, remove the directory in > > cgroupfs. > > > > Eric has then pinpointed the issue to be memcg specific. > > > > All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. > > Those that have received fatal signal will bypass the charge and should > > continue on their way out. The tricky part is that the exit path might > > trigger a page fault (e.g. exit_robust_list), thus the memcg charge, > > while its memcg is still under OOM because nobody has released any > > charges yet. > > Unlike with the in-kernel OOM handler the exiting task doesn't get > > TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task > > and falls to the memcg OOM again without any way out of it as there are > > no fatal signals pending anymore. > > > > This patch fixes the issue by checking PF_EXITING early in > > __mem_cgroup_try_charge and bypass the charge same as if it had fatal > > signal pending or TIF_MEMDIE set. > > > > Normally exiting tasks (aka not killed) will bypass the charge now but > > this should be OK as the task is leaving and will release memory and > > increasing the memory pressure just to release it in a moment seems > > dubious wasting of cycles. Besides that charges after exit_signals > > should be rare. > > > > Reported-by: Eric W. Biederman > > Signed-off-by: Michal Hocko > > Is this tested? By Eric? No AFAIK. I wasn't able to reproduce the issue myself. > > --- > > mm/memcontrol.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index b8dfed1b9d87..b86fbb04b7c6 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct > > *mm, > > * MEMDIE process. > > */ > > if (unlikely(test_thread_flag(TIF_MEMDIE) > > -|| fatal_signal_pending(current))) > > +|| fatal_signal_pending(current)) > > +|| current->flags & PF_EXITING) > > goto bypass; > > > > if (unlikely(task_in_memcg_oom(current))) > > This would become problematic if significant amount of memory is charged > in the exit() path. But this would hurt also for fatal_signal_pending tasks, wouldn't it? Besides that I do not see any source of allocation after exit_signals. > I don't know of an egregious amount of memory being > allocated and charged after PF_EXITING is set, but if it happens in the > future then this could potentially cause system oom conditions even in > memcg configurations Even if that happens then the global OOM killer would give the exiting task access to memory reserves and wouldn't kill anything else. So I am not sure what problem do you see exactly. Besides that allocating egregious amount of memory after exit_signals sounds fundamentally broken to me. > that are designed such as the one Tejun suggested to > be able to handle such conditions in userspace: > >___root___ > / \ > useroom > / \/ \ > A BC D > > where the limit of user is equal to the amount of system memory minus > whatever amount of memory is needed by the system oom handler attached as > a descendant of oom and still allows the limits of A + B to exceed the > limit of user. > > So how do we ensure that memory allocations in the exit() path don't cause > system oom conditions whereas the above configuration no longer provides > any strict guarantee? > > Thanks. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Thu 09-01-14 13:40:10, David Rientjes wrote: On Thu, 9 Jan 2014, Michal Hocko wrote: Eric has reported that he can see task(s) stuck in memcg OOM handler regularly. The only way out is to echo 0 $GROUP/memory.oom_controll His usecase is: - Setup a hierarchy with memory and the freezer (disable kernel oom and have a process watch for oom). - In that memory cgroup add a process with one thread per cpu. - In one thread slowly allocate once per second I think it is 16M of ram and mlock and dirty it (just to force the pages into ram and stay there). - When oom is achieved loop: * attempt to freeze all of the tasks. * if frozen send every task SIGKILL, unfreeze, remove the directory in cgroupfs. Eric has then pinpointed the issue to be memcg specific. All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. Those that have received fatal signal will bypass the charge and should continue on their way out. The tricky part is that the exit path might trigger a page fault (e.g. exit_robust_list), thus the memcg charge, while its memcg is still under OOM because nobody has released any charges yet. Unlike with the in-kernel OOM handler the exiting task doesn't get TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task and falls to the memcg OOM again without any way out of it as there are no fatal signals pending anymore. This patch fixes the issue by checking PF_EXITING early in __mem_cgroup_try_charge and bypass the charge same as if it had fatal signal pending or TIF_MEMDIE set. Normally exiting tasks (aka not killed) will bypass the charge now but this should be OK as the task is leaving and will release memory and increasing the memory pressure just to release it in a moment seems dubious wasting of cycles. Besides that charges after exit_signals should be rare. Reported-by: Eric W. Biederman ebied...@xmission.com Signed-off-by: Michal Hocko mho...@suse.cz Is this tested? By Eric? No AFAIK. I wasn't able to reproduce the issue myself. --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8dfed1b9d87..b86fbb04b7c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, * MEMDIE process. */ if (unlikely(test_thread_flag(TIF_MEMDIE) -|| fatal_signal_pending(current))) +|| fatal_signal_pending(current)) +|| current-flags PF_EXITING) goto bypass; if (unlikely(task_in_memcg_oom(current))) This would become problematic if significant amount of memory is charged in the exit() path. But this would hurt also for fatal_signal_pending tasks, wouldn't it? Besides that I do not see any source of allocation after exit_signals. I don't know of an egregious amount of memory being allocated and charged after PF_EXITING is set, but if it happens in the future then this could potentially cause system oom conditions even in memcg configurations Even if that happens then the global OOM killer would give the exiting task access to memory reserves and wouldn't kill anything else. So I am not sure what problem do you see exactly. Besides that allocating egregious amount of memory after exit_signals sounds fundamentally broken to me. that are designed such as the one Tejun suggested to be able to handle such conditions in userspace: ___root___ / \ useroom / \/ \ A BC D where the limit of user is equal to the amount of system memory minus whatever amount of memory is needed by the system oom handler attached as a descendant of oom and still allows the limits of A + B to exceed the limit of user. So how do we ensure that memory allocations in the exit() path don't cause system oom conditions whereas the above configuration no longer provides any strict guarantee? Thanks. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Fri, 10 Jan 2014, Michal Hocko wrote: --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8dfed1b9d87..b86fbb04b7c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, * MEMDIE process. */ if (unlikely(test_thread_flag(TIF_MEMDIE) - || fatal_signal_pending(current))) + || fatal_signal_pending(current)) + || current-flags PF_EXITING) goto bypass; if (unlikely(task_in_memcg_oom(current))) This would become problematic if significant amount of memory is charged in the exit() path. But this would hurt also for fatal_signal_pending tasks, wouldn't it? Yes, and as I've said twice now, that should be removed. These bypasses should be given to one thread and one thread only, which would be the oom killed thread if it needs access to memory reserves to either allocate memory or charge memory. If you are suggesting we use the user and oom top-level memcg hierarchy for allowing memory to be available for userspace system oom handlers, then this has become important when in the past it may have been a minor point. Besides that I do not see any source of allocation after exit_signals. That's fine for today but may not be in the future. If memory allocation is done after PF_EXITING in the future, are people going to check memcg bypasses? No. And now we have additional memory bypass to root that will cause our userspace system oom hanlders to be oom themselves with the suggested configuration. Using the user and oom top-level memcg hierarchy is a double edged sword, we must attempt to prevent all of these bypasses as much as possible. The only relevant bypass here is for TIF_MEMDIE which would be set if necessary for the one thread that needs it. I don't know of an egregious amount of memory being allocated and charged after PF_EXITING is set, but if it happens in the future then this could potentially cause system oom conditions even in memcg configurations Even if that happens then the global OOM killer would give the exiting task access to memory reserves and wouldn't kill anything else. So I am not sure what problem do you see exactly. Userspace system oom handlers being able to handle memcg oom conditions in the top-level user memcg as proposed by Tejun. If the global oom killer becomes a part of that discussion at all, then the userspace system oom handler never got a chance to handle the user oom. Besides that allocating egregious amount of memory after exit_signals sounds fundamentally broken to me. Egregious could be defined as allocating a few bytes multiplied by thousands of threads in PF_EXITING. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Thu, 9 Jan 2014, Michal Hocko wrote: > Eric has reported that he can see task(s) stuck in memcg OOM handler > regularly. The only way out is to > echo 0 > $GROUP/memory.oom_controll > His usecase is: > - Setup a hierarchy with memory and the freezer > (disable kernel oom and have a process watch for oom). > - In that memory cgroup add a process with one thread per cpu. > - In one thread slowly allocate once per second I think it is 16M of ram > and mlock and dirty it (just to force the pages into ram and stay there). > - When oom is achieved loop: > * attempt to freeze all of the tasks. > * if frozen send every task SIGKILL, unfreeze, remove the directory in > cgroupfs. > > Eric has then pinpointed the issue to be memcg specific. > > All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. > Those that have received fatal signal will bypass the charge and should > continue on their way out. The tricky part is that the exit path might > trigger a page fault (e.g. exit_robust_list), thus the memcg charge, > while its memcg is still under OOM because nobody has released any > charges yet. > Unlike with the in-kernel OOM handler the exiting task doesn't get > TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task > and falls to the memcg OOM again without any way out of it as there are > no fatal signals pending anymore. > > This patch fixes the issue by checking PF_EXITING early in > __mem_cgroup_try_charge and bypass the charge same as if it had fatal > signal pending or TIF_MEMDIE set. > > Normally exiting tasks (aka not killed) will bypass the charge now but > this should be OK as the task is leaving and will release memory and > increasing the memory pressure just to release it in a moment seems > dubious wasting of cycles. Besides that charges after exit_signals > should be rare. > > Reported-by: Eric W. Biederman > Signed-off-by: Michal Hocko Is this tested? > --- > mm/memcontrol.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b8dfed1b9d87..b86fbb04b7c6 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, >* MEMDIE process. >*/ > if (unlikely(test_thread_flag(TIF_MEMDIE) > - || fatal_signal_pending(current))) > + || fatal_signal_pending(current)) > + || current->flags & PF_EXITING) > goto bypass; > > if (unlikely(task_in_memcg_oom(current))) This would become problematic if significant amount of memory is charged in the exit() path. I don't know of an egregious amount of memory being allocated and charged after PF_EXITING is set, but if it happens in the future then this could potentially cause system oom conditions even in memcg configurations that are designed such as the one Tejun suggested to be able to handle such conditions in userspace: ___root___ / \ useroom / \/ \ A BC D where the limit of user is equal to the amount of system memory minus whatever amount of memory is needed by the system oom handler attached as a descendant of oom and still allows the limits of A + B to exceed the limit of user. So how do we ensure that memory allocations in the exit() path don't cause system oom conditions whereas the above configuration no longer provides any strict guarantee? Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Wed 08-01-14 11:33:19, Michal Hocko wrote: > On Tue 07-01-14 16:25:03, Andrew Morton wrote: > [...] > > > OK, so can we at least agree on the patch posted here: > > > https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely > > > worth fixing. > > > > Yes, can we please get Eric's bug fixed? I don't believe that Eric has > > tested either https://lkml.org/lkml/2013/12/12/129 or > > http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch. > > Is he the only person who can reproduce this? > > I have gathered 3 patches from all the discussion and plan to post them > today or tomorrow as the time permits. https://lkml.org/lkml/2013/12/12/129 > will be a part of it. OK, I've decided to post the oom notification parts later because they will likely generate some discussion which might distract from the actual fix so here it goes (can be applied on both mmotm and the current Linus' tree): --- >From 6185aaf3cd429b1ecf2d11d424e29b597feb7811 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 12 Dec 2013 11:37:27 +0100 Subject: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM Eric has reported that he can see task(s) stuck in memcg OOM handler regularly. The only way out is to echo 0 > $GROUP/memory.oom_controll His usecase is: - Setup a hierarchy with memory and the freezer (disable kernel oom and have a process watch for oom). - In that memory cgroup add a process with one thread per cpu. - In one thread slowly allocate once per second I think it is 16M of ram and mlock and dirty it (just to force the pages into ram and stay there). - When oom is achieved loop: * attempt to freeze all of the tasks. * if frozen send every task SIGKILL, unfreeze, remove the directory in cgroupfs. Eric has then pinpointed the issue to be memcg specific. All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. Those that have received fatal signal will bypass the charge and should continue on their way out. The tricky part is that the exit path might trigger a page fault (e.g. exit_robust_list), thus the memcg charge, while its memcg is still under OOM because nobody has released any charges yet. Unlike with the in-kernel OOM handler the exiting task doesn't get TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task and falls to the memcg OOM again without any way out of it as there are no fatal signals pending anymore. This patch fixes the issue by checking PF_EXITING early in __mem_cgroup_try_charge and bypass the charge same as if it had fatal signal pending or TIF_MEMDIE set. Normally exiting tasks (aka not killed) will bypass the charge now but this should be OK as the task is leaving and will release memory and increasing the memory pressure just to release it in a moment seems dubious wasting of cycles. Besides that charges after exit_signals should be rare. Reported-by: Eric W. Biederman Signed-off-by: Michal Hocko --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8dfed1b9d87..b86fbb04b7c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, * MEMDIE process. */ if (unlikely(test_thread_flag(TIF_MEMDIE) -|| fatal_signal_pending(current))) +|| fatal_signal_pending(current)) +|| current->flags & PF_EXITING) goto bypass; if (unlikely(task_in_memcg_oom(current))) -- 1.8.5.2 -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Wed 08-01-14 11:33:19, Michal Hocko wrote: On Tue 07-01-14 16:25:03, Andrew Morton wrote: [...] OK, so can we at least agree on the patch posted here: https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely worth fixing. Yes, can we please get Eric's bug fixed? I don't believe that Eric has tested either https://lkml.org/lkml/2013/12/12/129 or http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch. Is he the only person who can reproduce this? I have gathered 3 patches from all the discussion and plan to post them today or tomorrow as the time permits. https://lkml.org/lkml/2013/12/12/129 will be a part of it. OK, I've decided to post the oom notification parts later because they will likely generate some discussion which might distract from the actual fix so here it goes (can be applied on both mmotm and the current Linus' tree): --- From 6185aaf3cd429b1ecf2d11d424e29b597feb7811 Mon Sep 17 00:00:00 2001 From: Michal Hocko mho...@suse.cz Date: Thu, 12 Dec 2013 11:37:27 +0100 Subject: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM Eric has reported that he can see task(s) stuck in memcg OOM handler regularly. The only way out is to echo 0 $GROUP/memory.oom_controll His usecase is: - Setup a hierarchy with memory and the freezer (disable kernel oom and have a process watch for oom). - In that memory cgroup add a process with one thread per cpu. - In one thread slowly allocate once per second I think it is 16M of ram and mlock and dirty it (just to force the pages into ram and stay there). - When oom is achieved loop: * attempt to freeze all of the tasks. * if frozen send every task SIGKILL, unfreeze, remove the directory in cgroupfs. Eric has then pinpointed the issue to be memcg specific. All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. Those that have received fatal signal will bypass the charge and should continue on their way out. The tricky part is that the exit path might trigger a page fault (e.g. exit_robust_list), thus the memcg charge, while its memcg is still under OOM because nobody has released any charges yet. Unlike with the in-kernel OOM handler the exiting task doesn't get TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task and falls to the memcg OOM again without any way out of it as there are no fatal signals pending anymore. This patch fixes the issue by checking PF_EXITING early in __mem_cgroup_try_charge and bypass the charge same as if it had fatal signal pending or TIF_MEMDIE set. Normally exiting tasks (aka not killed) will bypass the charge now but this should be OK as the task is leaving and will release memory and increasing the memory pressure just to release it in a moment seems dubious wasting of cycles. Besides that charges after exit_signals should be rare. Reported-by: Eric W. Biederman ebied...@xmission.com Signed-off-by: Michal Hocko mho...@suse.cz --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8dfed1b9d87..b86fbb04b7c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, * MEMDIE process. */ if (unlikely(test_thread_flag(TIF_MEMDIE) -|| fatal_signal_pending(current))) +|| fatal_signal_pending(current)) +|| current-flags PF_EXITING) goto bypass; if (unlikely(task_in_memcg_oom(current))) -- 1.8.5.2 -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
On Thu, 9 Jan 2014, Michal Hocko wrote: Eric has reported that he can see task(s) stuck in memcg OOM handler regularly. The only way out is to echo 0 $GROUP/memory.oom_controll His usecase is: - Setup a hierarchy with memory and the freezer (disable kernel oom and have a process watch for oom). - In that memory cgroup add a process with one thread per cpu. - In one thread slowly allocate once per second I think it is 16M of ram and mlock and dirty it (just to force the pages into ram and stay there). - When oom is achieved loop: * attempt to freeze all of the tasks. * if frozen send every task SIGKILL, unfreeze, remove the directory in cgroupfs. Eric has then pinpointed the issue to be memcg specific. All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. Those that have received fatal signal will bypass the charge and should continue on their way out. The tricky part is that the exit path might trigger a page fault (e.g. exit_robust_list), thus the memcg charge, while its memcg is still under OOM because nobody has released any charges yet. Unlike with the in-kernel OOM handler the exiting task doesn't get TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task and falls to the memcg OOM again without any way out of it as there are no fatal signals pending anymore. This patch fixes the issue by checking PF_EXITING early in __mem_cgroup_try_charge and bypass the charge same as if it had fatal signal pending or TIF_MEMDIE set. Normally exiting tasks (aka not killed) will bypass the charge now but this should be OK as the task is leaving and will release memory and increasing the memory pressure just to release it in a moment seems dubious wasting of cycles. Besides that charges after exit_signals should be rare. Reported-by: Eric W. Biederman ebied...@xmission.com Signed-off-by: Michal Hocko mho...@suse.cz Is this tested? --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8dfed1b9d87..b86fbb04b7c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, * MEMDIE process. */ if (unlikely(test_thread_flag(TIF_MEMDIE) - || fatal_signal_pending(current))) + || fatal_signal_pending(current)) + || current-flags PF_EXITING) goto bypass; if (unlikely(task_in_memcg_oom(current))) This would become problematic if significant amount of memory is charged in the exit() path. I don't know of an egregious amount of memory being allocated and charged after PF_EXITING is set, but if it happens in the future then this could potentially cause system oom conditions even in memcg configurations that are designed such as the one Tejun suggested to be able to handle such conditions in userspace: ___root___ / \ useroom / \/ \ A BC D where the limit of user is equal to the amount of system memory minus whatever amount of memory is needed by the system oom handler attached as a descendant of oom and still allows the limits of A + B to exceed the limit of user. So how do we ensure that memory allocations in the exit() path don't cause system oom conditions whereas the above configuration no longer provides any strict guarantee? Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/