Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Fri, 13 Oct 2017, Roman Gushchin wrote: > > Think about it in a different way: we currently compare per-process usage > > and userspace has /proc/pid/oom_score_adj to adjust that usage depending > > on priorities of that process and still oom kill if there's a memory leak. > > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer > > after all. We don't need a strict memory.oom_priority that outranks all > > other sibling cgroups regardless of usage. We need a memory.oom_score_adj > > to adjust the per-cgroup usage. The decision-making in your earlier > > example would be under the control of C/memory.oom_score_adj and > > D/memory.oom_score_adj. Problem solved. > > > > It also solves the problem of userspace being able to influence oom victim > > selection so now they can protect important cgroups just like we can > > protect important processes today. > > > > And since this would be hierarchical usage, you can trivially infer root > > mem cgroup usage by subtraction of top-level mem cgroup usage. > > > > This is a powerful solution to the problem and gives userspace the control > > they need so that it can work in all usecases, not a subset of usecases. > > You're right that per-cgroup oom_score_adj may resolve the issue with > too strict semantics of oom_priorities. But I believe nobody likes > the existing per-process oom_score_adj interface, and there are reasons > behind that. The previous heuristic, before I rewrote the oom killer, used /proc/pid/oom_adj, which acted as a bitshift on mm->total_vm and was a much more difficult interface to use, as I'm sure you can imagine. People ended up only using it to polarize selection: -17 to disable oom killing for a process entirely, -16 to bias against it, or +15 to prefer it. Nobody used anything in between, and I worked with openssh, udev, kde, and chromium to get a consensus on the oom_score_adj semantics. 
People do use it to protect against memory leaks and to prevent oom killing important processes when something else can be sacrificed, unless there's a leak. > Especially in the case of memcg OOM, understanding how exactly oom_score_adj > will work is not trivial. I suggest defining it in the terms used for previous iterations of the patchset: do hierarchical scoring so that each level of the hierarchy has usage information for each subtree. You can get root mem cgroup usage with complete fairness by subtraction with this method. When comparing usage at each level of the hierarchy, you can propagate the eligibility of processes in that subtree much like you do today. I agree with your change to make the oom killer a no-op if selection races with the actual killing rather than falling back to the old heuristic. I'm happy to help add a Tested-by once we settle the other issues with that change. At each level, I would state that memory.oom_score_adj has the exact same semantics as /proc/pid/oom_score_adj. In this case, it would simply be defined as a proportion of the parent's limit. If the hierarchy is iterated starting at the root mem cgroup for system ooms and at the root of the oom memcg for memcg ooms, this should lead to the exact same oom killing behavior, which is desired. This solution would address the three concerns that I had: it allows the root mem cgroup to be compared fairly with leaf mem cgroups (with the bonus of not requiring root mem cgroup accounting thanks to your heuristic using global vmstats), it allows userspace to influence the decision-making so that users can protect cgroups that use 50% of memory because they are important, and it completely avoids users being able to change victim selection simply by creating child mem cgroups. This would be a very powerful patchset. 
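The semantics proposed above, a per-cgroup memory.oom_score_adj applied as a proportion of the parent's limit during level-by-level iteration, could look roughly like the following toy model. This is purely illustrative: the Cgroup class, the field names, the numbers, and the /1000 scaling are assumptions for the sketch, not code from the patchset.

```python
# Toy model of level-by-level oom victim selection with a per-cgroup
# memory.oom_score_adj in [-1000, 1000], applied as a proportion of the
# parent's limit, mirroring how /proc/pid/oom_score_adj biases a
# process's score by a proportion of total memory.

class Cgroup:
    def __init__(self, name, usage=0, limit=0, oom_score_adj=0, children=()):
        self.name = name
        self.usage = usage                  # hierarchical usage, in MB
        self.limit = limit                  # limit imposed by the parent, in MB
        self.oom_score_adj = oom_score_adj  # -1000..1000, like the per-process knob
        self.children = list(children)

def score(cg):
    # Bias usage by oom_score_adj as a proportion of the parent's limit.
    return cg.usage + cg.oom_score_adj * cg.limit // 1000

def select_victim_cgroup(start):
    # Iterate from the oom root (system root for global ooms, the oom
    # memcg for memcg ooms), descending into the worst child at each level.
    cg = start
    while cg.children:
        cg = max(cg.children, key=score)
    return cg

# C uses more memory, but D is biased upward by its adjustment:
root = Cgroup("root", children=[
    Cgroup("C", usage=60, limit=100),
    Cgroup("D", usage=50, limit=100, oom_score_adj=200),  # +20 bias
])
print(select_victim_cgroup(root).name)  # D: 50 + 20 = 70 beats C's 60
```

The point of the proportional definition is that an adjustment keeps the same meaning at every level of the hierarchy, just as the per-process knob keeps the same meaning for global and memcg ooms.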
-- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Thu, Oct 12, 2017 at 02:50:38PM -0700, David Rientjes wrote: > On Wed, 11 Oct 2017, Roman Gushchin wrote: > > Think about it in a different way: we currently compare per-process usage > and userspace has /proc/pid/oom_score_adj to adjust that usage depending > on priorities of that process and still oom kill if there's a memory leak. > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer > after all. We don't need a strict memory.oom_priority that outranks all > other sibling cgroups regardless of usage. We need a memory.oom_score_adj > to adjust the per-cgroup usage. The decision-making in your earlier > example would be under the control of C/memory.oom_score_adj and > D/memory.oom_score_adj. Problem solved. > > It also solves the problem of userspace being able to influence oom victim > selection so now they can protect important cgroups just like we can > protect important processes today. > > And since this would be hierarchical usage, you can trivially infer root > mem cgroup usage by subtraction of top-level mem cgroup usage. > > This is a powerful solution to the problem and gives userspace the control > they need so that it can work in all usecases, not a subset of usecases. You're right that per-cgroup oom_score_adj may resolve the issue with too strict semantics of oom_priorities. But I believe nobody likes the existing per-process oom_score_adj interface, and there are reasons behind that. Especially in the case of memcg OOM, understanding how exactly oom_score_adj will work is not trivial. For example, earlier in this thread I've shown a case where the decision of which of two processes should be killed depends on whether it's a global or memcg-wide oom, despite both belonging to a single cgroup! Of course, it's technically trivial to implement some analog of oom_score_adj for cgroups (and early versions of this patchset did that). But the right question is: is this an interface we want to support for many years to come? I'm not sure. 
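The non-obvious semantics Roman points to come from the oom_score_adj bias being scaled by "totalpages", which is the machine size for a global OOM but the memcg limit for a memcg OOM. A simplified sketch (not the kernel's exact oom_badness(), and all numbers are invented) shows how the same adjustment can flip the victim between the two cases:

```python
# The oom_score_adj bias is proportional to "totalpages": the whole
# machine for a global OOM, the memcg limit for a memcg OOM. The same
# adjustment therefore shrinks dramatically under a memcg OOM.

def badness(rss_pages, oom_score_adj, totalpages):
    return rss_pages + oom_score_adj * totalpages // 1000

MACHINE_PAGES = 4 * 1024 * 1024   # a 16GB machine with 4KB pages
MEMCG_PAGES = 128 * 1024          # a 512MB memcg limit

# Two processes in the same cgroup: A is larger but deprioritized.
A = {"rss": 300_000, "adj": -500}
B = {"rss": 200_000, "adj": 0}

def victim(totalpages):
    scores = {name: badness(p["rss"], p["adj"], totalpages)
              for name, p in {"A": A, "B": B}.items()}
    return max(scores, key=scores.get)

print(victim(MACHINE_PAGES))  # B: A's -500 bias outweighs its larger rss
print(victim(MEMCG_PAGES))    # A: the same bias is now only ~65k pages
```

So a single adj value that reliably protects a process from global oom kills may fail to protect it inside a small memcg, which is exactly the kind of surprise that makes the interface hard to reason about.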
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Wed, 11 Oct 2017, Roman Gushchin wrote: > > But let's move the discussion forward to fix it. To avoid necessarily > > accounting memory to the root mem cgroup, have we considered if it is even > > necessary to address the root mem cgroup? For the users who opt-in to > > this heuristic, would it be possible to discount the root mem cgroup from > > the heuristic entirely so that oom kills originate from leaf mem cgroups? > > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if > > the victim rss is greater than an eligible victim rss attached to the root > > mem cgroup? > > David, I'm not claiming to implement the best possible accounting > for the root memory cgroup, and I'm sure there is a place for further > enhancement. But if it's not leading to some obviously stupid victim > selection (like ignoring a leaking task which consumes most of the memory), > I don't see why it should be treated as a blocker for the whole patchset. > I also doubt that any of us has these examples, and the best way to get > them is to get some real usage feedback. > > Ignoring oom_score_adj, subtracting the leaf usage sum from system usage, etc., > these are all good ideas which can be implemented on top of this patchset. > For the root mem cgroup to be compared to leaf mem cgroups, it needs a fair comparison, not something that we leave to some future patches on top of this patchset. We can't compare some cgroups with other cgroups based on different criteria depending on which cgroup is involved. It's actually quite a trivial problem to address; it was a small modification to your hierarchical usage patchset, if that's the way that you elect to fix it. I know that some of our customers use cgroups only for one or two jobs on the system, and that isn't necessarily just for memory limitation. 
The fact remains that, without considering the root mem cgroup fairly, these customers are unfairly biased against because they have aggregated their processes in a cgroup. This is not a highly specialized usecase; I am positive that many users use cgroups only for a subset of processes. This heuristic penalizes that behavior by preferring them as oom victims. The problem needs to be fixed instead of asking for the patchset to be merged and hoping that we'll address these issues later. If you account for hierarchical usage, you can easily subtract this from global vmstats to get an implicit root usage. > > You would be required to discount oom_score_adj because the heuristic > > doesn't account for oom_score_adj when comparing the anon + unevictable + > > unreclaimable slab of leaf mem cgroups. This wouldn't result in the > > correct victim selection in real-world scenarios where processes attached > > to the root mem cgroup are vital to the system and not part of any user > > job, i.e. they are important system daemons and the "activity manager" > > responsible for orchestrating the cgroup hierarchy. > > > > It's also still unfair because it now compares > > [sum of rss of processes attached to a cgroup] to > > [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't > > going to be a solution, regardless of whether it's one process or all processes, if > > it's being compared to more types of memory in leaf cgroups. > > > > If we really don't want root mem cgroup accounting so this is a fair > > comparison, I think the heuristic needs to special case the root mem > > cgroup either by discounting root oom kills if there are eligible oom > > kills from leaf cgroups (the user would be opting-in to this behavior) or > > comparing the badness of a victim from a leaf cgroup to the badness of a > > victim from the root cgroup when deciding which to kill and allow the user > > to protect root mem cgroup processes with oom_score_adj. 
> > > > That aside, all of this has only addressed one of the three concerns with > > the patchset. > > > > I believe the solution to avoid allowing users to circumvent oom kill is > > to account usage up the hierarchy as you have done in the past. Cgroup > > hierarchies can be managed by the user so they can create their own > > subcontainers; this is nothing new, and I would hope that you wouldn't > > limit your feature to only a very specific set of usecases. That may be > > your solution for the root mem cgroup itself: if the hierarchical usage of > > all top-level mem cgroups is known, it's possible to find the root mem > > cgroup usage by subtraction, since you are using stats that are global vmstats > > in your heuristic. > > > > Accounting usage up the hierarchy avoids the first two concerns with the > > patchset. It allows you to implicitly understand the usage of the root > > mem cgroup itself, and does not allow users to circumvent oom kill by > > creating subcontainers, either purposefully or not. The third concern, > > userspace influence, can allow users to attack leaf mem cgroups deeper in > > the tree if
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Wed 11-10-17 13:27:44, David Rientjes wrote: > On Wed, 11 Oct 2017, Michal Hocko wrote: > > > > For these reasons: unfair comparison of root mem cgroup usage to bias > > > against that mem cgroup from oom kill in system oom conditions, the > > > ability of users to completely evade the oom killer by attaching all > > > processes to child cgroups either purposefully or unpurposefully, and the > > > inability of userspace to effectively control oom victim selection: > > > > > > Nacked-by: David Rientjes > > > > I consider this NACK rather dubious. Evading the heuristic as you > > describe requires root privileges in the default configuration because > > normal users are not allowed to create subtrees. If you > > really want to delegate a subtree to an untrusted entity then you do not > > have to opt in to this oom strategy. We can work on additional means > > which would allow covering those as well (e.g. a priority-based one, which > > is requested for other usecases). > > > > You're missing the point that the user is trusted and it may be doing > something to circumvent oom kill unknowingly. I would really like to see a practical example of something like that. I am not saying this is completely impossible, but as already pointed out, this _can_ be addressed _on top_ of the current implementation. We will need some way to consider hierarchies anyway. So I really fail to see why this would be a blocker. After all, it is no different than skipping oom selection by splitting a process (knowingly or otherwise) into subprocesses, which is possible even now. OOM killer selection has never been, will not be, and cannot be perfect in principle. Quite the contrary: the more clever the heuristics try to be, the more corner cases they might generate, as we have seen in the past. > With a single unified > hierarchy, the user is forced to attach its processes to subcontainers if > it wants to constrain resources with other controllers. 
Doing so ends up > completely avoiding oom kill because of this implementation detail. It > has nothing to do with trust and the admin who is opting-in will not know > a user has circumvented oom kill purely because it constrains its processes > with controllers other than the memory controller. > > > A similar argument applies to the root memcg evaluation. While the > > proposed behavior is not optimal it would work for the general usecase > > described here where the root memcg doesn't really run any large number > > of tasks. If somebody explicitly opts in to the new strategy and it > > doesn't work well for their usecase, we can enhance the behavior. That > > alone is not a reason to nack the whole thing. > > > > I find it really disturbing that you keep nacking this approach just > > because it doesn't suit your specific usecase while it doesn't break > > it. Moreover it has been stated several times already that future > > improvements are possible and cover what you have described already. > > This has nothing to do with my specific usecase. Well, I might be really wrong but it is hard not to notice how most of your complaints push towards hierarchical level-by-level comparisons. Those have been considered and deemed unsuitable for the default cgroup-aware oom selection because they impose structural constraints on how the hierarchy is organized and thus disallow many usecases. So pushing for this just because it resembles your current in-house implementation leaves me with a feeling that you care more about your usecase than about general usability. -- Michal Hocko SUSE Labs
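The evasion being argued about here, whether by splitting a process or by splitting a cgroup, is easy to see in a toy model of leaf-only scoring. The cgroup names, sizes, and scoring are illustrative assumptions, not the patchset's actual code:

```python
# With leaf-only comparison, a cgroup that spreads its processes across
# per-process child cgroups replaces one large leaf score (the sum of
# its charges) with many small ones, shifting the kill to another job.

def leaf_scores(tree):
    """tree maps a cgroup name to its total usage in MB, or to a dict
    of per-child usages when its processes live in child cgroups."""
    scores = {}
    for name, v in tree.items():
        if isinstance(v, dict):
            for child, usage in v.items():
                scores[name + "/" + child] = usage
        else:
            scores[name] = v
    return scores

def victim(tree):
    s = leaf_scores(tree)
    return max(s, key=s.get)

before = {"jobA": 800, "jobB": 500}
# jobA moves each of its 8 processes (100MB each) into its own child:
after = {"jobA": {f"p{i}": 100 for i in range(8)}, "jobB": 500}

print(victim(before))  # jobA: 800 > 500
print(victim(after))   # jobB: no single jobA leaf exceeds 500
```

Whether this is a blocker or an acceptable corner case to fix on top is precisely the disagreement: hierarchical (subtree-sum) scoring makes the two configurations equivalent, while leaf-only scoring does not.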
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Wed, Oct 11, 2017 at 01:21:47PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > We don't need a better approximation, we need a fair comparison. The > > > heuristic that this patchset is implementing is based on the usage of > > > individual mem cgroups. For the root mem cgroup to be considered > > > eligible, we need to understand its usage. That usage is _not_ what is > > > implemented by this patchset, which is the largest rss of a single > > > attached process. This, in fact, is not an "approximation" at all. In > > > the example of 1 processes attached with 80MB rss each, the usage of > > > the root mem cgroup is _not_ 80MB. > > > > It's hard to imagine a "healthy" setup with 1 process in the root > > memory cgroup, and even if we kill 1 process we will still have > > remaining process. I agree with you at some point, but it's not > > a real world example. > > > > It's an example that illustrates the problem with the unfair comparison > between the root mem cgroup and leaf mem cgroups. It's unfair to compare > [largest rss of a single process attached to a cgroup] to > [anon + unevictable + unreclaimable slab usage of a cgroup]. It's not an > approximation, as previously stated: the usage of the root mem cgroup is > not 100MB if there are 10 such processes attached to the root mem cgroup, > it's off by orders of magnitude. > > For the root mem cgroup to be treated equally as a leaf mem cgroup as this > patchset proposes, it must have a fair comparison. That can be done by > accounting memory to the root mem cgroup in the same way it is to leaf mem > cgroups. > > But let's move the discussion forward to fix it. To avoid necessarily > accounting memory to the root mem cgroup, have we considered if it is even > necessary to address the root mem cgroup? 
For the users who opt-in to > this heuristic, would it be possible to discount the root mem cgroup from > the heuristic entirely so that oom kills originate from leaf mem cgroups? > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if > the victim rss is greater than an eligible victim rss attached to the root > mem cgroup? David, I'm not claiming to implement the best possible accounting for the root memory cgroup, and I'm sure there is a place for further enhancement. But if it's not leading to some obviously stupid victim selection (like ignoring a leaking task which consumes most of the memory), I don't see why it should be treated as a blocker for the whole patchset. I also doubt that any of us has these examples, and the best way to get them is to get some real usage feedback. Ignoring oom_score_adj, subtracting the leaf usage sum from system usage, etc., these are all good ideas which can be implemented on top of this patchset. > > > > For these reasons: unfair comparison of root mem cgroup usage to bias > > against that mem cgroup from oom kill in system oom conditions, the > > ability of users to completely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes > > > > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup, > > will it fix the problem? > > > > It might have some drawbacks as well (especially around oom_score_adj), > > but it's doable, if we'll ignore tasks which are not owners of their mm > > struct. > > > > You would be required to discount oom_score_adj because the heuristic > doesn't account for oom_score_adj when comparing the anon + unevictable + > unreclaimable slab of leaf mem cgroups. 
This wouldn't result in the > correct victim selection in real-world scenarios where processes attached > to the root mem cgroup are vital to the system and not part of any user > job, i.e. they are important system daemons and the "activity manager" > responsible for orchestrating the cgroup hierarchy. > > It's also still unfair because it now compares > [sum of rss of processes attached to a cgroup] to > [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't > going to be a solution, regardless of whether it's one process or all processes, if > it's being compared to more types of memory in leaf cgroups. > > If we really don't want root mem cgroup accounting so this is a fair > comparison, I think the heuristic needs to special case the root mem > cgroup either by discounting root oom kills if there are eligible oom > kills from leaf cgroups (the user would be opting-in to this behavior) or > comparing the badness of a victim from a leaf cgroup to the badness of a > victim from the root cgroup when deciding which to kill and allow the user > to protect root mem cgroup processes with oom_score_adj. > > That aside, all of this has only addressed one of the three concerns with
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Wed, 11 Oct 2017, Michal Hocko wrote: > > For these reasons: unfair comparison of root mem cgroup usage to bias > > against that mem cgroup from oom kill in system oom conditions, the > > ability of users to completely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes > > I consider this NACK rather dubious. Evading the heuristic as you > describe requires root privileges in the default configuration because > normal users are not allowed to create subtrees. If you > really want to delegate a subtree to an untrusted entity then you do not > have to opt in to this oom strategy. We can work on additional means > which would allow covering those as well (e.g. a priority-based one, which > is requested for other usecases). > You're missing the point that the user is trusted and it may be doing something to circumvent oom kill unknowingly. With a single unified hierarchy, the user is forced to attach its processes to subcontainers if it wants to constrain resources with other controllers. Doing so ends up completely avoiding oom kill because of this implementation detail. It has nothing to do with trust and the admin who is opting-in will not know a user has circumvented oom kill purely because it constrains its processes with controllers other than the memory controller. > A similar argument applies to the root memcg evaluation. While the > proposed behavior is not optimal it would work for the general usecase > described here where the root memcg doesn't really run any large number > of tasks. If somebody explicitly opts in to the new strategy and it > doesn't work well for their usecase, we can enhance the behavior. That > alone is not a reason to nack the whole thing. 
> > I find it really disturbing that you keep nacking this approach just > because it doesn't suit your specific usecase while it doesn't break > it. Moreover it has been stated several times already that future > improvements are possible and cover what you have described already. This has nothing to do with my specific usecase.
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Tue, 10 Oct 2017, Roman Gushchin wrote: > > We don't need a better approximation, we need a fair comparison. The > > heuristic that this patchset is implementing is based on the usage of > > individual mem cgroups. For the root mem cgroup to be considered > > eligible, we need to understand its usage. That usage is _not_ what is > > implemented by this patchset, which is the largest rss of a single > > attached process. This, in fact, is not an "approximation" at all. In > > the example of 100 processes attached with 80MB rss each, the usage of > > the root mem cgroup is _not_ 80MB. > > It's hard to imagine a "healthy" setup with 100 processes in the root > memory cgroup, and even if we kill 1 process we will still have > the remaining processes. I agree with you to some extent, but it's not > a real world example. > It's an example that illustrates the problem with the unfair comparison between the root mem cgroup and leaf mem cgroups. It's unfair to compare [largest rss of a single process attached to a cgroup] to [anon + unevictable + unreclaimable slab usage of a cgroup]. It's not an approximation, as previously stated: the usage of the root mem cgroup is not 80MB if there are 100 such processes attached to the root mem cgroup; it's off by orders of magnitude. For the root mem cgroup to be treated equally as a leaf mem cgroup as this patchset proposes, it must have a fair comparison. That can be done by accounting memory to the root mem cgroup in the same way it is to leaf mem cgroups. But let's move the discussion forward to fix it. To avoid necessarily accounting memory to the root mem cgroup, have we considered if it is even necessary to address the root mem cgroup? For the users who opt-in to this heuristic, would it be possible to discount the root mem cgroup from the heuristic entirely so that oom kills originate from leaf mem cgroups? 
Or, perhaps better, oom kill from non-memory.oom_group cgroups only if the victim rss is greater than an eligible victim rss attached to the root mem cgroup? > > For these reasons: unfair comparison of root mem cgroup usage to bias > > against that mem cgroup from oom kill in system oom conditions, the > > ability of users to completely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes > > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup, > will it fix the problem? > > It might have some drawbacks as well (especially around oom_score_adj), > but it's doable, if we'll ignore tasks which are not owners of their mm > struct. > > You would be required to discount oom_score_adj because the heuristic doesn't account for oom_score_adj when comparing the anon + unevictable + unreclaimable slab of leaf mem cgroups. This wouldn't result in the correct victim selection in real-world scenarios where processes attached to the root mem cgroup are vital to the system and not part of any user job, i.e. they are important system daemons and the "activity manager" responsible for orchestrating the cgroup hierarchy. It's also still unfair because it now compares [sum of rss of processes attached to a cgroup] to [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't going to be a solution, regardless of whether it's one process or all processes, if it's being compared to more types of memory in leaf cgroups. 
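The apples-vs-oranges comparison described above can be made concrete with the thread's running example. This is a toy calculation; the two scoring rules are paraphrased from this discussion, not taken from the patch itself:

```python
# Root memcg scored as the largest rss of one attached process; a leaf
# memcg scored as the aggregate charges of all its processes. Using the
# thread's example: 100 x 80MB processes in root vs 100 x 1MB in a leaf.

root_procs_rss_mb = [80] * 100
leaf_procs_rss_mb = [1] * 100

root_score = max(root_procs_rss_mb)   # 80: one process's rss
leaf_score = sum(leaf_procs_rss_mb)   # 100: all charges summed

chosen = "leaf" if leaf_score > root_score else "root"
print(chosen)                     # leaf, despite the root's far larger usage
print(sum(root_procs_rss_mb))     # 8000: the root's actual aggregate rss
```

Under a max-vs-sum comparison the leaf cgroup loses even though its aggregate usage is roughly 1/80th of the root's, which is the repeated-oom-kill scenario described earlier in the thread.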
If we really don't want root mem cgroup accounting so this is a fair comparison, I think the heuristic needs to special case the root mem cgroup either by discounting root oom kills if there are eligible oom kills from leaf cgroups (the user would be opting-in to this behavior) or comparing the badness of a victim from a leaf cgroup to the badness of a victim from the root cgroup when deciding which to kill and allow the user to protect root mem cgroup processes with oom_score_adj. That aside, all of this has only addressed one of the three concerns with the patchset. I believe the solution to avoid allowing users to circumvent oom kill is to account usage up the hierarchy as you have done in the past. Cgroup hierarchies can be managed by the user so they can create their own subcontainers; this is nothing new, and I would hope that you wouldn't limit your feature to only a very specific set of usecases. That may be your solution for the root mem cgroup itself: if the hierarchical usage of all top-level mem cgroups is known, it's possible to find the root mem cgroup usage by subtraction, since you are using stats that are global vmstats in your heuristic. Accounting usage up the hierarchy avoids the first two concerns with the patchset. It allows you to implicitly understand the usage of the root mem cgroup itself, and does not allow users to circumvent oom kill by creating
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > This seems to unfairly bias the root mem cgroup depending on process > > > size. > > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > > based on different criteria: the root mem cgroup as (mostly) the largest > > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > > unreclaimable slab pages charged to it by all processes. > > > > > > I imagine a configuration where the root mem cgroup has 100 processes > > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > > killing the processes of 1MB rss? > > > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > > quite fair, it can simply hide large processes from being selected. > > > Users > > > who configure cgroups in a unified hierarchy for other resource > > > constraints are penalized for this choice even though the mem cgroup with > > > 100 processes of 1MB rss each may not be limited itself. > > > > > > I think for this comparison to be fair, it requires accounting for the > > > root mem cgroup itself or for a different accounting methodology for leaf > > > memory cgroups. > > > > This is basically a workaround, because we don't have necessary stats for > > root > > memory cgroup. If we'll start gathering them at some point, we can change > > this > > and treat root memcg exactly as other leaf cgroups. > > > > I understand why it currently cannot be an apples vs apples comparison > without, as I suggest in the last paragraph, that the same accounting is > done for the root mem cgroup, which is intuitive if it is to be considered > on the same basis as leaf mem cgroups. 
> > I understand for the design to work that leaf mem cgroups and the root mem > cgroup must be compared if processes can be attached to the root mem > cgroup. My point is that it is currently completely unfair as I've > stated: you can have 100 processes attached to the root mem cgroup with > rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each > and the oom killer is going to target the leaf mem cgroup as a result of > this apples vs oranges comparison. > > In case it's not clear, the 100 processes of 80MB rss each are the most > likely contributors to a system-wide oom kill. Unfortunately, the > heuristic introduced by this patchset is broken wrt a fair comparison of > the root mem cgroup usage. > > > Or, if someone will come with an idea of a better approximation, it can be > > implemented as a separate enhancement on top of the initial implementation. > > This is more than welcome. > > > > We don't need a better approximation, we need a fair comparison. The > heuristic that this patchset is implementing is based on the usage of > individual mem cgroups. For the root mem cgroup to be considered > eligible, we need to understand its usage. That usage is _not_ what is > implemented by this patchset, which is the largest rss of a single > attached process. This, in fact, is not an "approximation" at all. In > the example of 100 processes attached with 80MB rss each, the usage of > the root mem cgroup is _not_ 80MB. > > I'll restate that oom killing a process is a last resort for the kernel, > but it also must be able to make a smart decision. Targeting dozens of > 1MB processes instead of 80MB processes because of a shortcoming in this > implementation is not the appropriate selection, it's the opposite of the > correct selection. 
> > > > I'll reiterate what I did on the last version of the patchset: > > > considering > > > only leaf memory cgroups easily allows users to defeat this heuristic and > > > bias against all of their memory usage up to the largest process size > > > amongst the set of processes attached. If the user creates N child mem > > > cgroups for their N processes and attaches one process to each child, the > > > _only_ thing this achieved is to defeat your heuristic and prefer other > > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > > > Effectively: > > > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > > > will radically shift the heuristic from a score of all anonymous + > > > unevictable memory for all processes to a score of the largest anonymous + > > > unevictable memory for a single process. There is no downside or > > > ramification for the end user in doing this. When comparing cgroups based > > > on usage, it only makes sense to compare the hierarchical usage of that > > > cgroup so that attaching processes to descendants or splitting the > > > implementation of a process into several smaller individual processes > > > does > > > not allow this heuristic to be defeated. > > > > To all
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Tue 10-10-17 14:13:00, David Rientjes wrote: [...] > For these reasons: unfair comparison of root mem cgroup usage to bias > against that mem cgroup from oom kill in system oom conditions, the > ability of users to completely evade the oom killer by attaching all > processes to child cgroups either purposefully or unpurposefully, and the > inability of userspace to effectively control oom victim selection: > > Nacked-by: David Rientjes

I consider this NACK rather dubious. Evading the heuristic as you describe requires root privileges in the default configuration, because normal users are not allowed to create subtrees. If you really want to delegate a subtree to an untrusted entity, then you do not have to opt in to this oom strategy. We can work on additional means which would cover those cases as well (e.g. a priority-based one, which is requested for other usecases).

A similar argument applies to the root memcg evaluation. While the proposed behavior is not optimal, it would work for the general usecase described here, where the root memcg doesn't really run any large number of tasks. If somebody explicitly opts in to the new strategy and it doesn't work well for their usecase, we can enhance the behavior. That alone is not a reason to nack the whole thing.

I find it really disturbing that you keep nacking this approach just because it doesn't suit your specific usecase, while it doesn't break it. Moreover, it has been stated several times already that future improvements are possible and would cover what you have described.
-- 
Michal Hocko
SUSE Labs
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > This seems to unfairly bias the root mem cgroup depending on process > > > size. > > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > > based on different criteria: the root mem cgroup as (mostly) the largest > > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > > unreclaimable slab pages charged to it by all processes. > > > > > > I imagine a configuration where the root mem cgroup has 100 processes > > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > > killing the processes of 1MB rss? > > > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > > quite fair, it can simply hide large processes from being selected. > > > Users > > > who configure cgroups in a unified hierarchy for other resource > > > constraints are penalized for this choice even though the mem cgroup with > > > 100 processes of 1MB rss each may not be limited itself. > > > > > > I think for this comparison to be fair, it requires accounting for the > > > root mem cgroup itself or for a different accounting methodology for leaf > > > memory cgroups. > > > > This is basically a workaround, because we don't have necessary stats for > > root > > memory cgroup. If we'll start gathering them at some point, we can change > > this > > and treat root memcg exactly as other leaf cgroups. > > > > I understand why it currently cannot be an apples vs apples comparison > without, as I suggest in the last paragraph, that the same accounting is > done for the root mem cgroup, which is intuitive if it is to be considered > on the same basis as leaf mem cgroups. 
> > I understand for the design to work that leaf mem cgroups and the root mem > cgroup must be compared if processes can be attached to the root mem > cgroup. My point is that it is currently completely unfair as I've > stated: you can have 100 processes attached to the root mem cgroup with > rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each > and the oom killer is going to target the leaf mem cgroup as a result of > this apples vs oranges comparison. > > In case it's not clear, the 100 processes of 80MB rss each is the most > likely contributor to a system-wide oom kill. Unfortunately, the > heuristic introduced by this patchset is broken wrt a fair comparison of > the root mem cgroup usage. > > > Or, if someone will come with an idea of a better approximation, it can be > > implemented as a separate enhancement on top of the initial implementation. > > This is more than welcome. > > > > We don't need a better approximation, we need a fair comparison. The > heuristic that this patchset is implementing is based on the usage of > individual mem cgroups. For the root mem cgroup to be considered > eligible, we need to understand its usage. That usage is _not_ what is > implemented by this patchset, which is the largest rss of a single > attached process. This, in fact, is not an "approximation" at all. In > the example of 100 processes attached with 80MB rss each, the usage of > the root mem cgroup is _not_ 80MB. It's hard to imagine a "healthy" setup with 100 processes of that size in the root memory cgroup, and even if we kill 1 process we will still have 99 remaining processes. I agree with you up to a point, but it's not a real-world example. > > I'll restate that oom killing a process is a last resort for the kernel, > but it also must be able to make a smart decision. Targeting dozens of > 1MB processes instead of 80MB processes because of a shortcoming in this > implementation is not the appropriate selection, it's the opposite of the > correct selection. 
> > > > I'll reiterate what I did on the last version of the patchset: > > > considering > > > only leaf memory cgroups easily allows users to defeat this heuristic and > > > bias against all of their memory usage up to the largest process size > > > amongst the set of processes attached. If the user creates N child mem > > > cgroups for their N processes and attaches one process to each child, the > > > _only_ thing this achieved is to defeat your heuristic and prefer other > > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > > > Effectively: > > > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > > > will radically shift the heuristic from a score of all anonymous + > > > unevictable memory for all processes to a score of the largest anonymous + > > > unevictable memory for a single process. There is no downside or > > > ramification for the end user in doing this. When comparing cgroups based > > > on usage, it only makes sense to compare the hierarchical usage of that >
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Tue, 10 Oct 2017, Roman Gushchin wrote: > > This seems to unfairly bias the root mem cgroup depending on process size. > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > based on different criteria: the root mem cgroup as (mostly) the largest > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > unreclaimable slab pages charged to it by all processes. > > > > I imagine a configuration where the root mem cgroup has 100 processes > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > killing the processes of 1MB rss? > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > quite fair, it can simply hide large processes from being selected. Users > > who configure cgroups in a unified hierarchy for other resource > > constraints are penalized for this choice even though the mem cgroup with > > 100 processes of 1MB rss each may not be limited itself. > > > > I think for this comparison to be fair, it requires accounting for the > > root mem cgroup itself or for a different accounting methodology for leaf > > memory cgroups. > > This is basically a workaround, because we don't have necessary stats for root > memory cgroup. If we'll start gathering them at some point, we can change this > and treat root memcg exactly as other leaf cgroups. > I understand why it currently cannot be an apples vs apples comparison without, as I suggest in the last paragraph, that the same accounting is done for the root mem cgroup, which is intuitive if it is to be considered on the same basis as leaf mem cgroups. I understand for the design to work that leaf mem cgroups and the root mem cgroup must be compared if processes can be attached to the root mem cgroup. 
My point is that it is currently completely unfair as I've stated: you can have 100 processes attached to the root mem cgroup with rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each and the oom killer is going to target the leaf mem cgroup as a result of this apples vs oranges comparison. In case it's not clear, the 100 processes of 80MB rss each is the most likely contributor to a system-wide oom kill. Unfortunately, the heuristic introduced by this patchset is broken wrt a fair comparison of the root mem cgroup usage. > Or, if someone will come with an idea of a better approximation, it can be > implemented as a separate enhancement on top of the initial implementation. > This is more than welcome. > We don't need a better approximation, we need a fair comparison. The heuristic that this patchset is implementing is based on the usage of individual mem cgroups. For the root mem cgroup to be considered eligible, we need to understand its usage. That usage is _not_ what is implemented by this patchset, which is the largest rss of a single attached process. This, in fact, is not an "approximation" at all. In the example of 100 processes attached with 80MB rss each, the usage of the root mem cgroup is _not_ 80MB. I'll restate that oom killing a process is a last resort for the kernel, but it also must be able to make a smart decision. Targeting dozens of 1MB processes instead of 80MB processes because of a shortcoming in this implementation is not the appropriate selection, it's the opposite of the correct selection. > > I'll reiterate what I did on the last version of the patchset: considering > > only leaf memory cgroups easily allows users to defeat this heuristic and > > bias against all of their memory usage up to the largest process size > > amongst the set of processes attached. 
If the user creates N child mem > > cgroups for their N processes and attaches one process to each child, the > > _only_ thing this achieved is to defeat your heuristic and prefer other > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > Effectively: > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > will radically shift the heuristic from a score of all anonymous + > > unevictable memory for all processes to a score of the largest anonymous + > > unevictable memory for a single process. There is no downside or > > ramification for the end user in doing this. When comparing cgroups based > > on usage, it only makes sense to compare the hierarchical usage of that > > cgroup so that attaching processes to descendants or splitting the > > implementation of a process into several smaller individual processes does > > not allow this heuristic to be defeated. > > To everything said previously I can only add that cgroup v2 allows limiting > the number of cgroups in a sub-tree: > 1a926e0bbab8 ("cgroup: implement hierarchy limits"). > So the solution to for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done evading all oom
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Mon, Oct 09, 2017 at 02:52:53PM -0700, David Rientjes wrote: > On Thu, 5 Oct 2017, Roman Gushchin wrote: > > > Traditionally, the OOM killer is operating on a process level. > > Under oom conditions, it finds a process with the highest oom score > > and kills it. > > > > This behavior doesn't suit well the system with many running > > containers: > > > > 1) There is no fairness between containers. A small container with > > few large processes will be chosen over a large one with huge > > number of small processes. > > > > 2) Containers often do not expect that some random process inside > > will be killed. In many cases much safer behavior is to kill > > all tasks in the container. Traditionally, this was implemented > > in userspace, but doing it in the kernel has some advantages, > > especially in a case of a system-wide OOM. > > > > I'd move the second point to the changelog for the next patch since this > patch doesn't implement any support for memory.oom_group. There is a special remark later in the changelog explaining that this functionality will be added by the following patches. I thought it would be useful to have all the basic ideas in one place. > > > To address these issues, the cgroup-aware OOM killer is introduced. > > > > Under OOM conditions, it looks for the biggest leaf memory cgroup > > and kills the biggest task belonging to it. The following patches > > will extend this functionality to consider non-leaf memory cgroups > > as well, and also provide an ability to kill all tasks belonging > > to the victim cgroup. > > > > The root cgroup is treated as a leaf memory cgroup, so it's score > > is compared with leaf memory cgroups. > > Due to memcg statistics implementation a special algorithm > > is used for estimating it's oom_score: we define it as maximum > > oom_score of the belonging tasks. > > > > This seems to unfairly bias the root mem cgroup depending on process size. 
> It isn't treated fairly as a leaf mem cgroup if they are being compared > based on different criteria: the root mem cgroup as (mostly) the largest > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > unreclaimable slab pages charged to it by all processes. > > I imagine a configuration where the root mem cgroup has 100 processes > attached each with rss of 80MB, compared to a leaf cgroup with 100 > processes of 1MB rss each. How does this logic prevent repeatedly oom > killing the processes of 1MB rss? > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > quite fair, it can simply hide large processes from being selected. Users > who configure cgroups in a unified hierarchy for other resource > constraints are penalized for this choice even though the mem cgroup with > 100 processes of 1MB rss each may not be limited itself. > > I think for this comparison to be fair, it requires accounting for the > root mem cgroup itself or for a different accounting methodology for leaf > memory cgroups. This is basically a workaround, because we don't have necessary stats for root memory cgroup. If we'll start gathering them at some point, we can change this and treat root memcg exactly as other leaf cgroups. Or, if someone will come with an idea of a better approximation, it can be implemented as a separate enhancement on top of the initial implementation. This is more than welcome. > > I'll reiterate what I did on the last version of the patchset: considering > only leaf memory cgroups easily allows users to defeat this heuristic and > bias against all of their memory usage up to the largest process size > amongst the set of processes attached. If the user creates N child mem > cgroups for their N processes and attaches one process to each child, the > _only_ thing this achieved is to defeat your heuristic and prefer other > leaf cgroups simply because those other leaf cgroups did not do this. 
> > Effectively: > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > will radically shift the heuristic from a score of all anonymous + > unevictable memory for all processes to a score of the largest anonymous + > unevictable memory for a single process. There is no downside or > ramification for the end user in doing this. When comparing cgroups based > on usage, it only makes sense to compare the hierarchical usage of that > cgroup so that attaching processes to descendants or splitting the > implementation of a process into several smaller individual processes does > not allow this heuristic to be defeated. To everything said previously I can only add that cgroup v2 allows limiting the number of cgroups in a sub-tree: 1a926e0bbab8 ("cgroup: implement hierarchy limits"). > This is racy because mem_cgroup_select_oom_victim() found an eligible > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible > process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a >
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Mon 09-10-17 14:52:53, David Rientjes wrote: > On Thu, 5 Oct 2017, Roman Gushchin wrote: > > > Traditionally, the OOM killer is operating on a process level. > > Under oom conditions, it finds a process with the highest oom score > > and kills it. > > > > This behavior doesn't suit well the system with many running > > containers: > > > > 1) There is no fairness between containers. A small container with > > few large processes will be chosen over a large one with huge > > number of small processes. > > > > 2) Containers often do not expect that some random process inside > > will be killed. In many cases much safer behavior is to kill > > all tasks in the container. Traditionally, this was implemented > > in userspace, but doing it in the kernel has some advantages, > > especially in a case of a system-wide OOM. > > > > I'd move the second point to the changelog for the next patch since this > patch doesn't implement any support for memory.oom_group. > > > To address these issues, the cgroup-aware OOM killer is introduced. > > > > Under OOM conditions, it looks for the biggest leaf memory cgroup > > and kills the biggest task belonging to it. The following patches > > will extend this functionality to consider non-leaf memory cgroups > > as well, and also provide an ability to kill all tasks belonging > > to the victim cgroup. > > > > The root cgroup is treated as a leaf memory cgroup, so it's score > > is compared with leaf memory cgroups. > > Due to memcg statistics implementation a special algorithm > > is used for estimating it's oom_score: we define it as maximum > > oom_score of the belonging tasks. > > > > This seems to unfairly bias the root mem cgroup depending on process size. 
> It isn't treated fairly as a leaf mem cgroup if they are being compared > based on different criteria: the root mem cgroup as (mostly) the largest > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > unreclaimable slab pages charged to it by all processes. > > I imagine a configuration where the root mem cgroup has 100 processes > attached each with rss of 80MB, compared to a leaf cgroup with 100 > processes of 1MB rss each. How does this logic prevent repeatedly oom > killing the processes of 1MB rss? > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > quite fair, it can simply hide large processes from being selected. Users > who configure cgroups in a unified hierarchy for other resource > constraints are penalized for this choice even though the mem cgroup with > 100 processes of 1MB rss each may not be limited itself. > > I think for this comparison to be fair, it requires accounting for the > root mem cgroup itself or for a different accounting methodology for leaf > memory cgroups. I believe this is documented in the patch. I agree with you but I also assume this will not be such a big problem in practice because usecases which are going to opt-in for the cgroup aware OOM killer will have the all workloads running in memcgs and the root will basically run only some essential system wide services needed for the overall system operation. Risk of the runaway of this should be reasonably small and killing any of those will put the system into an unstable state anyway. That being said future improvements are possible but I guess that shouldn't be a roadblock for the feature to be merged. > > +/* > > + * Checks if the given memcg is a valid OOM victim and returns a number, > > + * which means the folowing: > > + * -1: there are inflight OOM victim tasks, belonging to the memcg > > + *0: memcg is not eligible, e.g. 
all belonging tasks are protected > > + * by oom_score_adj set to OOM_SCORE_ADJ_MIN > > + * >0: memcg is eligible, and the returned value is an estimation > > + * of the memory footprint > > + */ > > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > > + const nodemask_t *nodemask, > > + unsigned long totalpages) > > +{ > > + struct css_task_iter it; > > + struct task_struct *task; > > + int eligible = 0; > > + > > + /* > > +* Memcg is OOM eligible if there are OOM killable tasks inside. > > +* > > +* We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN > > +* as unkillable. > > +* > > +* If there are inflight OOM victim tasks inside the memcg, > > +* we return -1. > > +*/ > > + css_task_iter_start(&memcg->css, 0, &it); > > + while ((task = css_task_iter_next(&it))) { > > + if (!eligible && > > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) > > + eligible = 1; > > + > > + if (tsk_is_oom_victim(task) && > > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > > + eligible = -1; > > + break; > > + } > > + } > > + css_task_iter_end(&it); > > + > > + if (eligible <= 0) > > + return
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Thu, 5 Oct 2017, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior doesn't suit well the system with many running > containers: > > 1) There is no fairness between containers. A small container with > few large processes will be chosen over a large one with huge > number of small processes. > > 2) Containers often do not expect that some random process inside > will be killed. In many cases much safer behavior is to kill > all tasks in the container. Traditionally, this was implemented > in userspace, but doing it in the kernel has some advantages, > especially in a case of a system-wide OOM. > I'd move the second point to the changelog for the next patch since this patch doesn't implement any support for memory.oom_group. > To address these issues, the cgroup-aware OOM killer is introduced. > > Under OOM conditions, it looks for the biggest leaf memory cgroup > and kills the biggest task belonging to it. The following patches > will extend this functionality to consider non-leaf memory cgroups > as well, and also provide an ability to kill all tasks belonging > to the victim cgroup. > > The root cgroup is treated as a leaf memory cgroup, so it's score > is compared with leaf memory cgroups. > Due to memcg statistics implementation a special algorithm > is used for estimating it's oom_score: we define it as maximum > oom_score of the belonging tasks. > This seems to unfairly bias the root mem cgroup depending on process size. It isn't treated fairly as a leaf mem cgroup if they are being compared based on different criteria: the root mem cgroup as (mostly) the largest rss of a single process vs leaf mem cgroups as all anon, unevictable, and unreclaimable slab pages charged to it by all processes. 
I imagine a configuration where the root mem cgroup has 100 processes attached each with rss of 80MB, compared to a leaf cgroup with 100 processes of 1MB rss each. How does this logic prevent repeatedly oom killing the processes of 1MB rss? In this case, "the root cgroup is treated as a leaf memory cgroup" isn't quite fair, it can simply hide large processes from being selected. Users who configure cgroups in a unified hierarchy for other resource constraints are penalized for this choice even though the mem cgroup with 100 processes of 1MB rss each may not be limited itself. I think for this comparison to be fair, it requires accounting for the root mem cgroup itself or for a different accounting methodology for leaf memory cgroups. > +/* > + * Checks if the given memcg is a valid OOM victim and returns a number, > + * which means the folowing: > + * -1: there are inflight OOM victim tasks, belonging to the memcg > + *0: memcg is not eligible, e.g. all belonging tasks are protected > + * by oom_score_adj set to OOM_SCORE_ADJ_MIN > + * >0: memcg is eligible, and the returned value is an estimation > + * of the memory footprint > + */ > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > +const nodemask_t *nodemask, > +unsigned long totalpages) > +{ > + struct css_task_iter it; > + struct task_struct *task; > + int eligible = 0; > + > + /* > + * Memcg is OOM eligible if there are OOM killable tasks inside. > + * > + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN > + * as unkillable. > + * > + * If there are inflight OOM victim tasks inside the memcg, > + * we return -1. 
> + */ > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + if (!eligible && > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) > + eligible = 1; > + > + if (tsk_is_oom_victim(task) && > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > + eligible = -1; > + break; > + } > + } > + css_task_iter_end(&it); > + > + if (eligible <= 0) > + return eligible; > + > + return memcg_oom_badness(memcg, nodemask, totalpages); > +} > + > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control > *oc) > +{ > + struct mem_cgroup *iter; > + > + oc->chosen_memcg = NULL; > + oc->chosen_points = 0; > + > + /* > + * The oom_score is calculated for leaf memory cgroups (including > + * the root memcg). > + */ > + rcu_read_lock(); > + for_each_mem_cgroup_tree(iter, root) { > + long score; > + > + if (memcg_has_children(iter) && iter != root_mem_cgroup) > + continue;

I'll reiterate what I did on the last version of the patchset: considering only leaf memory cgroups easily allows users to defeat this heuristic and bias against all of their memory
Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
On Thu 05-10-17 14:04:51, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior doesn't suit well the system with many running > containers: > > 1) There is no fairness between containers. A small container with > few large processes will be chosen over a large one with huge > number of small processes. > > 2) Containers often do not expect that some random process inside > will be killed. In many cases much safer behavior is to kill > all tasks in the container. Traditionally, this was implemented > in userspace, but doing it in the kernel has some advantages, > especially in a case of a system-wide OOM. > > To address these issues, the cgroup-aware OOM killer is introduced. > > Under OOM conditions, it looks for the biggest leaf memory cgroup > and kills the biggest task belonging to it. The following patches > will extend this functionality to consider non-leaf memory cgroups > as well, and also provide an ability to kill all tasks belonging > to the victim cgroup. > > The root cgroup is treated as a leaf memory cgroup, so it's score > is compared with leaf memory cgroups. > Due to memcg statistics implementation a special algorithm > is used for estimating it's oom_score: we define it as maximum > oom_score of the belonging tasks. 
> > Signed-off-by: Roman Gushchin > Cc: Michal Hocko > Cc: Vladimir Davydov > Cc: Johannes Weiner > Cc: Tetsuo Handa > Cc: David Rientjes > Cc: Andrew Morton > Cc: Tejun Heo > Cc: kernel-t...@fb.com > Cc: cgro...@vger.kernel.org > Cc: linux-doc@vger.kernel.org > Cc: linux-ker...@vger.kernel.org > Cc: linux...@kvack.org

Assuming this is an opt-in
Acked-by: Michal Hocko

> --- > include/linux/memcontrol.h | 17 + > include/linux/oom.h| 12 +++- > mm/memcontrol.c| 172 > + > mm/oom_kill.c | 70 +- > 4 files changed, 251 insertions(+), 20 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 69966c461d1c..75b63b68846e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -35,6 +35,7 @@ struct mem_cgroup; > struct page; > struct mm_struct; > struct kmem_cache; > +struct oom_control; > > /* Cgroup-specific page state, on top of universal node page state */ > enum memcg_stat_item { > @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct > cgroup_subsys_state *css){ > return css ? 
container_of(css, struct mem_cgroup, css) : NULL; > } > > +static inline void mem_cgroup_put(struct mem_cgroup *memcg) > +{ > + css_put(&memcg->css); > +} > + > #define mem_cgroup_from_counter(counter, member) \ > container_of(counter, struct mem_cgroup, member) > > @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct > *p) > > bool mem_cgroup_oom_synchronize(bool wait); > > +bool mem_cgroup_select_oom_victim(struct oom_control *oc); > + > #ifdef CONFIG_MEMCG_SWAP > extern int do_swap_account; > #endif > @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct > *task, > return true; > } > > +static inline void mem_cgroup_put(struct mem_cgroup *memcg) > +{ > +} > + > static inline struct mem_cgroup * > mem_cgroup_iter(struct mem_cgroup *root, > struct mem_cgroup *prev, > @@ -936,6 +948,11 @@ static inline > void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) > { > } > + > +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + return false; > +} > #endif /* CONFIG_MEMCG */ > > /* idx can be of type enum memcg_stat_item or node_stat_item */ > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 76aac4ce39bc..ca78e2d5956e 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -9,6 +9,13 @@ > #include /* MMF_* */ > #include /* VM_FAULT* */ > > + > +/* > + * Special value returned by victim selection functions to indicate > + * that there are inflight OOM victims. 
> + */ > +#define INFLIGHT_VICTIM ((void *)-1UL) > + > struct zonelist; > struct notifier_block; > struct mem_cgroup; > @@ -39,7 +46,8 @@ struct oom_control { > > /* Used by oom implementation, do not set */ > unsigned long totalpages; > - struct task_struct *chosen; > + struct task_struct *chosen_task; > + struct mem_cgroup *chosen_memcg; > unsigned long chosen_points; > }; > > @@ -101,6 +109,8 @@ extern void oom_killer_enable(void); > > extern struct task_struct *find_lock_task_mm(struct task_struct *p); > > +extern int oom_evaluate_task(struct task_struct *task, void *arg); > + > /* sysctls */ >