Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-13 Thread David Rientjes
On Fri, 13 Oct 2017, Roman Gushchin wrote:

> > Think about it in a different way: we currently compare per-process usage 
> > and userspace has /proc/pid/oom_score_adj to adjust that usage depending 
> > on priorities of that process and still oom kill if there's a memory leak.  
> > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer 
> > after all.  We don't need a strict memory.oom_priority that outranks all 
> > other sibling cgroups regardless of usage.  We need a memory.oom_score_adj 
> > to adjust the per-cgroup usage.  The decisionmaking in your earlier 
> > example would be under the control of C/memory.oom_score_adj and 
> > D/memory.oom_score_adj.  Problem solved.
> > 
> > It also solves the problem of userspace being able to influence oom victim 
> > selection so now they can protect important cgroups just like we can 
> > protect important processes today.
> > 
> > And since this would be hierarchical usage, you can trivially infer root 
> > mem cgroup usage by subtraction of top-level mem cgroup usage.
> > 
> > This is a powerful solution to the problem and gives userspace the control 
> > they need so that it can work in all usecases, not a subset of usecases.
> 
> You're right that a per-cgroup oom_score_adj may resolve the issue with
> the overly strict semantics of oom_priorities. But I believe nobody likes
> the existing per-process oom_score_adj interface, and there are reasons
> behind that.

The previous heuristic, before I rewrote the oom killer, used 
/proc/pid/oom_adj, which acted as a bitshift on mm->total_vm and was a 
much more difficult interface to use, as I'm sure you can imagine.  People 
ended up only using it to polarize selection: -17 to oom disable a 
process, -16 to bias against it, or +15 to prefer it.  Nobody used 
anything in between, and I worked with openssh, udev, kde, and chromium to 
reach a consensus on the oom_score_adj semantics.  People do use it to 
protect against memory leaks and to prevent oom killing important 
processes when something else can be sacrificed, unless there's a leak.
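
For illustration, here is a rough user-space sketch (not kernel code)
contrasting the old oom_adj bitshift with the oom_score_adj linear bias.
The exact kernel formulas have varied across versions, so treat the
arithmetic below as an approximation rather than the authoritative
implementation:

#include <stdio.h>

/* old heuristic: oom_adj in [-16, 15], -17 == OOM_DISABLE */
static unsigned long old_badness(unsigned long total_vm, int oom_adj)
{
    unsigned long points = total_vm;

    if (oom_adj == -17)
        return 0;               /* never select */
    if (oom_adj > 0)
        points <<= oom_adj;     /* exponential preference */
    else
        points >>= -oom_adj;    /* exponential protection */
    return points;
}

/* current heuristic: oom_score_adj in [-1000, 1000] is a linear bias
 * proportional to total memory (negative results are clamped in practice) */
static long new_badness(long rss_pages, int oom_score_adj, long totalpages)
{
    if (oom_score_adj == -1000)
        return 0;               /* never select */
    return rss_pages + (long)oom_score_adj * totalpages / 1000;
}

int main(void)
{
    long totalpages = 4L * 1024 * 1024;    /* ~16GB of 4k pages, made up */

    printf("old, oom_adj=-16: %lu\n", old_badness(100000, -16));
    printf("old, oom_adj=15:  %lu\n", old_badness(100000, 15));
    printf("new, adj=-500:    %ld\n", new_badness(100000, -500, totalpages));
    printf("new, adj=500:     %ld\n", new_badness(100000, 500, totalpages));
    return 0;
}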

> Especially in the case of a memcg OOM, getting an idea of how exactly
> oom_score_adj will work is not trivial.

I suggest defining it in the terms used for previous iterations of the 
patchset: do hierarchical scoring so that each level of the hierarchy has 
usage information for each subtree.  You can get root mem cgroup usage 
with complete fairness by subtraction with this method.  When comparing 
usage at each level of the hierarchy, you can propagate the eligibility of 
processes in that subtree much like you do today.  I agree with your 
change to make the oom killer a no-op if selection races with the actual 
killing rather than falling back to the old heuristic.  I'm happy to help 
add a Tested-by once we settle the other issues with that change.

At each level, I would state that memory.oom_score_adj has the exact same 
semantics as /proc/pid/oom_score_adj.  In this case, it would simply be 
defined as a proportion of the parent's limit.  If the hierarchy is 
iterated starting at the root mem cgroup for system ooms and at the root 
of the oom memcg for memcg ooms, this should lead to the exact same oom 
killing behavior, which is desired.
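
For what it's worth, here is a purely hypothetical user-space sketch of the
kind of level-by-level comparison described above, with a per-cgroup
oom_score_adj in [-1000, 1000] interpreted as a proportion of the parent's
limit.  None of these structures or names come from the patchset; this only
illustrates the proposed decisionmaking for something like the earlier
C/memory.oom_score_adj and D/memory.oom_score_adj example:

#include <stdio.h>

struct cg {
    const char *name;
    long usage;             /* hierarchical usage, in pages */
    long limit;             /* limit, in pages */
    int oom_score_adj;      /* [-1000, 1000] */
    struct cg *child;       /* first child */
    struct cg *sibling;     /* next sibling */
};

/* usage biased by a proportion of the parent's limit */
static long adjusted_usage(const struct cg *c, long parent_limit)
{
    return c->usage + (long)c->oom_score_adj * parent_limit / 1000;
}

/* walk down the hierarchy, picking the "largest" subtree at each level */
static const struct cg *select_victim_subtree(const struct cg *root)
{
    const struct cg *cur = root;

    while (cur->child) {
        const struct cg *best = NULL, *it;

        for (it = cur->child; it; it = it->sibling)
            if (!best || adjusted_usage(it, cur->limit) >
                         adjusted_usage(best, cur->limit))
                best = it;
        cur = best;
    }
    return cur;     /* a leaf; a task would then be chosen inside it */
}

int main(void)
{
    struct cg d = { "D", 3000, 4000, 0, NULL, NULL };
    struct cg c = { "C", 5000, 8000, -300, NULL, &d };      /* protected */
    struct cg root = { "root", 8000, 16000, 0, &c, NULL };

    /* C uses more, but its negative adjustment makes D the victim subtree */
    printf("victim subtree: %s\n", select_victim_subtree(&root)->name);
    return 0;
}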

This solution would address the three concerns that I had: it allows the 
root mem cgroup to be compared fairly with leaf mem cgroups (with the 
bonus of not requiring root mem cgroup accounting thanks to your heuristic 
using global vmstats), it allows userspace to influence the decisionmaking 
so that users can protect cgroups that use 50% of memory because they are 
important, and it completely avoids users being able to change victim 
selection simply by creating child mem cgroups.

This would be a very powerful patchset.


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-13 Thread Roman Gushchin
On Thu, Oct 12, 2017 at 02:50:38PM -0700, David Rientjes wrote:
> On Wed, 11 Oct 2017, Roman Gushchin wrote:
> 
> Think about it in a different way: we currently compare per-process usage 
> and userspace has /proc/pid/oom_score_adj to adjust that usage depending 
> on priorities of that process and still oom kill if there's a memory leak.  
> Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer 
> after all.  We don't need a strict memory.oom_priority that outranks all 
> other sibling cgroups regardless of usage.  We need a memory.oom_score_adj 
> to adjust the per-cgroup usage.  The decisionmaking in your earlier 
> example would be under the control of C/memory.oom_score_adj and 
> D/memory.oom_score_adj.  Problem solved.
> 
> It also solves the problem of userspace being able to influence oom victim 
> selection so now they can protect important cgroups just like we can 
> protect important processes today.
> 
> And since this would be hierarchical usage, you can trivially infer root 
> mem cgroup usage by subtraction of top-level mem cgroup usage.
> 
> This is a powerful solution to the problem and gives userspace the control 
> they need so that it can work in all usecases, not a subset of usecases.

You're right that a per-cgroup oom_score_adj may resolve the issue with
the overly strict semantics of oom_priorities. But I believe nobody likes
the existing per-process oom_score_adj interface, and there are reasons
behind that.
Especially in the case of a memcg OOM, getting an idea of how exactly
oom_score_adj will work is not trivial.
For example, earlier in this thread I showed a case where the decision of
which of two processes should be killed depends on whether it's a global
or a memcg-wide OOM, even though both belong to a single cgroup!

Of course, it's technically trivial to implement some analog of oom_score_adj
for cgroups (and early versions of this patchset did that).
But the right question is: is this an interface we want to support
for many years to come? I'm not sure.


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-12 Thread David Rientjes
On Wed, 11 Oct 2017, Roman Gushchin wrote:

> > But let's move the discussion forward to fix it.  To avoid necessarily 
> > accounting memory to the root mem cgroup, have we considered if it is even 
> > necessary to address the root mem cgroup?  For the users who opt-in to 
> > this heuristic, would it be possible to discount the root mem cgroup from 
> > the heuristic entirely so that oom kills originate from leaf mem cgroups?  
> > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if 
> > the victim rss is greater than an eligible victim rss attached to the root 
> > mem cgroup?
> 
> David, I'm not claiming to have implemented the best possible accounting
> for the root memory cgroup, and I'm sure there is room for further
> enhancement. But if it's not leading to some obviously stupid victim
> selection (like ignoring a leaking task which consumes most of the memory),
> I don't see why it should be treated as a blocker for the whole patchset.
> I also doubt that any of us has such examples, and the best way to get
> them is to get some real usage feedback.
> 
> Ignoring oom_score_adj, subtracting the sum of leaf usage from system
> usage, etc.: these are all perfectly good ideas which can be implemented
> on top of this patchset.
> 

For the root mem cgroup to be compared to leaf mem cgroups, it needs a 
fair comparison, not something that we leave to some future patches on top 
of this patchset.  We can't compare some cgroups with other cgroups based 
on different criteria depending on which cgroup is involved.  It's 
actually quite a trivial problem to address; it was a small modification 
to your hierarchical usage patchset if that's the way that you elect to 
fix it.

I know that some of our customers use cgroups only for one or two jobs on 
the system, and that isn't necessarily just for memory limitation.  The 
fact remains that, without considering the root mem cgroup fairly, these 
customers are unfairly biased against because they have aggregated 
their processes in a cgroup.  This is not a highly specialized usecase; I 
am positive that many users use cgroups only for a subset of processes.  
This heuristic penalizes that behavior to prefer them as oom victims.

The problem needs to be fixed instead of asking for the patchset to be 
merged and hoping that we'll address these issues later.  If you account for 
hierarchical usage, you can easily subtract this from global vmstats to 
get an implicit root usage.
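
As a minimal sketch of that subtraction, assuming the hierarchical usage of
every top-level mem cgroup is known and a global figure is available from
vmstats (all numbers below are made up for illustration):

#include <stdio.h>

int main(void)
{
    /* global anon + unevictable + unreclaimable slab, from vmstats */
    long global_usage = 900000;

    /* hierarchical usage of each top-level mem cgroup */
    long top_level[] = { 300000, 250000, 150000 };
    long sum = 0;
    unsigned i;

    for (i = 0; i < sizeof(top_level) / sizeof(top_level[0]); i++)
        sum += top_level[i];

    /* whatever remains is attributable to the root mem cgroup itself */
    printf("implicit root usage: %ld pages\n", global_usage - sum);
    return 0;
}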

> > You would be required to discount oom_score_adj because the heuristic 
> > doesn't account for oom_score_adj when comparing the anon + unevictable + 
> > unreclaimable slab of leaf mem cgroups.  This wouldn't result in the 
> > correct victim selection in real-world scenarios where processes attached 
> > to the root mem cgroup are vital to the system and not part of any user 
> > job, i.e. they are important system daemons and the "activity manager" 
> > responsible for orchestrating the cgroup hierarchy.
> > 
> > It's also still unfair because it now compares
> > [sum of rss of processes attached to a cgroup] to
> > [anon + unevictable + unreclaimable slab usage of a cgroup].  RSS isn't 
> > going to be a solution, regardless of whether it's one process or all processes, if 
> > it's being compared to more types of memory in leaf cgroups.
> > 
> > If we really don't want root mem cgroup accounting so this is a fair 
> > comparison, I think the heuristic needs to special case the root mem 
> > cgroup either by discounting root oom kills if there are eligible oom 
> > kills from leaf cgroups (the user would be opting-in to this behavior) or 
> > comparing the badness of a victim from a leaf cgroup to the badness of a 
> > victim from the root cgroup when deciding which to kill and allow the user 
> > to protect root mem cgroup processes with oom_score_adj.
> > 
> > That aside, all of this has only addressed one of the three concerns with 
> > the patchset.
> > 
> > I believe the solution to avoid allowing users to circumvent oom kill is 
> > to account usage up the hierarchy as you have done in the past.  Cgroup 
> > hierarchies can be managed by the user so they can create their own 
> > subcontainers, this is nothing new, and I would hope that you wouldn't 
> > limit your feature to only a very specific set of usecases.  That may be 
> > your solution for the root mem cgroup itself: if the hierarchical usage of 
> > all top-level mem cgroups is known, it's possible to find the root mem 
> > cgroup usage by subtraction, since you are already using global vmstats
> > in your heuristic.
> > 
> > Accounting usage up the hierarchy avoids the first two concerns with the 
> > patchset.  It allows you to implicitly understand the usage of the root 
> > mem cgroup itself, and does not allow users to circumvent oom kill by 
> > creating subcontainers, either purposefully or not.  The third concern, 
> > userspace influence, can allow users to attack leaf mem cgroups deeper in 
> > the tree if 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-12 Thread Michal Hocko
On Wed 11-10-17 13:27:44, David Rientjes wrote:
> On Wed, 11 Oct 2017, Michal Hocko wrote:
> 
> > > For these reasons: unfair comparison of root mem cgroup usage to bias 
> > > against that mem cgroup from oom kill in system oom conditions, the 
> > > ability of users to completely evade the oom killer by attaching all 
> > > processes to child cgroups either purposefully or unpurposefully, and the 
> > > inability of userspace to effectively control oom victim selection:
> > > 
> > > Nacked-by: David Rientjes 
> > 
> > I consider this NACK rather dubious. Evading the heuristic as you
> > describe requires root privileges in the default configuration because
> > normal users are not allowed to create subtrees. If you really want to
> > delegate a subtree to an untrusted entity, then you do not have to opt
> > in to this oom strategy. We can work on additional means which would
> > cover those cases as well (e.g. a priority-based one, which is requested
> > for other usecases).
> > 
> 
> You're missing the point that the user is trusted and it may be doing 
> something to circumvent oom kill unknowingly.

I would really like to see a practical example of something like that. I
am not saying this is completely impossible but as already pointed out
this _can_ be addressed _on top_ of the current implementation. We will
need some way to consider hierarchies anyway.

So I really fail to see why this would be a blocker. After all, it
is no different from skipping oom selection by splitting a process
(knowingly or otherwise) into subprocesses, which is possible even
now. OOM killer selection has never been, will not be, and cannot be
perfect in principle. Quite the contrary: the more clever the heuristics
try to be, the more corner cases they might generate, as we have seen
in the past.

> With a single unified 
> hierarchy, the user is forced to attach its processes to subcontainers if 
> it wants to constrain resources with other controllers.  Doing so ends up 
> completely avoiding oom kill because of this implementation detail.  It 
> has nothing to do with trust and the admin who is opting-in will not know 
> a user has circumvented oom kill purely because it constrains its processes 
> with controllers other than the memory controller.
> 
> > A similar argument applies to the root memcg evaluation. While the
> > proposed behavior is not optimal it would work for general usecase
> > described here where the root memcg doesn't really run any large number
> > of tasks. If somebody explicitly opts in to the new strategy and it
> > doesn't work well for their usecase, we can enhance the behavior. That
> > alone is not a reason to nack the whole thing.
> > 
> > I find it really disturbing that you keep nacking this approach just
> > because it doesn't suit your specific usecase while it doesn't break
> > it. Moreover it has been stated several times already that future
> > improvements are possible and cover what you have described already.
> 
> This has nothing to do with my specific usecase.

Well, I might be really wrong, but it is hard not to notice how most of
your complaints push towards hierarchical level-by-level comparisons.
That approach has been considered and deemed unsuitable for the default
cgroup-aware oom selection because it imposes structural constraints on
how the hierarchy is organized and thus disallows many usecases. So
pushing for this just because it resembles your current in-house
implementation leaves me with the feeling that you care more about your
usecase than about general usability.
-- 
Michal Hocko
SUSE Labs


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread Roman Gushchin
On Wed, Oct 11, 2017 at 01:21:47PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > We don't need a better approximation, we need a fair comparison.  The 
> > > heuristic that this patchset is implementing is based on the usage of 
> > > individual mem cgroups.  For the root mem cgroup to be considered 
> > > eligible, we need to understand its usage.  That usage is _not_ what is 
> > > implemented by this patchset, which is the largest rss of a single 
> > > attached process.  This, in fact, is not an "approximation" at all.  In 
> > the example of 100 processes attached with 80MB rss each, the usage of 
> > > the root mem cgroup is _not_ 80MB.
> > 
> It's hard to imagine a "healthy" setup with 100 processes in the root
> memory cgroup, and even if we kill 1 process we will still have 99
> remaining processes. I agree with you up to a point, but it's not
> > a real world example.
> > 
> 
> It's an example that illustrates the problem with the unfair comparison 
> between the root mem cgroup and leaf mem cgroups.  It's unfair to compare 
> [largest rss of a single process attached to a cgroup] to
> [anon + unevictable + unreclaimable slab usage of a cgroup].  It's not an 
> approximation, as previously stated: the usage of the root mem cgroup is 
> not 80MB if there are 100 such processes attached to the root mem cgroup, 
> it's off by orders of magnitude.
> 
> For the root mem cgroup to be treated equally as a leaf mem cgroup as this 
> patchset proposes, it must have a fair comparison.  That can be done by 
> accounting memory to the root mem cgroup in the same way it is to leaf mem 
> cgroups.
> 
> But let's move the discussion forward to fix it.  To avoid necessarily 
> accounting memory to the root mem cgroup, have we considered if it is even 
> necessary to address the root mem cgroup?  For the users who opt-in to 
> this heuristic, would it be possible to discount the root mem cgroup from 
> the heuristic entirely so that oom kills originate from leaf mem cgroups?  
> Or, perhaps better, oom kill from non-memory.oom_group cgroups only if 
> the victim rss is greater than an eligible victim rss attached to the root 
> mem cgroup?

David, I'm not claiming to have implemented the best possible accounting
for the root memory cgroup, and I'm sure there is room for further
enhancement. But if it's not leading to some obviously stupid victim
selection (like ignoring a leaking task which consumes most of the memory),
I don't see why it should be treated as a blocker for the whole patchset.
I also doubt that any of us has such examples, and the best way to get
them is to get some real usage feedback.

Ignoring oom_score_adj, subtracting the sum of leaf usage from system
usage, etc.: these are all perfectly good ideas which can be implemented
on top of this patchset.

> 
> > > For these reasons: unfair comparison of root mem cgroup usage to bias 
> > > against that mem cgroup from oom kill in system oom conditions, the 
> > > ability of users to completely evade the oom killer by attaching all 
> > > processes to child cgroups either purposefully or unpurposefully, and the 
> > > inability of userspace to effectively control oom victim selection:
> > > 
> > > Nacked-by: David Rientjes 
> > 
> > So, if we sum the oom_score of tasks belonging to the root memory cgroup,
> > will it fix the problem?
> >
> > It might have some drawbacks as well (especially around oom_score_adj),
> > but it's doable if we ignore tasks which are not owners of their mm
> > struct.
> > 
> 
> You would be required to discount oom_score_adj because the heuristic 
> doesn't account for oom_score_adj when comparing the anon + unevictable + 
> unreclaimable slab of leaf mem cgroups.  This wouldn't result in the 
> correct victim selection in real-world scenarios where processes attached 
> to the root mem cgroup are vital to the system and not part of any user 
> job, i.e. they are important system daemons and the "activity manager" 
> responsible for orchestrating the cgroup hierarchy.
> 
> It's also still unfair because it now compares
> [sum of rss of processes attached to a cgroup] to
> [anon + unevictable + unreclaimable slab usage of a cgroup].  RSS isn't 
> going to be a solution, regardless of whether it's one process or all processes, if 
> it's being compared to more types of memory in leaf cgroups.
> 
> If we really don't want root mem cgroup accounting so this is a fair 
> comparison, I think the heuristic needs to special case the root mem 
> cgroup either by discounting root oom kills if there are eligible oom 
> kills from leaf cgroups (the user would be opting-in to this behavior) or 
> comparing the badness of a victim from a leaf cgroup to the badness of a 
> victim from the root cgroup when deciding which to kill and allow the user 
> to protect root mem cgroup processes with oom_score_adj.
> 
> That aside, all of this has only addressed one of the three concerns with 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread David Rientjes
On Wed, 11 Oct 2017, Michal Hocko wrote:

> > For these reasons: unfair comparison of root mem cgroup usage to bias 
> > against that mem cgroup from oom kill in system oom conditions, the 
> > ability of users to completely evade the oom killer by attaching all 
> > processes to child cgroups either purposefully or unpurposefully, and the 
> > inability of userspace to effectively control oom victim selection:
> > 
> > Nacked-by: David Rientjes 
> 
> I consider this NACK rather dubious. Evading the heuristic as you
> describe requires root privileges in the default configuration because
> normal users are not allowed to create subtrees. If you really want to
> delegate a subtree to an untrusted entity, then you do not have to opt
> in to this oom strategy. We can work on additional means which would
> cover those cases as well (e.g. a priority-based one, which is requested
> for other usecases).
> 

You're missing the point that the user is trusted and it may be doing 
something to circumvent oom kill unknowingly.  With a single unified 
hierarchy, the user is forced to attach its processes to subcontainers if 
it wants to constrain resources with other controllers.  Doing so ends up 
completely avoiding oom kill because of this implementation detail.  It 
has nothing to do with trust and the admin who is opting-in will not know 
a user has circumvented oom kill purely because it constrains its processes 
with controllers other than the memory controller.

> A similar argument applies to the root memcg evaluation. While the
> proposed behavior is not optimal it would work for general usecase
> described here where the root memcg doesn't really run any large number
> of tasks. If somebody explicitly opts in to the new strategy and it
> doesn't work well for their usecase, we can enhance the behavior. That
> alone is not a reason to nack the whole thing.
> 
> I find it really disturbing that you keep nacking this approach just
> because it doesn't suit your specific usecase while it doesn't break
> it. Moreover it has been stated several times already that future
> improvements are possible and cover what you have described already.

This has nothing to do with my specific usecase.


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread David Rientjes
On Tue, 10 Oct 2017, Roman Gushchin wrote:

> > We don't need a better approximation, we need a fair comparison.  The 
> > heuristic that this patchset is implementing is based on the usage of 
> > individual mem cgroups.  For the root mem cgroup to be considered 
> > eligible, we need to understand its usage.  That usage is _not_ what is 
> > implemented by this patchset, which is the largest rss of a single 
> > attached process.  This, in fact, is not an "approximation" at all.  In 
> > the example of 100 processes attached with 80MB rss each, the usage of 
> > the root mem cgroup is _not_ 80MB.
> 
> It's hard to imagine a "healthy" setup with 100 processes in the root
> memory cgroup, and even if we kill 1 process we will still have 99
> remaining processes. I agree with you up to a point, but it's not
> a real world example.
> 

It's an example that illustrates the problem with the unfair comparison 
between the root mem cgroup and leaf mem cgroups.  It's unfair to compare 
[largest rss of a single process attached to a cgroup] to
[anon + unevictable + unreclaimable slab usage of a cgroup].  It's not an 
approximation, as previously stated: the usage of the root mem cgroup is 
not 80MB if there are 100 such processes attached to the root mem cgroup, 
it's off by orders of magnitude.

For the root mem cgroup to be treated equally as a leaf mem cgroup as this 
patchset proposes, it must have a fair comparison.  That can be done by 
accounting memory to the root mem cgroup in the same way it is to leaf mem 
cgroups.

But let's move the discussion forward to fix it.  To avoid necessarily 
accounting memory to the root mem cgroup, have we considered if it is even 
necessary to address the root mem cgroup?  For the users who opt-in to 
this heuristic, would it be possible to discount the root mem cgroup from 
the heuristic entirely so that oom kills originate from leaf mem cgroups?  
Or, perhaps better, oom kill from non-memory.oom_group cgroups only if 
the victim rss is greater than an eligible victim rss attached to the root 
mem cgroup?

> > For these reasons: unfair comparison of root mem cgroup usage to bias 
> > against that mem cgroup from oom kill in system oom conditions, the 
> > ability of users to completely evade the oom killer by attaching all 
> > processes to child cgroups either purposefully or unpurposefully, and the 
> > inability of userspace to effectively control oom victim selection:
> > 
> > Nacked-by: David Rientjes 
> 
> So, if we sum the oom_score of tasks belonging to the root memory cgroup,
> will it fix the problem?
>
> It might have some drawbacks as well (especially around oom_score_adj),
> but it's doable if we ignore tasks which are not owners of their mm
> struct.
> 

You would be required to discount oom_score_adj because the heuristic 
doesn't account for oom_score_adj when comparing the anon + unevictable + 
unreclaimable slab of leaf mem cgroups.  This wouldn't result in the 
correct victim selection in real-world scenarios where processes attached 
to the root mem cgroup are vital to the system and not part of any user 
job, i.e. they are important system daemons and the "activity manager" 
responsible for orchestrating the cgroup hierarchy.

It's also still unfair because it now compares
[sum of rss of processes attached to a cgroup] to
[anon + unevictable + unreclaimable slab usage of a cgroup].  RSS isn't 
going to be a solution, regardless of whether it's one process or all processes, if 
it's being compared to more types of memory in leaf cgroups.

If we really don't want root mem cgroup accounting so this is a fair 
comparison, I think the heuristic needs to special case the root mem 
cgroup either by discounting root oom kills if there are eligible oom 
kills from leaf cgroups (the user would be opting-in to this behavior) or 
comparing the badness of a victim from a leaf cgroup to the badness of a 
victim from the root cgroup when deciding which to kill and allow the user 
to protect root mem cgroup processes with oom_score_adj.

That aside, all of this has only addressed one of the three concerns with 
the patchset.

I believe the solution to avoid allowing users to circumvent oom kill is 
to account usage up the hierarchy as you have done in the past.  Cgroup 
hierarchies can be managed by the user so they can create their own 
subcontainers, this is nothing new, and I would hope that you wouldn't 
limit your feature to only a very specific set of usecases.  That may be 
your solution for the root mem cgroup itself: if the hierarchical usage of 
all top-level mem cgroups is known, it's possible to find the root mem 
cgroup usage by subtraction, since you are already using global vmstats 
in your heuristic.
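
A tiny sketch of why hierarchical accounting defeats the mkdir-per-process
trick, under the assumption that a child's usage is also charged to every
ancestor (this is not the patchset's code, just an illustration):

#include <stdio.h>

struct cg {
    const char *name;
    long usage;             /* own + all descendants, in pages */
    struct cg *parent;
};

/* charge pages to a cgroup and to every ancestor up to the root */
static void charge(struct cg *c, long pages)
{
    for (; c; c = c->parent)
        c->usage += pages;
}

int main(void)
{
    struct cg root  = { "root",  0, NULL };
    struct cg job   = { "job",   0, &root };
    struct cg child = { "job/0", 0, &job };

    charge(&child, 20000);  /* process moved into its own subcontainer */
    charge(&job, 5000);     /* something still attached to the job cgroup */

    /* "job" is still scored on 25000 pages, not 5000 */
    printf("%s hierarchical usage: %ld\n", job.name, job.usage);
    printf("%s hierarchical usage: %ld\n", root.name, root.usage);
    return 0;
}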

Accounting usage up the hierarchy avoids the first two concerns with the 
patchset.  It allows you to implicitly understand the usage of the root 
mem cgroup itself, and does not allow users to circumvent oom kill by 
creating 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread Roman Gushchin
On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > This seems to unfairly bias the root mem cgroup depending on process 
> > > size.  
> > > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > > based on different criteria: the root mem cgroup as (mostly) the largest 
> > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > > unreclaimable slab pages charged to it by all processes.
> > > 
> > > I imagine a configuration where the root mem cgroup has 100 processes 
> > > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > > killing the processes of 1MB rss?
> > > 
> > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > > quite fair, it can simply hide large processes from being selected.  
> > > Users 
> > > who configure cgroups in a unified hierarchy for other resource 
> > > constraints are penalized for this choice even though the mem cgroup with 
> > > 100 processes of 1MB rss each may not be limited itself.
> > > 
> > > I think for this comparison to be fair, it requires accounting for the 
> > > root mem cgroup itself or for a different accounting methodology for leaf 
> > > memory cgroups.
> > 
> > This is basically a workaround, because we don't have necessary stats for 
> > root
> > memory cgroup. If we'll start gathering them at some point, we can change 
> > this
> > and treat root memcg exactly as other leaf cgroups.
> > 
> 
> I understand why it currently cannot be an apples vs apples comparison 
> without, as I suggest in the last paragraph, that the same accounting is 
> done for the root mem cgroup, which is intuitive if it is to be considered 
> on the same basis as leaf mem cgroups.
> 
> I understand for the design to work that leaf mem cgroups and the root mem 
> cgroup must be compared if processes can be attached to the root mem 
> cgroup.  My point is that it is currently completely unfair as I've 
> stated: you can have 100 processes attached to the root mem cgroup with 
> rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
> and the oom killer is going to target the leaf mem cgroup as a result of 
> this apples vs oranges comparison.
> 
> In case it's not clear, the 100 processes of 80MB rss each are the most 
> likely contributors to a system-wide oom kill.  Unfortunately, the 
> heuristic introduced by this patchset is broken wrt a fair comparison of 
> the root mem cgroup usage.
> 
> > Or, if someone will come with an idea of a better approximation, it can be
> > implemented as a separate enhancement on top of the initial implementation.
> > This is more than welcome.
> > 
> 
> We don't need a better approximation, we need a fair comparison.  The 
> heuristic that this patchset is implementing is based on the usage of 
> individual mem cgroups.  For the root mem cgroup to be considered 
> eligible, we need to understand its usage.  That usage is _not_ what is 
> implemented by this patchset, which is the largest rss of a single 
> attached process.  This, in fact, is not an "approximation" at all.  In 
> > the example of 100 processes attached with 80MB rss each, the usage of 
> the root mem cgroup is _not_ 80MB.
> 
> I'll restate that oom killing a process is a last resort for the kernel, 
> but it also must be able to make a smart decision.  Targeting dozens of 
> 1MB processes instead of 80MB processes because of a shortcoming in this 
> implementation is not the appropriate selection, it's the opposite of the 
> correct selection.
> 
> > > I'll reiterate what I did on the last version of the patchset: 
> > > considering 
> > > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > > bias against all of their memory usage up to the largest process size 
> > > amongst the set of processes attached.  If the user creates N child mem 
> > > cgroups for their N processes and attaches one process to each child, the 
> > > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > > leaf cgroups simply because those other leaf cgroups did not do this.
> > > 
> > > Effectively:
> > > 
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > > 
> > > will radically shift the heuristic from a score of all anonymous + 
> > > unevictable memory for all processes to a score of the largest anonymous +
> > > unevictable memory for a single process.  There is no downside or 
> > > ramification for the end user in doing this.  When comparing cgroups based 
> > > on usage, it only makes sense to compare the hierarchical usage of that 
> > > cgroup so that attaching processes to descendants or splitting the 
> > > implementation of a process into several smaller individual processes 
> > > does 
> > > not allow this heuristic to be defeated.
> > 
> > To all 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread Michal Hocko
On Tue 10-10-17 14:13:00, David Rientjes wrote:
[...]
> For these reasons: unfair comparison of root mem cgroup usage to bias 
> against that mem cgroup from oom kill in system oom conditions, the 
> ability of users to completely evade the oom killer by attaching all 
> processes to child cgroups either purposefully or unpurposefully, and the 
> inability of userspace to effectively control oom victim selection:
> 
> Nacked-by: David Rientjes 

I consider this NACK rather dubious. Evading the heuristic as you
describe requires root privileges in the default configuration because
normal users are not allowed to create subtrees. If you really want to
delegate a subtree to an untrusted entity, then you do not have to opt
in to this oom strategy. We can work on additional means which would
cover those cases as well (e.g. a priority-based one, which is requested
for other usecases).

A similar argument applies to the root memcg evaluation. While the
proposed behavior is not optimal it would work for general usecase
described here where the root memcg doesn't really run any large number
of tasks. If somebody explicitly opts in to the new strategy and it
doesn't work well for their usecase, we can enhance the behavior. That
alone is not a reason to nack the whole thing.

I find it really disturbing that you keep nacking this approach just
because it doesn't suit your specific usecase while it doesn't break
it. Moreover it has been stated several times already that future
improvements are possible and cover what you have described already.
-- 
Michal Hocko
SUSE Labs


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-10 Thread Roman Gushchin
On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > This seems to unfairly bias the root mem cgroup depending on process 
> > > size.  
> > > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > > based on different criteria: the root mem cgroup as (mostly) the largest 
> > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > > unreclaimable slab pages charged to it by all processes.
> > > 
> > > I imagine a configuration where the root mem cgroup has 100 processes 
> > > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > > killing the processes of 1MB rss?
> > > 
> > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > > quite fair, it can simply hide large processes from being selected.  
> > > Users 
> > > who configure cgroups in a unified hierarchy for other resource 
> > > constraints are penalized for this choice even though the mem cgroup with 
> > > 100 processes of 1MB rss each may not be limited itself.
> > > 
> > > I think for this comparison to be fair, it requires accounting for the 
> > > root mem cgroup itself or for a different accounting methodology for leaf 
> > > memory cgroups.
> > 
> > This is basically a workaround, because we don't have necessary stats for 
> > root
> > memory cgroup. If we'll start gathering them at some point, we can change 
> > this
> > and treat root memcg exactly as other leaf cgroups.
> > 
> 
> I understand why it currently cannot be an apples vs apples comparison 
> without, as I suggest in the last paragraph, that the same accounting is 
> done for the root mem cgroup, which is intuitive if it is to be considered 
> on the same basis as leaf mem cgroups.
> 
> I understand for the design to work that leaf mem cgroups and the root mem 
> cgroup must be compared if processes can be attached to the root mem 
> cgroup.  My point is that it is currently completely unfair as I've 
> stated: you can have 100 processes attached to the root mem cgroup with 
> rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
> and the oom killer is going to target the leaf mem cgroup as a result of 
> this apples vs oranges comparison.
> 
> In case it's not clear, the 100 processes of 80MB rss each are the most 
> likely contributors to a system-wide oom kill.  Unfortunately, the 
> heuristic introduced by this patchset is broken wrt a fair comparison of 
> the root mem cgroup usage.
> 
> > Or, if someone will come with an idea of a better approximation, it can be
> > implemented as a separate enhancement on top of the initial implementation.
> > This is more than welcome.
> > 
> 
> We don't need a better approximation, we need a fair comparison.  The 
> heuristic that this patchset is implementing is based on the usage of 
> individual mem cgroups.  For the root mem cgroup to be considered 
> eligible, we need to understand its usage.  That usage is _not_ what is 
> implemented by this patchset, which is the largest rss of a single 
> attached process.  This, in fact, is not an "approximation" at all.  In 
> the example of 100 processes attached with 80MB rss each, the usage of 
> the root mem cgroup is _not_ 80MB.

It's hard to imagine a "healthy" setup with 100 processes in the root
memory cgroup, and even if we kill 1 process we will still have 99
remaining processes. I agree with you up to a point, but it's not
a real world example.

> 
> I'll restate that oom killing a process is a last resort for the kernel, 
> but it also must be able to make a smart decision.  Targeting dozens of 
> 1MB processes instead of 80MB processes because of a shortcoming in this 
> implementation is not the appropriate selection, it's the opposite of the 
> correct selection.
> 
> > > I'll reiterate what I did on the last version of the patchset: 
> > > considering 
> > > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > > bias against all of their memory usage up to the largest process size 
> > > amongst the set of processes attached.  If the user creates N child mem 
> > > cgroups for their N processes and attaches one process to each child, the 
> > > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > > leaf cgroups simply because those other leaf cgroups did not do this.
> > > 
> > > Effectively:
> > > 
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > > 
> > > will radically shift the heuristic from a score of all anonymous + 
> > > unevictable memory for all processes to a score of the largest anonymous +
> > > unevictable memory for a single process.  There is no downside or 
> > > ramification for the end user in doing this.  When comparing cgroups based 
> > > on usage, it only makes sense to compare the hierarchical usage of that 
> 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-10 Thread David Rientjes
On Tue, 10 Oct 2017, Roman Gushchin wrote:

> > This seems to unfairly bias the root mem cgroup depending on process size.  
> > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > based on different criteria: the root mem cgroup as (mostly) the largest 
> > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > unreclaimable slab pages charged to it by all processes.
> > 
> > I imagine a configuration where the root mem cgroup has 100 processes 
> > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > killing the processes of 1MB rss?
> > 
> > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > quite fair, it can simply hide large processes from being selected.  Users 
> > who configure cgroups in a unified hierarchy for other resource 
> > constraints are penalized for this choice even though the mem cgroup with 
> > 100 processes of 1MB rss each may not be limited itself.
> > 
> > I think for this comparison to be fair, it requires accounting for the 
> > root mem cgroup itself or for a different accounting methodology for leaf 
> > memory cgroups.
> 
> This is basically a workaround, because we don't have necessary stats for root
> memory cgroup. If we'll start gathering them at some point, we can change this
> and treat root memcg exactly as other leaf cgroups.
> 

I understand why it currently cannot be an apples vs apples comparison 
without, as I suggest in the last paragraph, that the same accounting is 
done for the root mem cgroup, which is intuitive if it is to be considered 
on the same basis as leaf mem cgroups.

I understand for the design to work that leaf mem cgroups and the root mem 
cgroup must be compared if processes can be attached to the root mem 
cgroup.  My point is that it is currently completely unfair as I've 
stated: you can have 100 processes attached to the root mem cgroup with 
rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
and the oom killer is going to target the leaf mem cgroup as a result of 
this apples vs oranges comparison.

In case it's not clear, the 100 processes of 80MB rss each are the most 
likely contributors to a system-wide oom kill.  Unfortunately, the 
heuristic introduced by this patchset is broken wrt a fair comparison of 
the root mem cgroup usage.
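
To spell out the arithmetic with the numbers above, using simplified
assumptions about the scoring (the root memcg scored as the largest
single-task rss, the leaf memcg scored as the sum of memory charged to it):

#include <stdio.h>

int main(void)
{
    int nprocs = 100;
    long root_rss_mb = 80;  /* each of 100 root-attached processes */
    long leaf_rss_mb = 1;   /* each of 100 processes in the leaf memcg */

    long root_score = root_rss_mb;          /* max of a single task: 80 */
    long leaf_score = nprocs * leaf_rss_mb; /* charged to the cgroup: 100 */
    long real_root_usage = nprocs * root_rss_mb;

    printf("root score %ldMB vs leaf score %ldMB -> the leaf is killed\n",
           root_score, leaf_score);
    printf("actual root mem cgroup usage: about %ldMB\n", real_root_usage);
    return 0;
}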

> Or, if someone will come with an idea of a better approximation, it can be
> implemented as a separate enhancement on top of the initial implementation.
> This is more than welcome.
> 

We don't need a better approximation, we need a fair comparison.  The 
heuristic that this patchset is implementing is based on the usage of 
individual mem cgroups.  For the root mem cgroup to be considered 
eligible, we need to understand its usage.  That usage is _not_ what is 
implemented by this patchset, which is the largest rss of a single 
attached process.  This, in fact, is not an "approximation" at all.  In 
the example of 100 processes attached with 80MB rss each, the usage of 
the root mem cgroup is _not_ 80MB.

I'll restate that oom killing a process is a last resort for the kernel, 
but it also must be able to make a smart decision.  Targeting dozens of 
1MB processes instead of 80MB processes because of a shortcoming in this 
implementation is not the appropriate selection, it's the opposite of the 
correct selection.

> > I'll reiterate what I did on the last version of the patchset: considering 
> > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > bias against all of their memory usage up to the largest process size 
> > amongst the set of processes attached.  If the user creates N child mem 
> > cgroups for their N processes and attaches one process to each child, the 
> > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > leaf cgroups simply because those other leaf cgroups did not do this.
> > 
> > Effectively:
> > 
> > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > 
> > will radically shift the heuristic from a score of all anonymous + 
> > unevictable memory for all processes to a score of the largest anonymous +
> > unevictable memory for a single process.  There is no downside or 
> > ramification for the end user in doing this.  When comparing cgroups based 
> > on usage, it only makes sense to compare the hierarchical usage of that 
> > cgroup so that attaching processes to descendants or splitting the 
> > implementation of a process into several smaller individual processes does 
> > not allow this heuristic to be defeated.
> 
> To everything said previously I can only add that cgroup v2 allows limiting
> the number of cgroups in the subtree:
> 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> 

So the solution to 

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

evading all oom 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-10 Thread Roman Gushchin
On Mon, Oct 09, 2017 at 02:52:53PM -0700, David Rientjes wrote:
> On Thu, 5 Oct 2017, Roman Gushchin wrote:
> 
> > Traditionally, the OOM killer operates on a process level.
> > Under oom conditions, it finds the process with the highest oom score
> > and kills it.
> >
> > This behavior doesn't suit systems with many running
> > containers well:
> >
> > 1) There is no fairness between containers. A small container with
> > a few large processes will be chosen over a large one with a huge
> > number of small processes.
> >
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases a much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in the case of a system-wide OOM.
> > 
> 
> I'd move the second point to the changelog for the next patch since this 
> patch doesn't implement any support for memory.oom_group.

There is a special remark later in the changelog explaining that
this functionality will be added by the following patches. I thought
it was useful to have all the basic ideas in one place.

> 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest leaf memory cgroup
> > and kills the biggest task belonging to it. The following patches
> > will extend this functionality to consider non-leaf memory cgroups
> > as well, and also provide an ability to kill all tasks belonging
> > to the victim cgroup.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so its score
> > is compared with those of leaf memory cgroups.
> > Due to the memcg statistics implementation, a special algorithm
> > is used for estimating its oom_score: we define it as the maximum
> > oom_score of the belonging tasks.
> > 
> 
> This seems to unfairly bias the root mem cgroup depending on process size.  
> It isn't treated fairly as a leaf mem cgroup if they are being compared 
> based on different criteria: the root mem cgroup as (mostly) the largest 
> rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> unreclaimable slab pages charged to it by all processes.
> 
> I imagine a configuration where the root mem cgroup has 100 processes 
> attached each with rss of 80MB, compared to a leaf cgroup with 100 
> processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> killing the processes of 1MB rss?
> 
> In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> quite fair, it can simply hide large processes from being selected.  Users 
> who configure cgroups in a unified hierarchy for other resource 
> constraints are penalized for this choice even though the mem cgroup with 
> 100 processes of 1MB rss each may not be limited itself.
> 
> I think for this comparison to be fair, it requires accounting for the 
> root mem cgroup itself or for a different accounting methodology for leaf 
> memory cgroups.

This is basically a workaround, because we don't have the necessary stats for
the root memory cgroup. If we start gathering them at some point, we can
change this and treat the root memcg exactly as other leaf cgroups.

Or, if someone comes up with an idea for a better approximation, it can be
implemented as a separate enhancement on top of the initial implementation.
This is more than welcome.

> 
> I'll reiterate what I did on the last version of the patchset: considering 
> only leaf memory cgroups easily allows users to defeat this heuristic and 
> bias against all of their memory usage up to the largest process size 
> amongst the set of processes attached.  If the user creates N child mem 
> cgroups for their N processes and attaches one process to each child, the 
> _only_ thing this achieved is to defeat your heuristic and prefer other 
> leaf cgroups simply because those other leaf cgroups did not do this.
> 
> Effectively:
> 
> for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> 
> will radically shift the heuristic from a score of all anonymous + 
> unevictable memory for all processes to a score of the largest anonymous +
> unevictable memory for a single process.  There is no downside or 
> ramification for the end user in doing this.  When comparing cgroups based 
> on usage, it only makes sense to compare the hierarchical usage of that 
> cgroup so that attaching processes to descendants or splitting the 
> implementation of a process into several smaller individual processes does 
> not allow this heuristic to be defeated.

To everything said previously I can only add that cgroup v2 allows limiting
the number of cgroups in the subtree:
1a926e0bbab8 ("cgroup: implement hierarchy limits").

> This is racy because mem_cgroup_select_oom_victim() found an eligible 
> oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible 
> process but mem_cgroup_scan_task(oc->chosen_memcg) did not.  It means if a 
> 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-10 Thread Michal Hocko
On Mon 09-10-17 14:52:53, David Rientjes wrote:
> On Thu, 5 Oct 2017, Roman Gushchin wrote:
> 
> > Traditionally, the OOM killer operates on a process level.
> > Under oom conditions, it finds the process with the highest oom score
> > and kills it.
> >
> > This behavior doesn't suit systems with many running
> > containers well:
> >
> > 1) There is no fairness between containers. A small container with
> > a few large processes will be chosen over a large one with a huge
> > number of small processes.
> >
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases a much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in the case of a system-wide OOM.
> > 
> 
> I'd move the second point to the changelog for the next patch since this 
> patch doesn't implement any support for memory.oom_group.
> 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest leaf memory cgroup
> > and kills the biggest task belonging to it. The following patches
> > will extend this functionality to consider non-leaf memory cgroups
> > as well, and also provide an ability to kill all tasks belonging
> > to the victim cgroup.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so its score
> > is compared with those of leaf memory cgroups.
> > Due to the memcg statistics implementation, a special algorithm
> > is used for estimating its oom_score: we define it as the maximum
> > oom_score of the belonging tasks.
> > 
> 
> This seems to unfairly bias the root mem cgroup depending on process size.  
> It isn't treated fairly as a leaf mem cgroup if they are being compared 
> based on different criteria: the root mem cgroup as (mostly) the largest 
> rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> unreclaimable slab pages charged to it by all processes.
> 
> I imagine a configuration where the root mem cgroup has 100 processes 
> attached each with rss of 80MB, compared to a leaf cgroup with 100 
> processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> killing the processes of 1MB rss?
> 
> In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> quite fair, it can simply hide large processes from being selected.  Users 
> who configure cgroups in a unified hierarchy for other resource 
> constraints are penalized for this choice even though the mem cgroup with 
> 100 processes of 1MB rss each may not be limited itself.
> 
> I think for this comparison to be fair, it requires accounting for the 
> root mem cgroup itself or for a different accounting methodology for leaf 
> memory cgroups.

I believe this is documented in the patch. I agree with you, but I also
assume this will not be such a big problem in practice, because usecases
which are going to opt in to the cgroup-aware OOM killer will have all
workloads running in memcgs, and the root will basically run only some
essential system-wide services needed for overall system operation. The
risk of a runaway there should be reasonably small, and killing any of
those services would put the system into an unstable state anyway.

That being said, future improvements are possible, but I guess that
shouldn't be a roadblock for merging the feature.

> > +/*
> > + * Checks if the given memcg is a valid OOM victim and returns a number,
> > + * which means the following:
> > + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> > + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> > + *   by oom_score_adj set to OOM_SCORE_ADJ_MIN
> > + *   >0: memcg is eligible, and the returned value is an estimation
> > + *   of the memory footprint
> > + */
> > +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> > +  const nodemask_t *nodemask,
> > +  unsigned long totalpages)
> > +{
> > +   struct css_task_iter it;
> > +   struct task_struct *task;
> > +   int eligible = 0;
> > +
> > +   /*
> > +* Memcg is OOM eligible if there are OOM killable tasks inside.
> > +*
> > +* We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> > +* as unkillable.
> > +*
> > +* If there are inflight OOM victim tasks inside the memcg,
> > +* we return -1.
> > +*/
> > +   css_task_iter_start(&memcg->css, 0, &it);
> > +   while ((task = css_task_iter_next())) {
> > +   if (!eligible &&
> > +   task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> > +   eligible = 1;
> > +
> > +   if (tsk_is_oom_victim(task) &&
> > +   !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > +   eligible = -1;
> > +   break;
> > +   }
> > +   }
> > +   css_task_iter_end();
> > +
> > +   if (eligible <= 0)
> > +   return 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-09 Thread David Rientjes
On Thu, 5 Oct 2017, Roman Gushchin wrote:

> Traditionally, the OOM killer operates on a process level.
> Under oom conditions, it finds the process with the highest oom score
> and kills it.
>
> This behavior doesn't suit systems with many running
> containers well:
>
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
> 

I'd move the second point to the changelog for the next patch since this 
patch doesn't implement any support for memory.oom_group.

> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so its score
> is compared with those of leaf memory cgroups.
> Due to the memcg statistics implementation, a special algorithm
> is used for estimating its oom_score: we define it as the maximum
> oom_score of the belonging tasks.
> 

This seems to unfairly bias the root mem cgroup depending on process size.  
It isn't treated fairly as a leaf mem cgroup if they are being compared 
based on different criteria: the root mem cgroup as (mostly) the largest 
rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
unreclaimable slab pages charged to it by all processes.

I imagine a configuration where the root mem cgroup has 100 processes 
attached, each with an rss of 80MB, compared to a leaf cgroup with 100 
processes of 1MB rss each.  How does this logic prevent repeatedly oom 
killing the processes of 1MB rss?
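
To put numbers on that scenario (a back-of-the-envelope sketch assuming the 
scoring described in the changelog: the root scored as the largest single 
task, a leaf scored as the sum of its charges; 4K pages assumed):

#include <stdio.h>

int main(void)
{
        unsigned long pages_per_mb = 256;       /* 1MB / 4KB pages */
        /* root memcg: 100 processes, 80MB rss each -> max over tasks */
        unsigned long root_score = 80 * pages_per_mb;
        /* leaf memcg: 100 processes, 1MB rss each -> sum of charges */
        unsigned long leaf_score = 100 * 1 * pages_per_mb;

        /* 100MB beats 80MB, so the leaf is picked again and again even
         * though the root workload uses ~8GB in total. */
        printf("root=%lu pages, leaf=%lu pages, victim=%s\n",
               root_score, leaf_score,
               leaf_score > root_score ? "leaf" : "root");
        return 0;
}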

In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
quite fair, it can simply hide large processes from being selected.  Users 
who configure cgroups in a unified hierarchy for other resource 
constraints are penalized for this choice even though the mem cgroup with 
100 processes of 1MB rss each may not be limited itself.

I think that for this comparison to be fair, it requires either accounting 
for the root mem cgroup itself or a different accounting methodology for 
leaf memory cgroups.
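
One hypothetical way to make the two sides comparable, sketched only to 
make the point concrete (this is not something the patch implements; it 
reuses oom_badness() and the css_task_iter helpers visible elsewhere in 
the thread): estimate the root memcg by summing per-task badness rather 
than taking the maximum.

/*
 * Hypothetical sketch only: approximate root memcg usage by summing
 * per-task oom_badness() over the tasks attached to the root memcg,
 * so it is compared with leaf memcgs on (roughly) the same scale.
 * Locking and error handling are omitted.
 */
static long root_memcg_oom_badness(const nodemask_t *nodemask,
                                   unsigned long totalpages)
{
        struct css_task_iter it;
        struct task_struct *task;
        long score = 0;

        css_task_iter_start(&root_mem_cgroup->css, 0, &it);
        while ((task = css_task_iter_next(&it)))
                /* oom_badness() returns 0 for unkillable tasks */
                score += oom_badness(task, NULL, nodemask, totalpages);
        css_task_iter_end(&it);

        return score;
}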

> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the following:
> + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> + *0: memcg is not eligible, e.g. all belonging tasks are protected
> + *   by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *   of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +const nodemask_t *nodemask,
> +unsigned long totalpages)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int eligible = 0;
> +
> + /*
> +  * Memcg is OOM eligible if there are OOM killable tasks inside.
> +  *
> +  * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +  * as unkillable.
> +  *
> +  * If there are inflight OOM victim tasks inside the memcg,
> +  * we return -1.
> +  */
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (!eligible &&
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + eligible = 1;
> +
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + eligible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);
> +
> + if (eligible <= 0)
> + return eligible;
> +
> + return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter;
> +
> + oc->chosen_memcg = NULL;
> + oc->chosen_points = 0;
> +
> + /*
> +  * The oom_score is calculated for leaf memory cgroups (including
> +  * the root memcg).
> +  */
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, root) {
> + long score;
> +
> + if (memcg_has_children(iter) && iter != root_mem_cgroup)
> + continue;

I'll reiterate what I did on the last version of the patchset: considering 
only leaf memory cgroups easily allows users to defeat this heuristic and 
bias against all of their memory 
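
To make that concern concrete (a hypothetical layout with invented numbers, 
not taken from the thread): a job that spreads its usage across many child 
leaf cgroups makes each leaf look small, so an unsplit neighbour is picked 
first.

/* Hypothetical illustration: per-leaf scores in MB of charged memory. */
#include <stdio.h>

int main(void)
{
        unsigned long split_total = 10240;              /* 10GB job ... */
        unsigned long split_leaves = 100;               /* ... over 100 leaves */
        unsigned long split_leaf = split_total / split_leaves; /* ~102MB each */
        unsigned long unsplit_neighbour = 1024;         /* one 1GB leaf */

        printf("split leaf: %luMB, unsplit neighbour: %luMB, victim: %s\n",
               split_leaf, unsplit_neighbour,
               unsplit_neighbour > split_leaf ? "neighbour" : "split job");
        return 0;
}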

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko
On Thu 05-10-17 14:04:51, Roman Gushchin wrote:
> Traditionally, the OOM killer operates on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit systems with many running
> containers well:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with a huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so its score
> is compared with those of leaf memory cgroups.
> Due to the memcg statistics implementation, a special algorithm
> is used for estimating its oom_score: we define it as the maximum
> oom_score of the belonging tasks.
> 
> Signed-off-by: Roman Gushchin 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 
> Cc: Johannes Weiner 
> Cc: Tetsuo Handa 
> Cc: David Rientjes 
> Cc: Andrew Morton 
> Cc: Tejun Heo 
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Assuming this is an opt-in
Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h |  17 +
>  include/linux/oom.h|  12 +++-
>  mm/memcontrol.c| 172 +
>  mm/oom_kill.c  |  70 +-
>  4 files changed, 251 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 69966c461d1c..75b63b68846e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -35,6 +35,7 @@ struct mem_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> +struct oom_control;
>  
>  /* Cgroup-specific page state, on top of universal node page state */
>  enum memcg_stat_item {
> @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>   return css ? container_of(css, struct mem_cgroup, css) : NULL;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> + css_put(&memcg->css);
> +}
> +
>  #define mem_cgroup_from_counter(counter, member) \
>   container_of(counter, struct mem_cgroup, member)
>  
> @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
>  
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc);
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  extern int do_swap_account;
>  #endif
> @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
>   return true;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>   struct mem_cgroup *prev,
> @@ -936,6 +948,11 @@ static inline
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + return false;
> +}
>  #endif /* CONFIG_MEMCG */
>  
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 76aac4ce39bc..ca78e2d5956e 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -9,6 +9,13 @@
>  #include  /* MMF_* */
>  #include  /* VM_FAULT* */
>  
> +
> +/*
> + * Special value returned by victim selection functions to indicate
> + * that there are inflight OOM victims.
> + */
> +#define INFLIGHT_VICTIM ((void *)-1UL)
> +
>  struct zonelist;
>  struct notifier_block;
>  struct mem_cgroup;
> @@ -39,7 +46,8 @@ struct oom_control {
>  
>   /* Used by oom implementation, do not set */
>   unsigned long totalpages;
> - struct task_struct *chosen;
> + struct task_struct *chosen_task;
> + struct mem_cgroup *chosen_memcg;
>   unsigned long chosen_points;
>  };
>  
> @@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
>  /* sysctls */
>