Re: [PATCH v3 4/4] mm/docs: describe memory.low refinements

2018-04-05 Thread Johannes Weiner
On Thu, Apr 05, 2018 at 07:59:21PM +0100, Roman Gushchin wrote:
> Refine cgroup v2 docs after latest memory.low changes.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: linux...@kvack.org
> Cc: cgro...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org

Acked-by: Johannes Weiner <han...@cmpxchg.org>


Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-01-30 Thread Johannes Weiner
On Tue, Jan 30, 2018 at 01:20:11PM +0100, Michal Hocko wrote:
> From 361275a05ad7026b8f721f8aa756a4975a2c42b1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mho...@suse.com>
> Date: Tue, 30 Jan 2018 09:54:15 +0100
> Subject: [PATCH] oom, memcg: clarify root memcg oom accounting
> 
> David Rientjes has pointed out that the way the root memcg is accounted
> for by the cgroup-aware OOM killer is undocumented. Unlike regular
> cgroups, there is no accounting going on in the root memcg (mostly for
> performance reasons). Therefore we are summing up the oom_badness of its
> tasks. This might result in over-accounting because of the
> oom_score_adj setting. Document this for now.
> 
> Acked-by: Roman Gushchin <g...@fb.com>
> Signed-off-by: Michal Hocko <mho...@suse.com>

Acked-by: Johannes Weiner <han...@cmpxchg.org>


Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-15 Thread Johannes Weiner
On Sun, Jan 14, 2018 at 03:44:09PM -0800, David Rientjes wrote:
> On Sat, 13 Jan 2018, Johannes Weiner wrote:
> 
> > You don't have any control and no accounting of the stuff situated
> > inside the root cgroup, so it doesn't make sense to leave anything in
> > there while also using sophisticated containerization mechanisms like
> > this group oom setting.
> > 
> > In fact, the laptop I'm writing this email on runs an unmodified
> > mainstream Linux distribution. The only thing in the root cgroup are
> > kernel threads.
> > 
> > The decisions are good enough for the rare cases you forget something
> > in there and it explodes.
> 
> It's quite trivial to allow the root mem cgroup to be compared exactly the 
> same as another cgroup.  Please see 
> https://marc.info/?l=linux-kernel=151579459920305.

This only says "that will be fixed" and doesn't address why I care.

> > This assumes you even need one. Right now, the OOM killer picks the
> > biggest MM, so you can evade selection by forking your MM. This patch
> > allows picking the biggest cgroup, so you can evade by forking groups.
> 
> It's quite trivial to prevent any cgroup from evading the oom killer by 
> either forking their mm or attaching all their processes to subcontainers.  
> Please see https://marc.info/?l=linux-kernel=151579459920305.

This doesn't address anything I wrote.

> > It's not a new vector, and clearly nobody cares. This has never been
> > brought up against the current design that I know of.
> 
> As cgroup v2 becomes more popular, people will organize their cgroup 
> hierarchies for all controllers they need to use.  We do this today, for 
> example, by attaching some individual consumers to child mem cgroups 
> purely for the rich statistics and vmscan stats that mem cgroup provides 
> without any limitation on those cgroups.

There is no connecting tissue between what I wrote and what you wrote.

> > Note, however, that there actually *is* a way to guard against it: in
> > cgroup2 there is a hierarchical limit you can configure for the number
> > of cgroups that are allowed to be created in the subtree. See
> > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> 
> Not allowing the user to create subcontainers to track statistics to paper 
> over an obvious and acknowledged shortcoming in the design of the cgroup 
> aware oom killer seems like a pretty nasty shortcoming itself.

It's not what I proposed. There is a big difference between cgroup
fork bombs and having a couple of groups for statistics.
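
For reference, a minimal sketch of the hierarchical limit mentioned
above, assuming the cgroup.max.descendants / cgroup.max.depth files
added by 1a926e0bbab8 and a purely illustrative /sys/fs/cgroup/job
path:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* Allow at most 16 descendant cgroups, nested at most 2 levels deep. */
	if (write_str("/sys/fs/cgroup/job/cgroup.max.descendants", "16"))
		perror("cgroup.max.descendants");
	if (write_str("/sys/fs/cgroup/job/cgroup.max.depth", "2"))
		perror("cgroup.max.depth");
	return 0;
}

A couple of groups for statistics stay well under such a limit, while a
runaway cgroup fork bomb hits it immediately.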

> > > > > I proposed a solution in 
> > > > > https://marc.info/?l=linux-kernel=150956897302725, which was never 
> > > > > responded to, for all of these issues.  The idea is to do 
> > > > > hierarchical 
> > > > > accounting of mem cgroup hierarchies so that the hierarchy is 
> > > > > traversed 
> > > > > comparing total usage at each level to select target cgroups.  Admins 
> > > > > and 
> > > > > users can use memory.oom_score_adj to influence that decisionmaking 
> > > > > at 
> > > > > each level.
> > 
> > We did respond repeatedly: this doesn't work for a lot of setups.
> 
> We need to move this discussion to the active proposal at 
> https://marc.info/?l=linux-kernel=151579459920305, because it does 
> address your setup, so it's not good use of anyones time to further 
> discuss simply memory.oom_score_adj.

No, we don't.

We have a patch set that was developed, iterated and improved over 10+
revisions, based on evaluating and weighing trade-offs. It's reached a
state where the memcg maintainers are happy with it and don't have any
concern about future extendability to cover more specialized setups.

You've had nine months to contribute and shape this patch series
productively, and instead resorted to a cavalcade of polemics,
evasion, and repetition of truisms and refuted points. A ten paragraph
proposal of vague ideas at this point is simply not a valid argument.

You can send patches to replace or improve on Roman's code and make
the case for them.

Thanks


Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-13 Thread Johannes Weiner
On Wed, Jan 10, 2018 at 11:33:45AM -0800, Andrew Morton wrote:
> On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin  wrote:
> > On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > > > > This patchset makes the OOM killer cgroup-aware.
> > > > 
> > > > Thanks, I'll grab these.
> > > > 
> > > > There has been controversy over this patchset, to say the least.  I
> > > > can't say that I followed it closely!  Could those who still have
> > > > reservations please summarise their concerns and hopefully suggest a
> > > > way forward?
> > > > 
> > > 
> > > Yes, I'll summarize what my concerns have been in the past and what they 
> > > are wrt the patchset as it stands in -mm.  None of them originate from my 
> > > current usecase or anticipated future usecase of the oom killer for 
> > > system-wide or memcg-constrained oom conditions.  They are based purely 
> > > on 
> > > the patchset's use of an incomplete and unfair heuristic for deciding 
> > > which cgroup to target.
> > > 
> > > I'll also suggest simple changes to the patchset, which I have in the 
> > > past, that can be made to address all of these concerns.
> > > 
> > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > > 
> > > The patchset uses two different heuristics to compare root and leaf mem 
> > > cgroups and scores them based on number of pages.  For the root mem 
> > > cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> > > that's based on rss, swap, pgtables, and, most importantly, 
> > > oom_score_adj.  
> > > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> > > unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> > > different independent of /proc/pid/oom_score_adj, but the most obvious 
> > > unfairness comes from users who tune oom_score_adj.
> > > 
> > > An example: start a process that faults 1GB of anonymous memory and leave 
> > > it attached to the root mem cgroup.  Start six more processes that each 
> > > fault 1GB of anonymous memory and attach them to a leaf mem cgroup.  
> > > Set 
> > > all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> > > will always kill the 1GB process attached to the root mem cgroup.  It's 
> > > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> > > evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> > > it.
> > > 
> > > In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> > > pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> > > 12,652,907, eight times larger even though its usage is six times smaller.
> > > 
> > > This is caused by the patchset disregarding oom_score_adj entirely for 
> > > leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> > > the complete opposite result of what the cgroup aware oom killer 
> > > advertises.
> > > 
> > > It also works the other way, if a large memory hog is attached to the 
> > > root 
> > > mem cgroup but has a negative oom_score_adj it is never killed and random 
> > > processes are nuked solely because they happened to be attached to a leaf 
> > > mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> > > so I can't presume that it is either known nor tested.
> > > 
> > > Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> > > the same criteria by doing hierarchical accounting of usage and 
> > > subtracting from total system usage to find root usage.
> > 
> > I find this problem quite minor, because I haven't seen any practical 
> > problems
> > caused by accounting of the root cgroup memory.
> > If it's a serious problem for you, it can be solved without switching to the
> > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > substract them from global values. So, it can be a relatively small 
> > enhancement
> > on top of the current mm tree. This has nothing to do with global victim 
> > selection
> > approach.
> 
> It sounds like a significant shortcoming to me - the oom-killing
> decisions which David describes are clearly incorrect?

As others have pointed out, it's an academic problem.

You don't have any control and no accounting of the stuff situated
inside the root cgroup, so it doesn't make sense to leave anything in
there while also using sophisticated containerization mechanisms like
this group oom setting.

In fact, the laptop I'm writing this email on runs an unmodified
mainstream Linux distribution. The only thing in the root cgroup are
kernel threads.

The decisions are good enough for the rare cases you forget something
in there and it explodes.
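
For completeness, a back-of-envelope sketch of where the score gap in
David's example above comes from, assuming the oom_badness() formula
(points = rss + pgtables + swap + oom_score_adj * totalpages / 1000);
the machine size below is only inferred from the quoted numbers and is
purely illustrative:

#include <stdio.h>

int main(void)
{
	long totalpages = 12390763;	/* ~47 GB of RAM + swap, hypothetical */
	long adj = 1000;		/* /proc/pid/oom_score_adj of every task */
	long task_rss = 262144;		/* 1 GB task left in the root cgroup */
	long leaf_usage = 1572864;	/* 6 GB charged to the leaf memcg */

	/* Root memcg: summed per-task oom_badness(), which honors adj. */
	long root_score = task_rss + adj * (totalpages / 1000);

	/* Leaf memcg: charged pages only; oom_score_adj is disregarded. */
	long leaf_score = leaf_usage;

	printf("root: %ld pages, leaf: %ld pages\n", root_score, leaf_score);
	return 0;
}

With oom_score_adj left at the default of 0, the adj term vanishes and
the two scores become comparable.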

> If this can be fixed against the -mm patchset with a "relatively small
> enhancement" then please let's get that done so it can be reviewed and
> tested.

You'd have to sum up all the memory consumed by first-level 

Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

2017-10-31 Thread Johannes Weiner
On Tue, Oct 31, 2017 at 10:50:43AM -0700, Shakeel Butt wrote:
> On Tue, Oct 31, 2017 at 9:40 AM, Johannes Weiner <han...@cmpxchg.org> wrote:
> > On Tue, Oct 31, 2017 at 08:04:19AM -0700, Shakeel Butt wrote:
> >> > +
> >> > +static void select_victim_memcg(struct mem_cgroup *root, struct 
> >> > oom_control *oc)
> >> > +{
> >> > +   struct mem_cgroup *iter;
> >> > +
> >> > +   oc->chosen_memcg = NULL;
> >> > +   oc->chosen_points = 0;
> >> > +
> >> > +   /*
> >> > +* The oom_score is calculated for leaf memory cgroups (including
> >> > +* the root memcg).
> >> > +*/
> >> > +   rcu_read_lock();
> >> > +   for_each_mem_cgroup_tree(iter, root) {
> >> > +   long score;
> >> > +
> >> > +   if (memcg_has_children(iter) && iter != root_mem_cgroup)
> >> > +   continue;
> >> > +
> >>
> >> Cgroup v2 does not support charge migration between memcgs. So, there
> >> can be intermediate nodes which may contain the major charge of the
> >> processes in their leaf descendants. Skipping such intermediate nodes
> >> will kind of protect such processes from oom-killer (lower on the list
> >> to be killed). Is it ok to not handle such scenario? If yes, shouldn't
> >> we document it?
> >
> > Tasks cannot be in intermediate nodes, so the only way you can end up
> > in a situation like this is to start tasks fully, let them fault in
> > their full workingset, then create child groups and move them there.
> >
> > That has attribution problems much wider than the OOM killer: any
> > local limits you would set on a leaf cgroup like this ALSO won't
> > control the memory of its tasks - as it's all sitting in the parent.
> >
> > We created the "no internal competition" rule exactly to prevent this
> > situation.
> 
> Rather than the "no internal competition" restriction I think "charge
> migration" would have resolved that situation? Also "no internal
> competition" restriction (I am assuming 'no internal competition' is
> no tasks in internal nodes, please correct me if I am wrong) has made
> "charge migration" hard to implement and thus not added in cgroup v2.
> 
> I know this is parallel discussion and excuse my ignorance, what are
> other reasons behind "no internal competition" specifically for memory
> controller?

Sorry, but this is completely off-topic.

The rationale for this decision is in Documentation/cgroup-v2.txt.


Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

2017-10-31 Thread Johannes Weiner
On Tue, Oct 31, 2017 at 08:04:19AM -0700, Shakeel Butt wrote:
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct 
> > oom_control *oc)
> > +{
> > +   struct mem_cgroup *iter;
> > +
> > +   oc->chosen_memcg = NULL;
> > +   oc->chosen_points = 0;
> > +
> > +   /*
> > +* The oom_score is calculated for leaf memory cgroups (including
> > +* the root memcg).
> > +*/
> > +   rcu_read_lock();
> > +   for_each_mem_cgroup_tree(iter, root) {
> > +   long score;
> > +
> > +   if (memcg_has_children(iter) && iter != root_mem_cgroup)
> > +   continue;
> > +
> 
> Cgroup v2 does not support charge migration between memcgs. So, there
> can be intermediate nodes which may contain the major charge of the
> processes in their leaf descendants. Skipping such intermediate nodes
> will kind of protect such processes from oom-killer (lower on the list
> to be killed). Is it ok to not handle such scenario? If yes, shouldn't
> we document it?

Tasks cannot be in intermediate nodes, so the only way you can end up
in a situation like this is to start tasks fully, let them fault in
their full workingset, then create child groups and move them there.

That has attribution problems much wider than the OOM killer: any
local limits you would set on a leaf cgroup like this ALSO won't
control the memory of its tasks - as it's all sitting in the parent.

We created the "no internal competition" rule exactly to prevent this
situation. To be consistent with that rule, we might want to disallow
the creation of child groups once a cgroup has local memory charges.

It's trivial to change the setup sequence to create the leaf cgroup
first, then launch the workload from within.
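
A minimal sketch of that sequence, with an illustrative cgroup path and
a hypothetical "my-workload" binary:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *procs = "/sys/fs/cgroup/job/cgroup.procs";
	FILE *f;

	/* 1. Create the leaf cgroup before the workload allocates anything. */
	if (mkdir("/sys/fs/cgroup/job", 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}

	/* 2. Move ourselves into it, so all future charges land here. */
	f = fopen(procs, "w");
	if (!f || fprintf(f, "%d\n", getpid()) < 0) {
		perror(procs);
		return 1;
	}
	fclose(f);

	/* 3. Only now start the real workload in our place. */
	execlp("my-workload", "my-workload", (char *)NULL);
	perror("execlp");
	return 1;
}

That way the memory is charged to the leaf from the start and never
sits in the parent.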

Either way, this is nothing specific about the OOM killer.


Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-27 Thread Johannes Weiner
On Thu, Oct 26, 2017 at 02:03:41PM -0700, David Rientjes wrote:
> On Thu, 26 Oct 2017, Johannes Weiner wrote:
> 
> > > The nack is for three reasons:
> > > 
> > >  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> > >  cgroup from oom kill in system oom conditions,
> > > 
> > >  (2) the ability of users to completely evade the oom killer by attaching
> > >  all processes to child cgroups either purposefully or unpurposefully,
> > >  and
> > > 
> > >  (3) the inability of userspace to effectively control oom victim  
> > >  selection.
> > 
> > My apologies if my summary was too reductionist.
> > 
> > That being said, the arguments you repeat here have come up in
> > previous threads and been responded to. This doesn't change my
> > conclusion that your NAK is bogus.
> 
> They actually haven't been responded to, Roman was working through v11 and 
> made a change on how the root mem cgroup usage was calculated that was 
> better than previous iterations but still not an apples to apples 
> comparison with other cgroups.  The problem is that it the calculation for 
> leaf cgroups includes additional memory classes, so it biases against 
> processes that are moved to non-root mem cgroups.  Simply creating mem 
> cgroups and attaching processes should not independently cause them to 
> become more preferred: it should be a fair comparison between the root mem 
> cgroup and the set of leaf mem cgroups as implemented.  That is very 
> trivial to do with hierarchical oom cgroup scoring.

There is absolutely no value in your repeating the same stuff over and
over again without considering what other people are telling you.

Hierarchical oom scoring has other downsides, and most of us agree
that they aren't preferable over the differences in scoring the root
vs scoring other cgroups - in particular because the root cannot be
controlled, doesn't even have local statistics, and so is unlikely to
contain important work on a containerized system. Getting the ballpark
right for the vast majority of usecases is more than good enough here.

> Since the ability of userspace to control oom victim selection is not 
> addressed whatsoever by this patchset, and the suggested method cannot be 
> implemented on top of this patchset as you have argued because it requires 
> a change to the heuristic itself, the patchset needs to become complete 
> before being mergeable.

It is complete. It just isn't a drop-in replacement for what you've
been doing out-of-tree for years. Stop making your problem everybody
else's problem.

You can change the heuristics later, as you have done before. Or
you can add another configuration flag and we can phase out the old
mode, like we do all the time.


Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-19 Thread Johannes Weiner
On Thu, Oct 19, 2017 at 07:52:12PM +0100, Roman Gushchin wrote:
> This patchset makes the OOM killer cgroup-aware.

Hi Andrew,

I believe this code is ready for merging upstream, and it seems Michal
is in agreement. There are two main things to consider, however.

David would have really liked for this patchset to include knobs to
influence how the algorithm picks cgroup victims. The rest of us
agreed that this is beyond the scope of these patches, that the
patches don't need it to be useful, and that there is nothing
preventing anyone from adding configurability later on. David
subsequently nacked the series as he considers it incomplete. Neither
Michal nor I see technical merit in David's nack.

Michal acked the implementation, but on the condition that the new
behavior be opt-in, to not surprise existing users. I *think* we agree
that respecting the cgroup topography during global OOM is what we
should have been doing when cgroups were initially introduced; where
we disagree is that I think users shouldn't have to opt in to
improvements. We have done much more invasive changes to the victim
selection without actual regressions in the past. Further, this change
only applies to mounts of the new cgroup2. Tejun also wasn't convinced
of the risk of regression, and would likewise prefer cgroup-awareness to
be the default in cgroup2. I would ask for patch 5/6 to be dropped.

Thanks


Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Johannes Weiner
On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> [...]
> > That will silently ignore what the user writes to the memory.oom_group
> > control files across the system's cgroup tree.
> > 
> > We'll have a knob that lets the workload declare itself an indivisible
> > memory consumer, that it would like to get killed in one piece, and
> > it's silently ignored because of a mount option they forgot to pass.
> > 
> > That's not good from an interface perspective.
> 
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

A boot time knob makes less sense to me than the mount option. It
doesn't require a reboot to change this behavior, we shouldn't force
the user to reboot when a runtime configuration is possible.

But I don't see how dropping this patch as part of this series would
prevent adding modular oom policies in the future?

That said, selectable OOM policies sound like a total dead end to
me. The kernel OOM happens way too late to be useful for any kind of
resource policy already. Even now it won't prevent you from thrashing
indefinitely, with only 5% of your workload's time spent productively.

What kind of service quality do you have at this point?

The *minority* of our OOM situations (in terms of "this isn't making
real progress anymore due to a lack of memory") is even *seeing* OOM
kills at this point. And it'll get worse as storage gets faster and
memory bigger.

How is that useful as a resource arbitration point?

Then there is the question of reliability. I mean, we still don't have
a global OOM killer that is actually free from deadlocks. We don't
have reserves measured to the exact requirements of reclaim that would
guarantee recovery, the OOM reaper requires a lock that we hope isn't
taken, etc. I wouldn't want any of my fleet to rely on this for
regular operation - I'm just glad that, when we do mess up and hit
this event, we don't have to reboot.

It makes much more sense to monitor memory pressure from userspace and
smartly intervene when things turn unproductive, which is a long way
from the point where the kernel is about to *deadlock* due to memory.

Global OOM kills can still happen, but their goal should really be 1)
to save the kernel, 2) respect the integrity of a memory consumer and
3) be comprehensible to userspace. (These patches are about 2 and 3.)

But abstracting such a rudimentary and fragile deadlock avoidance
mechanism into higher-level resource management, or co-opting it as a
policy enforcement tool, is crazy to me.

And it seems reckless to present it as those things to our users by
encoding any such elaborate policy interfaces.

> > On the other hand, the only benefit of this patch is to shield users
> > from changes to the OOM killing heuristics. Yet, it's really hard to
> > imagine that modifying the victim selection process slightly could be
> > called a regression in any way. We have done that many times over,
> > without a second thought on backwards compatibility:
> > 
> > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > a63d83f427fb oom: badness heuristic rewrite
> > 778c14affaf9 mm, oom: base root bonus on current usage
> 
> yes we have changed that without a deeper considerations. Some of those
> changes are arguable (e.g. the child sacrifice heuristic). The oom badness
> heuristic rewrite has triggered quite some complains AFAIR (I remember
> Kosaki has made several attempts to revert it). I think that we are
> trying to be more careful about user visible changes than we used to be.

Whatever grumbling might have come up, it has not resulted in a revert
or a way to switch back to the old behavior. So I don't think this can
be considered an actual regression.

We change heuristics in the MM all the time. If you track for example
allocator behavior over different kernel versions, you can see how
much our caching policy, our huge page policy etc. fluctuates. The
impact of that is way bigger to regular workloads than how we go about
choosing an OOM victim.

We don't want to regress anybody, but let's also keep perspective here
and especially consider the userspace interfaces we are willing to put
in for at least the next few years, the promises we want to make, the
further fragmentation of the config space, for such a negligible risk.


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Johannes Weiner
On Thu, Oct 05, 2017 at 01:40:09AM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Johannes Weiner wrote:
> 
> > > By only considering leaf memcgs, does this penalize users if their memcg 
> > > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > > processes to be members of that memcg, which would otherwise be the 
> > > standard behavior?
> > > 
> > > What prevents me from spreading my memcg with N processes attached over N 
> > > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > > each child memcg specifically to avoid being oom killed?
> > 
> > It's no different from forking out multiple mm to avoid being the
> > biggest process.
> > 
> 
> It is, because it can quite clearly be a DoS, and was prevented with 
> Roman's earlier design of iterating usage up the hierarchy and comparing 
> siblings based on that criteria.  I know exactly why he chose that 
> implementation detail early on, and it was to prevent cases such as this 
> and to not let userspace hide from the oom killer.

This doesn't address how it's different from a single process
following the same pattern right now.

> > It's up to the parent to enforce limits on that group and prevent you
> > from being able to cause global OOM in the first place, in particular
> > if you delegate to untrusted and potentially malicious users.
> > 
> 
> Let's resolve that global oom is a real condition and getting into that 
> situation is not a userspace problem.  It's the result of overcommiting 
> the system, and is used in the enterprise to address business goals.  If 
> the above is true, and its up to memcg to prevent global oom in the first 
> place, then this entire patchset is absolutely pointless.  Limit userspace 
> to 95% of memory and when usage is approaching that limit, let userspace 
> attached to the root memcg iterate the hierarchy itself and kill from the 
> largest consumer.
> 
> This patchset exists because overcommit is real, exactly the same as 
> overcommit within memcg hierarchies is real.  99% of the time we don't run 
> into global oom because people aren't using their limits so it just works 
> out.  1% of the time we run into global oom and we need a decision to made 
> based for forward progress.  Using Michal's earlier example of admins and 
> students, a student can easily use all of his limit and also, with v10 of 
> this patchset, 99% of the time avoid being oom killed just by forking N 
> processes over N cgroups.  It's going to oom kill an admin every single 
> time.

We overcommit too, but our workloads organize themselves based on
managing their resources, not based on evading the OOM killer. I'd
wager that's true for many if not most users.

Untrusted workloads can evade the OOM killer now, and they can after
these patches are in. Nothing changed. It's not what this work tries
to address at all.

The changelogs are pretty clear on what the goal and the scope of this
is. Just because it doesn't address your highly specialized usecase
doesn't make it pointless. I think we've established that in the past.

Thanks


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Johannes Weiner
On Wed, Oct 04, 2017 at 01:27:14PM -0700, David Rientjes wrote:
> By only considering leaf memcgs, does this penalize users if their memcg 
> becomes oc->chosen_memcg purely because it has aggregated all of its 
> processes to be members of that memcg, which would otherwise be the 
> standard behavior?
> 
> What prevents me from spreading my memcg with N processes attached over N 
> child memcgs instead so that memcg_oom_badness() becomes very small for 
> each child memcg specifically to avoid being oom killed?

It's no different from forking out multiple mm to avoid being the
biggest process.

It's up to the parent to enforce limits on that group and prevent you
from being able to cause global OOM in the first place, in particular
if you delegate to untrusted and potentially malicious users.


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Johannes Weiner
On Wed, Oct 04, 2017 at 01:17:14PM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Roman Gushchin wrote:
> 
> > > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct 
> > > > *victim)
> > > > struct mm_struct *mm;
> > > > bool can_oom_reap = true;
> > > >  
> > > > +   if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > > > +   victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > > > +   put_task_struct(victim);
> > > > +   return;
> > > > +   }
> > > > +
> > > > p = find_lock_task_mm(victim);
> > > > if (!p) {
> > > > put_task_struct(victim);
> > > 
> > > Is this necessary? The callers of this function use oom_badness() to
> > > find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.
> > 
> > It is. __oom_kill_process() is used to kill all processes belonging
> > to the selected memory cgroup, so we should perform these checks
> > to avoid killing unkillable processes.
> > 
> 
> That's only true after the next patch in the series which uses the 
> oom_kill_memcg_member() callback to kill processes for oom_group, correct?  
> Would it be possible to move this check to that patch so it's more 
> obvious?

Yup, I realized it when reviewing the next patch. Moving this hunk to
the next patch would probably make sense. Although, us reviewers have
been made aware of this now, so I don't feel strongly about it. Won't
make much of a difference once the patches are merged.


Re: [v10 6/6] mm, oom, docs: describe the cgroup-aware OOM killer

2017-10-04 Thread Johannes Weiner
On Wed, Oct 04, 2017 at 04:46:38PM +0100, Roman Gushchin wrote:
> Document the cgroup-aware OOM killer.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov....@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: David Rientjes <rient...@google.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: Johannes Weiner <han...@cmpxchg.org>


Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-04 Thread Johannes Weiner
On Wed, Oct 04, 2017 at 04:46:37PM +0100, Roman Gushchin wrote:
> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
> 
> The behavior can be changed dynamically by remounting the cgroupfs.
>
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org
> ---
>  include/linux/cgroup-defs.h |  5 +
>  kernel/cgroup/cgroup.c  | 10 ++
>  mm/memcontrol.c |  3 +++
>  3 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 3e55bbd31ad1..cae5343a8b21 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -80,6 +80,11 @@ enum {
>* Enable cpuset controller in v1 cgroup to use v2 behavior.
>*/
>   CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> + /*
> +  * Enable cgroup-aware OOM killer.
> +  */
> + CGRP_GROUP_OOM = (1 << 5),
>  };
>  
>  /* cftype->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c3421ee0d230..8d8aa46ff930 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned 
> int *root_flags)
>   if (!strcmp(token, "nsdelegate")) {
>   *root_flags |= CGRP_ROOT_NS_DELEGATE;
>   continue;
> + } else if (!strcmp(token, "groupoom")) {
> + *root_flags |= CGRP_GROUP_OOM;
> + continue;
>   }
>  
>   pr_err("cgroup2: unknown option \"%s\"\n", token);
> @@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int 
> root_flags)
>   cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
>   else
>   cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> + if (root_flags & CGRP_GROUP_OOM)
> + cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
> + else
> + cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
>   }
>  }
>  
> @@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, 
> struct kernfs_root *kf_root
>  {
>   if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
>   seq_puts(seq, ",nsdelegate");
> + if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
> + seq_puts(seq, ",groupoom");
>   return 0;
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1fcd6cc353d5..2e82625bd354 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2865,6 +2865,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control 
> *oc)
>   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>   return false;
>  
> + if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
> + return false;

That will silently ignore what the user writes to the memory.oom_group
control files across the system's cgroup tree.

We'll have a knob that lets the workload declare itself an indivisible
memory consumer, that it would like to get killed in one piece, and
it's silently ignored because of a mount option they forgot to pass.

That's not good from an interface perspective.

On the other hand, the only benefit of this patch is to shield users
from changes to the OOM killing heuristics. Yet, it's really hard to
imagine that modifying the victim selection process slightly could be
called a regression in any way. We have done that many times over,
without a second thought on backwards compatibility:

5e9d834a0e0c oom: sacrifice child with highest badness score for parent
a63d83f427fb oom: badness heuristic rewrite
778c14affaf9 mm, oom: base root bonus on current usage

Let's not make the userspace interface crap because of some misguided
idea that the OOM heuristic is a hard promise to userspace. It's never
been, and nobody has complained about changes in the past.

This case is doubly silly, as the behavior change only applies to
cgroup2, which doesn't exactly have a large base of legacy users yet.

Let's just drop this 5/6 patch.


Re: [v10 4/6] mm, oom: introduce memory.oom_group

2017-10-04 Thread Johannes Weiner
On Wed, Oct 04, 2017 at 04:46:36PM +0100, Roman Gushchin wrote:
> The cgroup-aware OOM killer treats leaf memory cgroups as memory
> consumption entities and performs the victim selection by comparing
> them based on their memory footprint. Then it kills the biggest task
> inside the selected memory cgroup.
> 
> But there are workloads which are not tolerant of such behavior.
> Killing a random task may leave the workload in a broken state.
> 
> To solve this problem, memory.oom_group knob is introduced.
> It will define whether a memory group should be treated as an
> indivisible memory consumer, compared by total memory consumption
> with other memory consumers (leaf memory cgroups and other memory
> cgroups with memory.oom_group set), and whether all belonging tasks
> should be killed if the cgroup is selected.
> 
> If set on memcg A, it means that in case of system-wide OOM or
> memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
> belonging to the sub-tree of A will be killed. If the OOM event is
> scoped to a descendant cgroup (A/B, for example), only tasks in
> that cgroup can be affected. OOM killer will never touch any tasks
> outside of the scope of the OOM event.
> 
> Also, tasks with oom_score_adj set to -1000 will not be killed.
> 
> The default value is 0.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Those semantics make sense to me and the code looks good.

Acked-by: Johannes Weiner <han...@cmpxchg.org>
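
For context, a minimal sketch of how a workload manager would opt a job
into this semantic, assuming the memory.oom_group file lands as
described and using an illustrative cgroup path:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Declare the job an indivisible memory consumer: on OOM, kill it
	 * in one piece rather than picking a random task inside it. */
	int fd = open("/sys/fs/cgroup/job/memory.oom_group", O_WRONLY);

	if (fd < 0 || write(fd, "1", 1) != 1) {
		perror("memory.oom_group");
		return 1;
	}
	close(fd);
	return 0;
}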


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Johannes Weiner
On Wed, Oct 04, 2017 at 04:46:35PM +0100, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit systems with many running containers well:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with a huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so its score
> is compared with leaf memory cgroups.
> Due to the memcg statistics implementation, a special algorithm
> is used for estimating its oom_score: we define it as the maximum
> oom_score of the belonging tasks.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

This looks good to me.

Acked-by: Johannes Weiner <han...@cmpxchg.org>

I just have one question:

> @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct 
> *victim)
>   struct mm_struct *mm;
>   bool can_oom_reap = true;
>  
> + if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> + victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> + put_task_struct(victim);
> + return;
> + }
> +
>   p = find_lock_task_mm(victim);
>   if (!p) {
>   put_task_struct(victim);

Is this necessary? The callers of this function use oom_badness() to
find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.


Re: [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-04 Thread Johannes Weiner
On Wed, Oct 04, 2017 at 04:46:34PM +0100, Roman Gushchin wrote:
> Implement mem_cgroup_scan_tasks() functionality for the root
> memory cgroup, so that the cgroup-aware OOM killer can use it to
> look for an OOM victim task in the root memory cgroup.
> 
> The root memory cgroup is treated as a leaf cgroup, so only tasks
> which directly belong to the root cgroup are iterated over.
> 
> This patch doesn't introduce any functional change as
> mem_cgroup_scan_tasks() is never called for the root memcg.
> This is preparatory work for the cgroup-aware OOM killer,
> which will use this function to iterate over tasks belonging
> to the root memcg.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Acked-by: Michal Hocko <mho...@suse.com>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: Johannes Weiner <han...@cmpxchg.org>
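
As a sketch of what this enables, the OOM killer (or any other memcg
code) can now walk tasks in the root group with the same callback
interface it uses elsewhere; the signature below is assumed from
<linux/memcontrol.h> and the snippet is illustrative, not part of the
patch:

/* Kernel-side sketch; needs <linux/memcontrol.h>, <linux/oom.h>,
 * <linux/sched.h>. */
static int count_eligible(struct task_struct *task, void *arg)
{
	unsigned int *count = arg;

	if (!is_global_init(task) && !(task->flags & PF_KTHREAD))
		(*count)++;
	return 0;	/* zero means: keep iterating */
}

static unsigned int root_memcg_eligible_tasks(void)
{
	unsigned int count = 0;

	mem_cgroup_scan_tasks(root_mem_cgroup, count_eligible, &count);
	return count;
}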


Re: [v10 1/6] mm, oom: refactor the oom_kill_process() function

2017-10-04 Thread Johannes Weiner
On Wed, Oct 04, 2017 at 04:46:33PM +0100, Roman Gushchin wrote:
> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> an intention to re-use the second half: __oom_kill_process().
> 
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for each task, as well as play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Acked-by: Michal Hocko <mho...@kernel.org>
> Acked-by: David Rientjes <rient...@google.com>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: Johannes Weiner <han...@cmpxchg.org>


Re: [v8 0/4] cgroup-aware OOM killer

2017-10-02 Thread Johannes Weiner
On Mon, Oct 02, 2017 at 01:24:25PM -0700, Shakeel Butt wrote:
> On Mon, Oct 2, 2017 at 12:56 PM, Michal Hocko  wrote:
> > On Mon 02-10-17 12:45:18, Shakeel Butt wrote:
> >> > I am sorry to cut the rest of your proposal because it simply goes over
> >> > the scope of the proposed solution while the usecase you are mentioning
> >> > is still possible. If we want to compare intermediate nodes (which seems
> >> > to be the case) then we can always provide a knob to opt-in - be it your
> >> > oom_gang or others.
> >>
> >> In the Roman's proposed solution we can already force the comparison
> >> of intermediate nodes using 'oom_group', I am just requesting to
> >> separate the killall semantics from it.
> >
> > oom_group _is_ about killall semantic.  And comparing killable entities
> > is just a natural thing to do. So I am not sure what you mean
> >
> 
> I am saying decouple the notion of comparable entities and killable entities.

Feel free to send patches in a new thread.

We don't need this level of control for this series to be useful - to
us, and other users. It can easily be added on top of Roman's work.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-26 Thread Johannes Weiner
On Tue, Sep 26, 2017 at 03:30:40PM +0200, Michal Hocko wrote:
> On Tue 26-09-17 13:13:00, Roman Gushchin wrote:
> > On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote:
> > > On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
> > > > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> > > > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> > > > > [...]
> > > > > > I'm not against this model, as I've said before. It feels logical,
> > > > > > and will work fine in most cases.
> > > > > > 
> > > > > > In this case we can drop any mount/boot options, because it 
> > > > > > preserves
> > > > > > the existing behavior in the default configuration. A big advantage.
> > > > > 
> > > > > I am not sure about this. We still need an opt-in, ragardless, because
> > > > > selecting the largest process from the largest memcg != selecting the
> > > > > largest task (just consider memcgs with many processes example).
> > > > 
> > > > As I understand Johannes, he suggested to compare individual processes 
> > > > with
> > > > group_oom mem cgroups. In other words, always select a killable entity 
> > > > with
> > > > the biggest memory footprint.
> > > > 
> > > > This is slightly different from my v8 approach, where I treat leaf 
> > > > memcgs
> > > > as indivisible memory consumers independent on group_oom setting, so
> > > > by default I'm selecting the biggest task in the biggest memcg.
> > > 
> > > My reading is that he is actually proposing the same thing I've been
> > > mentioning. Simply select the biggest killable entity (leaf memcg or
> > > group_oom hierarchy) and either kill the largest task in that entity
> > > (for !group_oom) or the whole memcg/hierarchy otherwise.
> > 
> > He wrote the following:
> > "So I'm leaning toward the second model: compare all oomgroups and
> > standalone tasks in the system with each other, independent of the
> > failed hierarchical control structure. Then kill the biggest of them."
> 
> I will let Johannes to comment but I believe this is just a
> misunderstanding. If we compared only the biggest task from each memcg
> then we are basically losing our fairness objective, aren't we?

Sorry about the confusion.

Yeah I was making the case for what Michal proposed, to kill the
biggest terminal consumer, which is either a task or an oomgroup.

You'd basically iterate through all the tasks and cgroups in the
system and pick the biggest task that isn't in an oom group or the
biggest oom group and then kill that.

Yeah, you'd have to compare the memory footprints of tasks with the
memory footprints of cgroups. These aren't defined identically, and
tasks don't get attributed every type of allocation that a cgroup
would. But it should get us in the ballpark, and I cannot picture a
scenario where this would lead to a completely undesirable outcome.
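
Spelled out as illustrative pseudocode (not the code in Roman's
series): every standalone task and every oom_group memcg is one
"terminal consumer", and the single biggest one is killed as a unit.

#include <stdbool.h>
#include <stddef.h>

struct consumer {
	bool kill_all;		/* oom_group memcg: kill every task in it */
	unsigned long pages;	/* memory footprint attributed to it */
};

static struct consumer *pick_victim(struct consumer *all, size_t n)
{
	struct consumer *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (!best || all[i].pages > best->pages)
			best = &all[i];

	/* Kill every task in it if best->kill_all, else just that task. */
	return best;
}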


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-25 Thread Johannes Weiner
On Mon, Sep 25, 2017 at 02:24:00PM +0200, Michal Hocko wrote:
> I would really appreciate some feedback from Tejun, Johannes here.
> 
> On Wed 20-09-17 14:53:41, Roman Gushchin wrote:
> > On Mon, Sep 18, 2017 at 08:14:05AM +0200, Michal Hocko wrote:
> > > On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> > > > On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
> [...]
> > > > > But then you just enforce a structural restriction on your 
> > > > > configuration
> > > > > because
> > > > >           root
> > > > >          /    \
> > > > >         A      D
> > > > >        / \
> > > > >       B   C
> > > > > 
> > > > > is a different thing than
> > > > >   root
> > > > > / | \
> > > > >B  C  D
> > > > >
> > > > 
> > > > I actually don't have a strong argument against an approach to select
> > > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > > not much difference.
> > 
> > I've tried to implement this approach, and it's really arguable.
> > Although your example looks reasonable, the opposite example is also valid:
> > you might want to compare whole hierarchies, and it's a quite typical 
> > usecase.
> > 
> > Assume, you have several containerized workloads on a machine (probably,
> > each will be contained in a memcg with memory.max set), with some hierarchy
> > of cgroups inside. Then in case of global memory shortage we want to reclaim
> > some memory from the biggest workload, and the selection should not depend
> > on group_oom settings. It would be really strange if setting group_oom
> > raised the chances of being killed.
> > 
> > In other words, let's imagine processes as leaf nodes in memcg tree. We 
> > decided
> > to select the biggest memcg and kill one or more processes inside (depending
> > on group_oom setting), but the memcg selection doesn't depend on it.
> > We do not compare processes from different cgroups, as well as cgroups with
> > processes. The same should apply to cgroups: why do we want to compare 
> > cgroups
> > from different sub-trees?
> > 
> > While size-based comparison can be implemented with this approach,
> > the priority-based is really weird (as David mentioned).
> > If priorities have no hierarchical meaning at all, we lack the very 
> > important
> > ability to enforce hierarchy oom_priority. Otherwise we have to invent some
> > complex rules of oom_priority propagation (e.g. if someone is raising
> > the oom_priority in parent, should it be applied to children immediately, 
> > etc).
> 
> I would really forget about the priority at this stage. This needs
> really much more thinking and I consider the David's usecase very
> specialized to use it as a template for a general purpose oom
> prioritization. I might be wrong here of course...

No, I agree.

> > In any case, OOM is a last resort mechanism. The goal is to reclaim some 
> > memory
> > and do not crash the system or do not leave it in totally broken state.
> > Any really complex mm in userspace should be applied _before_ OOM happens.
> > So, I don't think we have to support all possible configurations here,
> > if we're able to achieve the main goal (kill some processes and do not leave
> > broken systems/containers).
> 
> True but we want to have the semantic reasonably understandable. And it
> is quite hard to explain that the oom killer hasn't selected the largest
> memcg just because it happened to be in a deeper hierarchy which has
> been configured to cover a different resource.

Going back to Michal's example, say the user configured the following:

        root
       /    \
      A      D
     / \
    B   C

A global OOM event happens and we find this:
- A > D
- B, C, D are oomgroups

What the user is telling us is that B, C, and D are compound memory
consumers. They cannot be divided into their task parts from a memory
point of view.

However, the user doesn't say the same for A: the A subtree summarizes
and controls aggregate consumption of B and C, but without groupoom
set on A, the user says that A is in fact divisible into independent
memory consumers B and C.

If we don't have to kill all of A, but we'd have to kill all of D,
does it make sense to compare the two?

Let's consider an extreme case of this conundrum:

         root
        /    \
       A      B
     / | \    |
 A1 ... A1000 B1

Again we find:
- A > B
- A1 to A1000 and B1 are oomgroups
But:
- A1 to A1000 individually are tiny, B1 is huge

Going level by level, we'd pick A as the bigger hierarchy in the
system, and then kill off one of the tiny groups A1 to A1000.

Conversely, going for biggest consumer regardless of hierarchy, we'd
compare A1 to A1000 and B1, then pick B1 as the biggest single atomic
memory consumer in the system and kill all its tasks.

Which one of these two fits both the purpose and our historic approach
to OOM killing better?

As was noted in this thread, OOM is the last resort to avoid a memory
deadlock. Killing the biggest consumer is most likely to resolve this

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-21 Thread Johannes Weiner
On Thu, Sep 21, 2017 at 02:17:25PM -0700, David Rientjes wrote:
> On Thu, 21 Sep 2017, Johannes Weiner wrote:
> 
> > That's a ridiculous nak.
> > 
> > The fact that this patch series doesn't solve your particular problem
> > is not a technical argument to *reject* somebody else's work to solve
> > a different problem. It's not a regression when behavior is completely
> > unchanged unless you explicitly opt into a new functionality.
> > 
> > So let's stay reasonable here.
> > 
> 
> The issue is that if you opt-in to the new feature, then you are forced to 
> change /proc/pid/oom_score_adj of all processes attached to a cgroup that 
> you do not want oom killed based on size to be oom disabled.

You're assuming that most people would want to influence the oom
behavior in the first place. I think the opposite is the case: most
people don't care as long as the OOM killer takes the intent the user
has expressed wrt runtime containerization/grouping into account.

> The kernel provides no other remedy without oom priorities since the
> new feature would otherwise disregard oom_score_adj.

As of v8, it respects this setting and doesn't kill min score tasks.

> The nack originates from the general need for userspace influence
> over oom victim selection and to avoid userspace needing to take the
> rather drastic measure of setting all processes to be oom disabled
> to prevent oom kill in kernels before oom priorities are introduced.

As I said, we can discuss this in a separate context. Because again, I
really don't see how the lack of configurability in an opt-in feature
would diminish its value for many people who don't even care to adjust
and influence this behavior.

> > The patch series has merit as it currently stands. It makes OOM
> > killing in a cgrouped system fairer and less surprising. Whether you
> > have the ability to influence this in a new way is an entirely
> > separate discussion. It's one that involves ABI and user guarantees.
> > 
> > Right now Roman's patches make no guarantees on how the cgroup tree is
> > descended. But once we define an interface for prioritization, it
> > locks the victim algorithm into place to a certain extent.
> > 
> 
> The patchset compares memory cgroup size relative to sibling cgroups only, 
> the same comparison for memory.oom_priority.  There is a guarantee 
> provided on how cgroup size is compared in select_victim_memcg(), it 
> hierarchically accumulates the "size" from leaf nodes up to the root memcg 
> and then iterates the tree comparing sizes between sibling cgroups to 
> choose a victim memcg.  That algorithm could be more elaborately described 
> in the documentation, but we simply cannot change the implementation of 
> select_victim_memcg() later even without oom priorities since users cannot 
> get inconsistent results after opting into a feature between kernel 
> versions.  I believe the selection criteria should be implemented to be 
> deterministic, as select_victim_memcg() does, and the documentation should 
> fully describe what the selection criteria is, and then allow the user to 
> decide.

I wholeheartedly disagree. We have changed the behavior multiple times
in the past. In fact, you have arguably done the most drastic changes
to the algorithm since the OOM killer was first introduced. E.g.

a63d83f427fb oom: badness heuristic rewrite

And that's completely fine. Because this thing is not a resource
management tool for userspace, it's the kernel saving itself. At best
in a manner that's not too surprising to userspace.

To me, your argument behind the NAK still boils down to "this doesn't
support my highly specialized usecase." But since it doesn't prohibit
your usecase - which isn't even supported upstream, btw - this really
doesn't carry much weight.

I'd say if you want configurability on top of Roman's code, please
submit patches and push the case for these in a separate effort.

Thanks


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-21 Thread Johannes Weiner
On Mon, Sep 11, 2017 at 01:44:39PM -0700, David Rientjes wrote:
> On Mon, 11 Sep 2017, Roman Gushchin wrote:
> 
> > This patchset makes the OOM killer cgroup-aware.
> > 
> > v8:
> >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> >   - Make the whole thing opt-in with cgroup mount option control
> >   - Drop oom_priority for further discussions
> 
> Nack, we specifically require oom_priority for this to function correctly, 
> otherwise we cannot prefer to kill from low priority leaf memcgs as 
> required.  v8 appears to implement new functionality that we want, to 
> compare two memcgs based on usage, but without the ability to influence 
> that decision to protect important userspace, so now I'm in a position 
> where (1) nothing has changed if I don't use the new mount option or (2) I 
> get completely different oom kill selection with the new mount option but 
> not the ability to influence it.  I was much happier with the direction 
> that v7 was taking, but since v8 causes us to regress without the ability 
> to change memcg priority, this has to be nacked.

That's a ridiculous nak.

The fact that this patch series doesn't solve your particular problem
is not a technical argument to *reject* somebody else's work to solve
a different problem. It's not a regression when behavior is completely
unchanged unless you explicitly opt into a new functionality.

So let's stay reasonable here.

The patch series has merit as it currently stands. It makes OOM
killing in a cgrouped system fairer and less surprising. Whether you
have the ability to influence this in a new way is an entirely
separate discussion. It's one that involves ABI and user guarantees.

Right now Roman's patches make no guarantees on how the cgroup tree is
descended. But once we define an interface for prioritization, it
locks the victim algorithm into place to a certain extent.

It also involves a discussion about how much control userspace should
have over OOM killing in the first place. It's a last-minute effort to
save the kernel from deadlocking on memory. Whether that is the time
and place to have userspace make clever resource management decisions
is an entirely different thing than what Roman is doing.

But this patch series doesn't prevent any such future discussion and
implementations, and it's not useless without it. So let's not
conflate these two things, and hold the priority patch for now.

Thanks.


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-07 Thread Johannes Weiner
On Wed, Sep 06, 2017 at 10:28:59AM +0200, Michal Hocko wrote:
> On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> > The cgroup-awareness in the OOM killer is exactly the same thing. It
> > should have been the default from the beginning, because the user
> > configures a group of tasks to be an interdependent, terminal unit of
> > memory consumption, and it's undesirable for the OOM killer to ignore
> > this intention and compare members across these boundaries.
> 
> I would agree if that was true in general. I can completely see how the
> cgroup awareness is useful in e.g. containerized environments (especially
> with kill-all enabled) but memcgs are used in a large variety of
> usecases and I cannot really say all of them really demand the new
> semantic. Say I have a workload which doesn't want to see reclaim
> interference from others on the same machine. Why should I kill a
> process from that particular memcg just because it is the largest one
> when there is a memory hog/leak outside of this memcg?

Sure, it's always possible to come up with a config for which this
isn't the optimal behavior. But this is about picking a default that
makes sense to most users, and that type of cgroup usage just isn't
the common case.

> From my point of view the safest (in a sense of the least surprise)
> way to go with opt-in for the new heuristic. I am pretty sure all who
> would benefit from the new behavior will enable it while others will not
> regress in unexpected way.

This thinking simply needs to be balanced against the need to make an
unsurprising and consistent final interface.

The current behavior breaks isolation by letting tasks in different
cgroups compete with each other during an OOM kill. While you can
rightfully argue that it's possible for usecases to rely on this, you
cannot tell me that this is the least-surprising thing we can offer
users; certainly not new users, but also not many/most existing ones.

> We can talk about the way _how_ to control these oom strategies, of
> course. But I would be really reluctant to change the default which is
> used for years and people got used to it.

I really doubt there are many cgroup users that rely on that
particular global OOM behavior.

We have to agree to disagree, I guess.


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-05 Thread Johannes Weiner
On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> Why is this an opt out rather than opt-in? IMHO the original oom logic
> should be preserved by default and specific workloads should opt in for
> the cgroup aware logic. Changing the global behavior depending on
> whether cgroup v2 interface is in use is more than unexpected and IMHO
> wrong approach to take. I think we should instead go with 
> oom_strategy=[alloc_task,biggest_task,cgroup]
> 
> we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> biggest_task which is the default. You are adding cgroup and the more I
> think about the more I agree that it doesn't really make sense to try to
> fit thew new semantic into the existing one (compare tasks to kill-all
> memcgs). Just introduce a new strategy and define a new semantic from
> scratch. Memcg priority and kill-all are a natural extension of this new
> strategy. This will make the life easier and easier to understand by
> users.

oom_kill_allocating_task is actually a really good example of why
cgroup-awareness *should* be the new default.

Before we had the oom killer victim selection, we simply killed the
faulting/allocating task. While a valid answer to the problem, it's
not very fair or representative of what the user wants or intends.

Then we added code to kill the biggest offender instead, which should
have been the case from the start and was hence made the new default.
The oom_kill_allocating_task was added on the off-chance that there
might be setups who, for historical reasons, rely on the old behavior.
But our default was chosen based on what behavior is fair, expected,
and most reflective of the user's intentions.

The cgroup-awareness in the OOM killer is exactly the same thing. It
should have been the default from the beginning, because the user
configures a group of tasks to be an interdependent, terminal unit of
memory consumption, and it's undesirable for the OOM killer to ignore
this intention and compare members across these boundaries.

We should go the same way here as with kill_alloc_task: the default
should be what's sane and expected by the vast majority of our users,
with a knob (I would prefer a sysctl here, actually) to switch back in
case somebody - unexpectedly - actually relies on the old behavior.

I don't see why we couldn't change user-visible behavior here if we don't
expect anyone (quirky setups aside) to rely on it AND we provide a
knob to revert in the field for the exceptions.
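
For the record, the knob I have in mind would be as simple as a single
vm sysctl, something like this (sketch only, the name is made up):

	static int sysctl_oom_cgroup_aware = 1;

	/* in the vm_table[] in kernel/sysctl.c */
	{
		.procname	= "oom_cgroup_aware",
		.data		= &sysctl_oom_cgroup_aware,
		.maxlen		= sizeof(sysctl_oom_cgroup_aware),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &one,
	},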


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-23 Thread Johannes Weiner
Hi,

On Wed, Aug 23, 2017 at 05:20:31PM +0100, Roman Gushchin wrote:
> On Tue, Aug 22, 2017 at 01:03:44PM -0400, Johannes Weiner wrote:
> > > + css_task_iter_start(&memcg->css, 0, &it);
> > > + while ((task = css_task_iter_next(&it))) {
> > > + /*
> > > +  * If there are no tasks, or all tasks have oom_score_adj set
> > > +  * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> > > +  * don't select this memory cgroup.
> > > +  */
> > > + if (!elegible &&
> > > + (memcg->oom_kill_all_tasks ||
> > > +  task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> > > + elegible = 1;
> > 
> > This is a little awkward to read. How about something like this:
> > 
> > /*
> >  * When killing individual tasks, we respect OOM score adjustments:
> >  * at least one task in the group needs to be killable for the group
> >  * to be oomable.
> >  *
> >  * Also check that previous OOM kills have finished, and abort if
> >  * there are any pending OOM victims.
> >  */
> > oomable = memcg->oom_kill_all_tasks;
> > while ((task = css_task_iter_next(&it))) {
> > if (!oomable && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> > oomable = 1;
> > 
> > > + if (tsk_is_oom_victim(task) &&
> > > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > > + elegible = -1;
> > > + break;
> > > + }
> > > + }
> > > + css_task_iter_end(&it);
> 
> We ignore oom_score_adj if oom_kill_all_tasks is set, it's
> not reflected in your version. Anyway, I've moved the comments block
> outside and rephrased it to make more clear.

Yes it is...? We only respect the score if !oomable, which is set to
oom_kill_all_tasks.

> > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > > +{
> > > + struct mem_cgroup *iter, *parent;
> > > +
> > > + for_each_mem_cgroup_tree(iter, root) {
> > > + if (memcg_has_children(iter)) {
> > > + iter->oom_score = 0;
> > > + continue;
> > > + }
> > > +
> > > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> > > + if (iter->oom_score == -1) {
> > 
> > Please add comments to document the special returns. Maybe #defines
> > would be clearer, too.
> > 
> > > + oc->chosen_memcg = (void *)-1UL;
> > > + mem_cgroup_iter_break(root, iter);
> > > + return;
> > > + }
> > > +
> > > + if (!iter->oom_score)
> > > + continue;
> > 
> > Same here.
> > 
> > Maybe a switch would be suitable to handle the abort/no-score cases. 
> 
> Not sure about switch/defines, but I've added several comment blocks
> to describe possible return values, as well as their handling.
> Hope, it will be enough.

Sounds good.

> > >  static int memory_events_show(struct seq_file *m, void *v)
> > >  {
> > >   struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> > > @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = {
> > >   .write = memory_max_write,
> > >   },
> > >   {
> > > + .name = "oom_kill_all_tasks",
> > > + .flags = CFTYPE_NOT_ON_ROOT,
> > > + .seq_show = memory_oom_kill_all_tasks_show,
> > > + .write = memory_oom_kill_all_tasks_write,
> > > + },
> > 
> > This name is quite a mouthful and reminiscent of the awkward v1
> > interface names. It doesn't really go well with the v2 names.
> > 
> > How about memory.oom_group?
> 
> I'd prefer to have something more obvious. I've renamed
> memory.oom_kill_all_tasks to memory.oom_kill_all, which was earlier suggested
> by Vladimir. Are you ok with it?

No, we should be striving for short and sweet mnemonics that express a
concept (oom applies to group, not member tasks) instead of underscore
sentences that describe an implementation (upon oom, kill all tasks in
the group).

It's better to have newbies consult the documentation once than making
everybody deal with long and cumbersome names for the rest of time.

Like 'ls' being better than 'read_and_print_directory_contents'.


Re: [v5 1/4] mm, oom: refactor the oom_kill_process() function

2017-08-22 Thread Johannes Weiner
On Mon, Aug 14, 2017 at 07:32:09PM +0100, Roman Gushchin wrote:
> @@ -817,67 +817,12 @@ static bool task_will_free_mem(struct task_struct *task)
>   return ret;
>  }
>  
> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)

oom_kill_task()?


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-22 Thread Johannes Weiner
Hi Roman,

great work! This looks mostly good to me now. Below are some nitpicks
concerning naming and code layout, but nothing major.

On Mon, Aug 14, 2017 at 07:32:11PM +0100, Roman Gushchin wrote:
> @@ -39,6 +39,7 @@ struct oom_control {
>   unsigned long totalpages;
>   struct task_struct *chosen;
>   unsigned long chosen_points;
> + struct mem_cgroup *chosen_memcg;
>  };

Please rename 'chosen' to 'chosen_task' to make the distinction from
chosen_memcg clearer.

The ordering is a little weird too, with chosen_points in between.

chosen_task
chosen_memcg
chosen_points

?

> @@ -2639,6 +2639,181 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>   return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +   const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);

NR_SLAB_UNRECLAIMABLE is now accounted per-lruvec, which takes
nodeness into account, and so would be more accurate here.

You can get it with mem_cgroup_lruvec() and lruvec_page_state().
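
I.e. something along these lines (untested, just to show the per-node
lookup; fold it into the existing for_each_node_state() loop):

	for_each_node_state(nid, N_MEMORY) {
		struct lruvec *lruvec;

		if (nodemask && !node_isset(nid, *nodemask))
			continue;

		lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
		points += lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE);
	}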

> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}
> +
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +const nodemask_t *nodemask)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int elegible = 0;

eligible

> +
> > + css_task_iter_start(&memcg->css, 0, &it);
> > + while ((task = css_task_iter_next(&it))) {
> + /*
> +  * If there are no tasks, or all tasks have oom_score_adj set
> +  * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> +  * don't select this memory cgroup.
> +  */
> + if (!elegible &&
> + (memcg->oom_kill_all_tasks ||
> +  task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> + elegible = 1;

This is a little awkward to read. How about something like this:

/*
 * When killing individual tasks, we respect OOM score adjustments:
 * at least one task in the group needs to be killable for the group
 * to be oomable.
 *
 * Also check that previous OOM kills have finished, and abort if
 * there are any pending OOM victims.
 */
oomable = memcg->oom_kill_all_tasks;
while ((task = css_task_iter_next(&it))) {
if (!oomable && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
oomable = 1;

> + if (tsk_is_oom_victim(task) &&
> > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + elegible = -1;
> + break;
> + }
> + }
> > + css_task_iter_end(&it);

etc.

> +
> + return elegible > 0 ? memcg_oom_badness(memcg, nodemask) : elegible;

I find these much easier to read if broken up, even if it's more LOC:

if (eligible <= 0)
return eligible;

return memcg_oom_badness(memcg, nodemask);

> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter, *parent;
> +
> + for_each_mem_cgroup_tree(iter, root) {
> + if (memcg_has_children(iter)) {
> + iter->oom_score = 0;
> + continue;
> + }
> +
> + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> + if (iter->oom_score == -1) {

Please add comments to document the special returns. Maybe #defines
would be clearer, too.
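
E.g. something like this (names made up):

	/* This memcg has no eligible tasks, skip it. */
	#define MEMCG_OOM_SKIP		0
	/* A previous OOM kill in this memcg hasn't completed yet, abort. */
	#define MEMCG_OOM_ABORT		-1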

> + oc->chosen_memcg = (void *)-1UL;
> + mem_cgroup_iter_break(root, iter);
> + return;
> + }
> +
> + if (!iter->oom_score)
> + continue;

Same here.

Maybe a switch would be suitable to handle the abort/no-score cases.
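
Roughly, using the defines suggested above:

	switch (iter->oom_score) {
	case MEMCG_OOM_ABORT:
		oc->chosen_memcg = (void *)-1UL;
		mem_cgroup_iter_break(root, iter);
		return;
	case MEMCG_OOM_SKIP:
		continue;
	default:
		break;
	}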

> + for (parent = parent_mem_cgroup(iter); parent && parent != root;
> +  parent = parent_mem_cgroup(parent))
> + parent->oom_score += iter->oom_score;
> + }
> +
> + for (;;) {
> + struct cgroup_subsys_state *css;
> + struct mem_cgroup *memcg = NULL;
> + long score = LONG_MIN;
> +
> > + css_for_each_child(css, &root->css) {
> + struct mem_cgroup *iter = mem_cgroup_from_css(css);
> +
> + if 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-31 Thread Johannes Weiner
On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> > Everything the user would want to dynamically program in the kernel,
> > say with bpf, they could do in userspace and then update the scores
> > for each group and task periodically.
> 
> I am rather skeptical about dynamic scores. oom_{score_}adj has turned
> to mere oom disable/enable knobs from my experience.

That doesn't necessarily have to be a deficiency with the scoring
system. I suspect that most people simply don't care as long as the
the picks for OOM victims aren't entirely stupid.

For example, we have a lot of machines that run one class of job. If
we run OOM there isn't much preference we'd need to express; just kill
one job - the biggest, whatever - and move on. (The biggest makes
sense because if all jobs are basically equal it's as good as any
other victim, but if one has a runaway bug it goes for that.)

Where we have more than one job class, it actually is mostly one hipri
and one lopri, in which case setting a hard limit on the lopri or the
-1000 OOM score trick is enough.

How many systems run more than two clearly distinguishable classes of
workloads concurrently?

I'm sure they exist. I'm just saying it doesn't surprise me that
elaborate OOM scoring isn't all that wide-spread.

> > The only limitation is that you have to recalculate and update the
> > scoring tree every once in a while, whereas a bpf program could
> > evaluate things just-in-time. But for that to matter in practice, OOM
> > kills would have to be a fairly hot path.
> 
> I am not really sure how to reliably implement "kill the memcg with the
> largest process" strategy. And who knows how many others strategies will
> pop out.

That seems fairly contrived.

What does it mean to divide memory into subdomains, but when you run
out of physical memory you kill based on biggest task?

Sure, it frees memory and gets the system going again, so it's as good
as any answer to overcommit gone wrong, I guess. But is that something
you'd intentionally want to express from a userspace perspective?

> > > > > > > And both kinds of workloads (services/applications and individual
> > > > > > > processes run by users) can co-exist on the same host - consider 
> > > > > > > the
> > > > > > > default systemd setup, for instance.
> > > > > > > 
> > > > > > > IMHO it would be better to give users a choice regarding what they
> > > > > > > really want for a particular cgroup in case of OOM - killing the 
> > > > > > > whole
> > > > > > > cgroup or one of its descendants. For example, we could introduce 
> > > > > > > a
> > > > > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > > > > tolerate killing a descendant or not. If it can, the kernel will 
> > > > > > > pick
> > > > > > > the fattest sub-cgroup or process and check it. If it cannot, it 
> > > > > > > will
> > > > > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > > > > 
> > > > > > The last thing we want to do, is to compare processes with cgroups.
> > > > > > I agree, that we can have some option to disable the cgroup-aware 
> > > > > > OOM at all,
> > > > > > mostly for backward-compatibility. But I don't think it should be a
> > > > > > per-cgroup configuration option, which we will support forever.
> > > > > 
> > > > > I can clearly see a demand for "this is definitely more important
> > > > > container than others so do not kill" usecases. I can also see demand
> > > > > for "do not kill this container running for X days". And more are 
> > > > > likely
> > > > > to pop out.
> > > > 
> > > > That can all be done with scoring.
> > > 
> > > Maybe. But that requires somebody to tweak the scoring which can be hard
> > > from trivial.
> > 
> > Why is sorting and picking in userspace harder than sorting and
> > picking in the kernel?
> 
> Because the userspace score based approach would be much more racy,
> especially in a busy system. This could lead to unexpected behavior
> when the OOM killer would kill a different memcg than the run-away one.

How would it be easier to weigh priority against runaway detection
inside the kernel?

> > > + /*
> > >* If current has a pending SIGKILL or is exiting, t

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-25 Thread Johannes Weiner
On Thu, May 25, 2017 at 05:38:19PM +0200, Michal Hocko wrote:
> On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> > On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> [...]
> > > > How to react on an OOM - is definitely a policy, which depends
> > > > on the workload. Nothing is changing here from how it's working now,
> > > > except now kernel will choose a victim cgroup, and kill the victim 
> > > > cgroup
> > > > rather than a process.
> > > 
> > > There is a _big_ difference. The current implementation just tries
> > > to recover from the OOM situation without caring much about the
> > > consequences on the workload. This is the last resort and a service for
> > > the _system_ to get back to a sane state. You are trying to make it more
> > > clever and workload aware and that is inevitably going to depend on the
> > > specific workload. I really do think we cannot simply hardcode any
> > > policy into the kernel for this purpose and that is why I would like to
> > > see a discussion about how to do that in a more extensible way. This
> > > might be harder to implement now but I believe it will turn out
> > > better longterm.
> > 
> > And that's where I still maintain that this isn't really a policy
> > change. Because what this code does ISN'T more clever, and the OOM
> > killer STILL IS a last-resort thing.
> 
> The thing I wanted to point out is that what and how much to kill
> definitely depends on the usecase. We currently kill all tasks which
> share the mm struct because that is the smallest unit that can unpin
> user memory. And that makes a lot of sense to me as a general default.
> I would call any attempt to guess tasks belonging to the same
> workload/job as a "more clever".

Yeah, I agree it needs to be configurable. But a memory domain is not
a random guess. It's a core concept of the VM at this point. The fact
that the OOM killer cannot handle it is pretty weird and goes way
beyond "I wish we could have some smarter heuristics to choose from."

> > We don't need any elaborate
> > just-in-time evaluation of what each entity is worth. We just want to
> > kill the biggest job, not the biggest MM. Just like you wouldn't want
> > just the biggest VMA unmapped and freed, since it leaves your process
> > incoherent, killing one process leaves a job incoherent.
> > 
> > I understand that making it fully configurable is a tempting thought,
> > because you'd offload all responsibility to userspace.
> 
> It is not only tempting it is also the only place which can define
> a more advanced OOM semantic sanely IMHO.

Why do you think that?

Everything the user would want to dynamically program in the kernel,
say with bpf, they could do in userspace and then update the scores
for each group and task periodically.

The only limitation is that you have to recalculate and update the
scoring tree every once in a while, whereas a bpf program could
evaluate things just-in-time. But for that to matter in practice, OOM
kills would have to be a fairly hot path.
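
Something as dumb as this in a userspace daemon (made-up helpers and
structures, just to illustrate the point) already gives you a dynamic
policy:

	/* Re-evaluate the policy periodically instead of just-in-time. */
	for (;;) {
		struct job *job;

		for_each_job(job) {
			char path[64], buf[16];
			int adj = evaluate_policy(job);	/* -1000 .. 1000 */

			snprintf(path, sizeof(path),
				 "/proc/%d/oom_score_adj", job->main_pid);
			snprintf(buf, sizeof(buf), "%d", adj);
			write_string_to_file(path, buf);
		}
		sleep(10);
	}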

> > > > > And both kinds of workloads (services/applications and individual
> > > > > processes run by users) can co-exist on the same host - consider the
> > > > > default systemd setup, for instance.
> > > > > 
> > > > > IMHO it would be better to give users a choice regarding what they
> > > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > > cgroup or one of its descendants. For example, we could introduce a
> > > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > > 
> > > > The last thing we want to do, is to compare processes with cgroups.
> > > > I agree, that we can have some option to disable the cgroup-aware OOM 
> > > > at all,
> > > > mostly for backward-compatibility. But I don't think it should be a
> > > > per-cgroup configuration option, which we will support forever.
> > > 
> > > I can clearly see a demand for "this is definitely more important
> > > container than others so do not kill" usecases. I can also see demand
> > > for "do not kill this container running for X days". And more

Re: [PATCH] mm: per-cgroup memory reclaim stats

2017-05-24 Thread Johannes Weiner
On Thu, May 11, 2017 at 08:16:23PM +0100, Roman Gushchin wrote:
> Track the following reclaim counters for every memory cgroup:
> PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and
> PGLAZYFREED.
> 
> These values are exposed using the memory.stats interface of cgroup v2.
> 
> The meaning of each value is the same as for global counters,
> available using /proc/vmstat.
> 
> Also, for consistency, rename mem_cgroup_count_vm_event() to
> count_memcg_event_mm().
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Suggested-by: Johannes Weiner <han...@cmpxchg.org>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Li Zefan <lize...@huawei.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: Johannes Weiner <han...@cmpxchg.org>

Andrew, if there aren't any other objections, could you pick this one
up please?


Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-23 Thread Johannes Weiner
On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> > On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> > > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> > > ...
> > > > +5-2-4. Cgroup-aware OOM Killer
> > > > +
> > > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > > +It means that it treats memory cgroups as memory consumers
> > > > +rather then individual processes. Under the OOM conditions it tries
> > > > +to find an elegible leaf memory cgroup, and kill all processes
> > > > +in this cgroup. If it's not possible (e.g. all processes belong
> > > > +to the root cgroup), it falls back to the traditional per-process
> > > > +behaviour.
> > > 
> > > I agree that the current OOM victim selection algorithm is totally
> > > unfair in a system using containers and it has been crying for rework
> > > for the last few years now, so it's great to see this finally coming.
> > > 
> > > However, I don't reckon that killing a whole leaf cgroup is always the
> > > best practice. It does make sense when cgroups are used for
> > > containerizing services or applications, because a service is unlikely
> > > to remain operational after one of its processes is gone, but one can
> > > also use cgroups to containerize processes started by a user. Kicking a
> > > user out because one of her processes has gone mad doesn't sound right to me.
> > 
> > I agree, that it's not always a best practise, if you're not allowed
> > to change the cgroup configuration (e.g. create new cgroups).
> > IMHO, this case is mostly covered by using the v1 cgroup interface,
> > which remains unchanged.
> 
> But there are features which are v2 only and users might really want to
> use it. So I really do not buy this v2-only argument.

I have to agree here. We won't get around making the leaf killing
opt-in or opt-out in some fashion.

> > > Another example when the policy you're suggesting fails in my opinion is
> > > in case a service (cgroup) consists of sub-services (sub-cgroups) that
> > > run processes. The main service may stop working normally if one of its
> > > sub-services is killed. So it might make sense to kill not just an
> > > individual process or a leaf cgroup, but the whole main service with all
> > > its sub-services.
> > 
> > I agree, although I do not pretend for solving all possible
> > userspace problems caused by an OOM.
> > 
> > How to react on an OOM - is definitely a policy, which depends
> > on the workload. Nothing is changing here from how it's working now,
> > except now kernel will choose a victim cgroup, and kill the victim cgroup
> > rather than a process.
> 
> There is a _big_ difference. The current implementation just tries
> to recover from the OOM situation without caring much about the
> consequences on the workload. This is the last resort and a service for
> the _system_ to get back to a sane state. You are trying to make it more
> clever and workload aware and that is inevitably going to depend on the
> specific workload. I really do think we cannot simply hardcode any
> policy into the kernel for this purpose and that is why I would like to
> see a discussion about how to do that in a more extensible way. This
> might be harder to implement now but I believe it will turn out
> better longterm.

And that's where I still maintain that this isn't really a policy
change. Because what this code does ISN'T more clever, and the OOM
killer STILL IS a last-resort thing. We don't need any elaborate
just-in-time evaluation of what each entity is worth. We just want to
kill the biggest job, not the biggest MM. Just like you wouldn't want
just the biggest VMA unmapped and freed, since it leaves your process
incoherent, killing one process leaves a job incoherent.

I understand that making it fully configurable is a tempting thought,
because you'd offload all responsibility to userspace. But on the
other hand, this was brought up years ago and nothing has happened
since. And to me this is evidence that nobody really cares all that
much. Because it's still a rather rare event, and there isn't much you
cannot accomplish with periodic score adjustments.

> > > And both kinds of workloads (services/applications and individual
> > > processes run by users) can co-exist on the same host - consider the
> > > default systemd setup, for instance.
> > > 
> > > IMHO it would be better to give users a choice regarding what they
> > > really want for a particular cgroup in case of OOM - killing the whole
> > > cgroup or one of its descendants. For example, we could introduce a
> > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > kill the whole cgroup and all its processes and sub-cgroups.
> > 
> > The last thing we want to do, is to 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-18 Thread Johannes Weiner
On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> On Fri, May 19, 2017 at 3:30 AM, Michal Hocko  wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> >> Traditionally, the OOM killer is operating on a process level.
> >> Under oom conditions, it finds a process with the highest oom score
> >> and kills it.
> >>
> >> This behavior doesn't suit well the system with many running
> >> containers. There are two main issues:
> >>
> >> 1) There is no fairness between containers. A small container with
> >> a few large processes will be chosen over a large one with huge
> >> number of small processes.
> >>
> >> 2) Containers often do not expect that some random process inside
> >> will be killed. So, in general, a much safer behavior is
> >> to kill the whole cgroup. Traditionally, this was implemented
> >> in userspace, but doing it in the kernel has some advantages,
> >> especially in a case of a system-wide OOM.
> >>
> >> To address these issues, cgroup-aware OOM killer is introduced.
> >> Under OOM conditions, it looks for a memcg with highest oom score,
> >> and kills all processes inside.
> >>
> >> Memcg oom score is calculated as a size of active and inactive
> >> anon LRU lists, unevictable LRU list and swap size.
> >>
> >> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> >> the OOMing cgroup are considered.
> >
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf like
> > mechanisms. Could you explore that path?
> 
> I agree, I think it needs more thought. I wonder if the real issue is 
> something
> else. For example
> 
> 1. Did we overcommit a particular container too much?
> 2. Do we need something like https://lwn.net/Articles/604212/ to solve
> the problem?

The occasional OOM kill is an unavoidable reality on our systems (and
I bet on most deployments). If we tried not to overcommit, we'd waste
a *lot* of memory.

The problem is when OOM happens, we really want the biggest *job* to
get killed. Before cgroups, we assumed jobs were processes. But with
cgroups, the user is able to define a group of processes as a job, and
then an individual process is no longer a first-class memory consumer.

Without a patch like this, the OOM killer will compare the sizes of
the random subparticles that the jobs in the system are composed of
and kill the single biggest particle, leaving behind the incoherent
remains of one of the jobs. That doesn't make a whole lot of sense.

If you want to determine the most expensive car in a parking lot, you
can't go off and compare the price of one car's muffler with the door
handle of another, then point to a wind shield and yell "This is it!"

You need to compare the cars as a whole with each other.

> 3. We have oom notifiers now, could those be used (assuming you are interested
> in non memcg related OOM's affecting a container

Right now, we watch for OOM notifications and then have userspace kill
the rest of a job. That works - somewhat. What remains is the problem
that I described above, that comparing individual process sizes is not
meaningful when the terminal memory consumer is a cgroup.
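
One way to do that from userspace is the cgroup v1 memory.oom_control
eventfd notification. A rough sketch - error handling and full paths
omitted, kill_remaining_job_tasks() is a made-up cleanup helper:

	int efd = eventfd(0, 0);
	int ofd = open(".../memory/job0/memory.oom_control", O_RDONLY);
	int cfd = open(".../memory/job0/cgroup.event_control", O_WRONLY);
	char line[64];
	uint64_t events;

	snprintf(line, sizeof(line), "%d %d", efd, ofd);
	write(cfd, line, strlen(line));

	while (read(efd, &events, sizeof(events)) == sizeof(events))
		kill_remaining_job_tasks("job0");

The killing itself then comes down to comparing process sizes, which is
exactly the part that doesn't work well.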

> 4. How do we determine limits for these containers? From a fairness
> perspective

How do you mean?


Re: [PATCH] mm, docs: update memory.stat description with workingset* entries

2017-05-15 Thread Johannes Weiner
On Thu, May 11, 2017 at 08:18:13PM +0100, Roman Gushchin wrote:
> Commit 4b4cea91691d ("mm: vmscan: fix IO/refault regression in
> cache workingset transition") introduced three new entries in memory
> stat file:
>  - workingset_refault,
>  - workingset_activate,
>  - workingset_nodereclaim.
> 
> This commit adds a corresponding description to the cgroup v2 docs.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Li Zefan <lize...@huawei.com>
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org

Acked-by: Johannes Weiner <han...@cmpxchg.org>

Thanks Roman!


Re: [PATCH] mm: per-cgroup memory reclaim stats

2017-05-12 Thread Johannes Weiner
On Fri, May 12, 2017 at 12:25:22PM +1000, Balbir Singh wrote:
> On Thu, 2017-05-11 at 20:16 +0100, Roman Gushchin wrote:
> > The meaning of each value is the same as for global counters,
> > available using /proc/vmstat.
> > 
> > Also, for consistency, rename mem_cgroup_count_vm_event() to
> > count_memcg_event_mm().
> > 
> 
> I still prefer the mem_cgroup_count_vm_event() name, or 
> memcg_count_vm_event(),
> the namespace upfront makes it easier to parse where to look for the
> implementation and also grep. In any case the rename should be independent
> patch, but I don't like the name you've proposed.

The memory controller is no longer a tacked-on feature to the VM - the
entire reclaim path is designed around cgroups at this point. The
namespacing is just cumbersome and doesn't add add any value, IMO.

This name is also more consistent with the stats interface, where we
use nodes, zones, memcgs all equally to describe scopes/containers:

inc_node_state(), inc_zone_state(), inc_memcg_state()

> > @@ -357,6 +357,17 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
> >  }
> >  struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
> >  
> > +static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec)
> 
> mem_cgroup_from_lruvec()?

This name is consistent with other lruvec accessors such as
lruvec_pgdat() and lruvec_lru_size() etc.

> > @@ -1741,11 +1748,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  
> > spin_lock_irq(&pgdat->lru_lock);
> >  
> > -   if (global_reclaim(sc)) {
> > -   if (current_is_kswapd())
> > +   if (current_is_kswapd()) {
> > +   if (global_reclaim(sc))
> > __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
> > -   else
> > +   count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_KSWAPD,
> > +  nr_reclaimed);
> 
> Has the else gone missing? What happens if it's global_reclaim(), do
> we still account the count in memcg?
> 
> > +   } else {
> > +   if (global_reclaim(sc))
> > __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
> > +   count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_DIRECT,
> > +  nr_reclaimed);
> 
> It sounds like memcg accumulates both global and memcg reclaim driven
> counts -- is this what we want?

Yes.

Consider a fully containerized system that is using only memory.low
and thus exclusively global reclaim to enforce the partitioning, NOT
artificial limits and limit reclaim. In this case, we still want to
know how much reclaim activity each group is experiencing.


Re: [PATCH] Documentation: add watermark_scale_factor to the list of vm systcl file

2016-07-12 Thread Johannes Weiner
On Tue, Jul 12, 2016 at 12:05:59PM +0200, Jerome Marchand wrote:
> Commit 795ae7a0de6b ("mm: scale kswapd watermarks in proportion to
> memory") properly added the description of the new knob to
> Documentation/sysctl/vm.txt, but forgot to add it to the list of files
> in /proc/sys/vm. Let's fix that.
> 
> Signed-off-by: Jerome Marchand <jmarc...@redhat.com>

Thanks Jerome!

Acked-by: Johannes Weiner <han...@cmpxchg.org>


Re: [PATCH] mm: vmpressure: make vmpressure_window a tunable.

2016-02-04 Thread Johannes Weiner
On Thu, Feb 04, 2016 at 12:18:34PM +0100, Martijn Coenen wrote:
> I like this idea; I'm happy to come up with a window size and scaling
> factors that we think works well, and get your feedback on that. My
> only concern again would be that what works well for us may not work
> well for others.

Thanks for doing this. There is a good chance that this will work just
fine for others as well, so I think it's preferable to speculatively
change the implementation rather than add ABI for potentially no reason.


Re: [PATCH] mm: vmpressure: make vmpressure_window a tunable.

2016-02-03 Thread Johannes Weiner
On Wed, Feb 03, 2016 at 11:06:20AM +0100, Martijn Coenen wrote:
> The window size used for calculating vm pressure
> events was previously fixed at 512 pages. The
> window size has a big impact on the rate of notifications
> sent off to userspace, in particular when using the
> "low" level. On machines with a lot of memory, the
> current value may be excessive.
> 
> On the other hand, making the window size depend on
> machine size does not allow userspace to change the
> notification rate based on the current state of the
> system. For example, when a lot of memory is still
> available, userspace may want to increase the window
> since it's not interested in receiving notifications
> for every 2MB scanned.
>
> This patch makes vmpressure_window a sysctl tunable.

If the machine is just cleaning up use-once cache, frequent events
make no sense. And if the machine is struggling, the notifications
better be in time.

That's hardly a tunable. It's a factor that needs constant dynamic
adjustment depending on VM state. The same state this mechanism is
supposed to report. If we can't get this right, how will userspace?

A better approach here would be to 1) find a minimum window size that
makes us confident that there are no false positives - this is likely
to be based on machine size, maybe the low watermark? - and 2) limit
reporting of lower levels, so you're not flooded with ALLGOOD! events.

VMPRESSURE_CRITICAL: report every vmpressure_win
VMPRESSURE_MEDIUM: report every vmpressure_win*2
VMPRESSURE_LOW: report every vmpressure_win*4

Pick your favorite scaling factor here.
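
As a sketch, with vmpressure_win being the minimum window from 1) above:

	static unsigned long vmpressure_report_win(enum vmpressure_levels level)
	{
		switch (level) {
		case VMPRESSURE_CRITICAL:
			return vmpressure_win;
		case VMPRESSURE_MEDIUM:
			return vmpressure_win * 2;
		default:
			return vmpressure_win * 4;	/* VMPRESSURE_LOW */
		}
	}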