Re: [v8 0/4] cgroup-aware OOM killer

2017-09-18 Thread Michal Hocko
On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
> > On Thu 14-09-17 09:05:48, Roman Gushchin wrote:
> > > On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> > > > On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > > > > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > I strongly believe that comparing only leaf memcgs
> > > > > > is more straightforward and it doesn't lead to unexpected results as
> > > > > > mentioned before (kill a small memcg which is a part of the larger
> > > > > > sub-hierarchy).
> > > > > 
> > > > > One of two main goals of this patchset is to introduce cgroup-level
> > > > > fairness: bigger cgroups should be affected more than smaller,
> > > > > despite the size of tasks inside. I believe the same principle
> > > > > should be used for cgroups.
> > > > 
> > > > Yes bigger cgroups should be preferred but I fail to see why bigger
> > > > hierarchies should be considered as well if they are not kill-all. And
> > > > whether non-leaf memcgs should allow kill-all is not entirely clear to
> > > > me. What would be the usecase?
> > > 
> > > We definitely want to support kill-all for non-leaf cgroups.
> > > A workload can consist of several cgroups and we want to clean up
> > > the whole thing on OOM.
> > 
> > Could you be more specific about such a workload? E.g. how can such a
> > hierarchy be handled consistently when its sub-tree gets killed due to
> > internal memory pressure?
> 
> Or just system-wide OOM.
> 
> > Or do you expect that none of the subtree will
> > have a hard limit configured?
> 
> And this can also be the case: the whole workload may have a hard limit
> configured, while internal memcgs have only memory.low set for "soft"
> prioritization.
> 
> > 
> > But then you just enforce a structural restriction on your configuration
> > because
> >         root
> >        /    \
> >       A      D
> >      / \
> >     B   C
> > 
> > is a different thing than
> >         root
> >        / | \
> >       B  C  D
> >
> 
> I actually don't have a strong argument against an approach that selects
> the largest leaf or kill-all-set memcg. I think in practice there will
> be little difference.

Well, I am worried that the difference will come unexpectedly when a
deeper hierarchy is needed for structural reasons.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-15 Thread Michal Hocko
On Thu 14-09-17 09:05:48, Roman Gushchin wrote:
> On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> > On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> > [...]
> > > > I strongly believe that comparing only leaf memcgs
> > > > is more straightforward and it doesn't lead to unexpected results as
> > > > mentioned before (kill a small memcg which is a part of the larger
> > > > sub-hierarchy).
> > > 
> > > One of two main goals of this patchset is to introduce cgroup-level
> > > fairness: bigger cgroups should be affected more than smaller,
> > > despite the size of tasks inside. I believe the same principle
> > > should be used for cgroups.
> > 
> > Yes bigger cgroups should be preferred but I fail to see why bigger
> > hierarchies should be considered as well if they are not kill-all. And
> > whether non-leaf memcgs should allow kill-all is not entirely clear to
> > me. What would be the usecase?
> 
> We definitely want to support kill-all for non-leaf cgroups.
> A workload can consist of several cgroups and we want to clean up
> the whole thing on OOM.

Could you be more specific about such a workload? E.g. how can such a
hierarchy be handled consistently when its sub-tree gets killed due to
internal memory pressure? Or do you expect that none of the subtree will
have a hard limit configured?

> I don't see any reasons to limit this functionality to leaf cgroups
> only.

Well, I wanted to start simple first and extend on top. Memcg v1 is full
of seemingly interesting and very generic concepts which turned out
to be a headache long term.

> Hierarchies are memory consumers: we account their usage, and we
> apply limits and guarantees to hierarchies. The same goes for
> OOM victim selection: we are reclaiming memory from the
> biggest consumer. The kill-all knob only defines the way _how_ we do
> that: by killing one or all processes.

But then you just enforce a structural restriction on your configuration
because
        root
       /    \
      A      D
     / \
    B   C

is a different thing than
        root
       / | \
      B  C  D

And consider that the sole purpose of A might be a control over
a non-memory resource (e.g. a cpu share distribution). Why should we
discriminate B and C in such a case?

> Just for example, we might want to take memory.low into account at
> some point: prefer cgroups which are above their guarantees, avoid
> killing those who fit. It would be hard if we're comparing cgroups
> from different hierarchies.

This can be reflected in the memcg oom score calculation I believe. We
already do something similar during the reclaim.

> The same will be true for introducing oom_priorities, which is much
> more needed functionality.

More on that below.

> > Consider that it might not be your choice (as a user) how deep your
> > leaf memcg is. I can already see people complaining that their memcg
> > has been killed just because it was one level deeper in the hierarchy...
> 
> The kill-all functionality is enforced by the parent, and it seems to
> follow the overall memcg design. The parent cgroup enforces the memory
> limit, the memory.low limit, etc.

And the same is true for the memcg oom killer. It enforces the selection
to the out-of-memory subtree. We are trying to be proportional to the
size of the _reclaimable_ memory in that scope, the same way as with the
LRU reclaim. We do not prefer larger hierarchies over smaller ones. We
just iterate over those that have pages on LRUs (leaf memcgs with v2) and
scan/reclaim proportionally to their size. Why should the oom killer
decision be any different in that regard?

> I don't know why OOM control should be different.

I am not arguing that kill-all functionality on non-leaf memcgs is wrong.
I just haven't heard the usecase for it yet. I am also not opposed to
considering the cumulative size of a non-leaf memcg if it is kill-all, as
the cumulative size will be reclaimed then. But I fail to see why we
should prefer larger hierarchies when the resulting memcg victim is much
smaller in the end.

> > I would really start simple and only allow kill-all on leaf memcgs and
> > only compare leaf memcgs & root. If we ever need to kill whole
> > hierarchies then allow kill-all on intermediate memcgs as well and then
> > consider cumulative consumptions only on those that have kill-all
> > enabled.
> 
> This sounds hacky to me: the whole thing already depends on cgroup v2
> and is additionally explicitly opt-in.
> 
> Why do we need to introduce such incomplete functionality first,
> and then suffer trying to extend it and provide backward compatibility?

Why would backward compatibility be a problem?

Re: [v8 1/4] mm, oom: refactor the oom_kill_process() function

2017-09-14 Thread Michal Hocko
On Mon 11-09-17 14:17:39, Roman Gushchin wrote:
> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> an intention to re-use the second half: __oom_kill_process().
> 
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for each task, or to play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: Michal Hocko <mho...@suse.com>

> ---
>  mm/oom_kill.c | 123 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 65 insertions(+), 58 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 99736e026712..f061b627092c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -804,68 +804,12 @@ static bool task_will_free_mem(struct task_struct *task)
>   return ret;
>  }
>  
> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)
>  {
> - struct task_struct *p = oc->chosen;
> - unsigned int points = oc->chosen_points;
> - struct task_struct *victim = p;
> - struct task_struct *child;
> - struct task_struct *t;
> + struct task_struct *p;
>   struct mm_struct *mm;
> - unsigned int victim_points = 0;
> - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> -   DEFAULT_RATELIMIT_BURST);
>   bool can_oom_reap = true;
>  
> - /*
> -  * If the task is already exiting, don't alarm the sysadmin or kill
> -  * its children or threads, just give it access to memory reserves
> -  * so it can die quickly
> -  */
> - task_lock(p);
> - if (task_will_free_mem(p)) {
> - mark_oom_victim(p);
> - wake_oom_reaper(p);
> - task_unlock(p);
> - put_task_struct(p);
> - return;
> - }
> - task_unlock(p);
> -
> - if (__ratelimit(&oom_rs))
> - dump_header(oc, p);
> -
> - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> - message, task_pid_nr(p), p->comm, points);
> -
> - /*
> -  * If any of p's children has a different mm and is eligible for kill,
> -  * the one with the highest oom_badness() score is sacrificed for its
> -  * parent.  This attempts to lose the minimal amount of work done while
> -  * still freeing memory.
> -  */
> - read_lock(&tasklist_lock);
> - for_each_thread(p, t) {
> - list_for_each_entry(child, &t->children, sibling) {
> - unsigned int child_points;
> -
> - if (process_shares_mm(child, p->mm))
> - continue;
> - /*
> -  * oom_badness() returns 0 if the thread is unkillable
> -  */
> - child_points = oom_badness(child,
> - oc->memcg, oc->nodemask, oc->totalpages);
> - if (child_points > victim_points) {
> - put_task_struct(victim);
> - victim = child;
> - victim_points = child_points;
> - get_task_struct(victim);
> - }
> - }
> - }
> - read_unlock(&tasklist_lock);
> -
>   p = find_lock_task_mm(victim);
>   if (!p) {
>   put_task_struct(victim);
> @@ -939,6 +883,69 @@ static void oom_kill_process(struct oom_control *oc, 
> const char *message)
>  }
>  #undef K
>  
> +static void oom_kill_process(struct oom_control *oc, const char *message)
> +{
> + struct task_struct *p = oc->chosen;
>

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-14 Thread Michal Hocko
On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
[...]
> > I strongly believe that comparing only leaf memcgs
> > is more straightforward and it doesn't lead to unexpected results as
> > mentioned before (kill a small memcg which is a part of the larger
> > sub-hierarchy).
> 
> One of two main goals of this patchset is to introduce cgroup-level
> fairness: bigger cgroups should be affected more than smaller,
> despite the size of tasks inside. I believe the same principle
> should be used for cgroups.

Yes bigger cgroups should be preferred but I fail to see why bigger
hierarchies should be considered as well if they are not kill-all. And
whether non-leaf memcgs should allow kill-all is not entirely clear to
me. What would be the usecase?
Consider that it might not be your choice (as a user) how deep your
leaf memcg is. I can already see people complaining that their memcg has
been killed just because it was one level deeper in the hierarchy...

I would really start simple and only allow kill-all on leaf memcgs and
only compare leaf memcgs & root. If we ever need to kill whole
hierarchies then allow kill-all on intermediate memcgs as well and then
consider cumulative consumptions only on those that have kill-all
enabled.

Or do I miss any reasonable usecase that would suffer from such a
semantic?
-- 
Michal Hocko
SUSE Labs


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-14 Thread Michal Hocko
On Wed 13-09-17 13:46:08, David Rientjes wrote:
> On Wed, 13 Sep 2017, Michal Hocko wrote:
> 
> > > > This patchset makes the OOM killer cgroup-aware.
> > > > 
> > > > v8:
> > > >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> > > >   - Make the whole thing opt-in with cgroup mount option control
> > > >   - Drop oom_priority for further discussions
> > > 
> > > Nack, we specifically require oom_priority for this to function 
> > > correctly, 
> > > otherwise we cannot prefer to kill from low priority leaf memcgs as 
> > > required.
> > 
> > While I understand that your usecase might require priorities I do not
> > think this part missing is a reason to nack the cgroup based selection
> > and kill-all parts. This can be done on top. The only important part
> > right now is the current selection semantic (only leaf memcgs vs. the
> > size of the hierarchy). I strongly believe that comparing only leaf memcgs
> > is more straightforward and it doesn't lead to unexpected results as
> > mentioned before (kill a small memcg which is a part of the larger
> > sub-hierarchy).
> > 
> 
> The problem is that we cannot enable the cgroup-aware oom killer and 
> oom_group behavior because, without oom priorities, we have no ability to 
> influence the cgroup that it chooses.  It is doing two things: providing 
> more fairness amongst cgroups by selecting based on cumulative usage
> rather than a single large process (good!), and is effectively removing all
> userspace control of oom selection (bad).  We want the former, but it 
> needs to be coupled with support so that we can protect vital cgroups, 
> regardless of their usage.

I understand that your usecase needs a more fine grained control over
the selection but that alone is not a reason to nack the implementation
which doesn't provide it (yet).

> It is certainly possible to add oom priorities on top before it is merged, 
> but I don't see why it isn't part of the patchset.

Because the semantic of the priority for non-leaf memcgs is not fully
clear and I would rather have the core of the functionality merged
before this is sorted out.

> We need it before it's
> merged to avoid users playing with /proc/pid/oom_score_adj to prevent any
> killing in the most preferable memcg when they could have simply changed
> the oom priority.

I am sorry but I do not really understand your concern. Are you
suggesting that users would start OOM-disabling all tasks in a memcg to
give it a higher priority? Even if that were the case, why should such an
abuse be a blocker for a generic memcg-aware oom killer being merged?
-- 
Michal Hocko
SUSE Labs


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-13 Thread Michal Hocko
On Mon 11-09-17 13:44:39, David Rientjes wrote:
> On Mon, 11 Sep 2017, Roman Gushchin wrote:
> 
> > This patchset makes the OOM killer cgroup-aware.
> > 
> > v8:
> >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> >   - Make the whole thing opt-in with cgroup mount option control
> >   - Drop oom_priority for further discussions
> 
> Nack, we specifically require oom_priority for this to function correctly, 
> otherwise we cannot prefer to kill from low priority leaf memcgs as 
> required.

While I understand that your usecase might require priorities I do not
think this part missing is a reason to nack the cgroup based selection
and kill-all parts. This can be done on top. The only important part
right now is the current selection semantic (only leaf memcgs vs. the
size of the hierarchy). I strongly believe that comparing only leaf memcgs
is more straightforward and it doesn't lead to unexpected results as
mentioned before (kill a small memcg which is a part of the larger
sub-hierarchy).

I didn't get to read the new version of this series yet and hope to get
to it soon.
-- 
Michal Hocko
SUSE Labs


Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-09-13 Thread Michal Hocko
On Tue 12-09-17 21:01:15, Roman Gushchin wrote:
> On Mon, Sep 11, 2017 at 01:48:39PM -0700, David Rientjes wrote:
> > On Mon, 11 Sep 2017, Roman Gushchin wrote:
> > 
> > > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > > OOM killer. If not set, the OOM selection is performed in
> > > a "traditional" per-process way.
> > > 
> > > The behavior can be changed dynamically by remounting the cgroupfs.
> > 
> > I can't imagine that Tejun would be happy with a new mount option, 
> > especially when it's not required.
> > 
> > OOM behavior does not need to be defined at mount time and for the entire 
> > hierarchy.  It's possible to very easily implement a tunable as part of 
> > mem cgroup that is propagated to descendants and controls the oom scoring 
> > behavior for that hierarchy.  It does not need to be system wide and 
> > affect scoring of all processes based on which mem cgroup they are 
> > attached to at any given time.
> 
> No, I don't think that mixing per-cgroup and per-process OOM selection
> algorithms is a good idea.
> 
> So, there are 3 reasonable options:
> 1) boot option
> 2) sysctl
> 3) cgroup mount option
> 
> I believe 3) is better, because it allows changing the behavior
> dynamically, and explicitly depends on v2 (which a sysctl lacks).

I see your argument here. I would just be worried that we end up really
needing more oom strategies in the future and those wouldn't fit into the
memcg mount option scope. So 1) or 2) sounds more extensible to me long
term. A boot option would be easier because we would not have to bother
with dynamic selection in that case.

> So, the only question is should it be opt-in or opt-out option.
> Personally, I would prefer opt-out, but Michal has a very strong opinion here.

Yes I still strongly believe this has to be opt-in.
-- 
Michal Hocko
SUSE Labs


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-11 Thread Michal Hocko
On Thu 07-09-17 12:14:57, Johannes Weiner wrote:
> On Wed, Sep 06, 2017 at 10:28:59AM +0200, Michal Hocko wrote:
> > On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> > > The cgroup-awareness in the OOM killer is exactly the same thing. It
> > > should have been the default from the beginning, because the user
> > > configures a group of tasks to be an interdependent, terminal unit of
> > > memory consumption, and it's undesirable for the OOM killer to ignore
> > > this intention and compare members across these boundaries.
> > 
> > I would agree if that was true in general. I can completely see how the
> > cgroup awareness is useful in e.g. containerized environments (especially
> > with kill-all enabled) but memcgs are used in a large variety of
> > usecases and I cannot really say all of them really demand the new
> > semantic. Say I have a workload which doesn't want to see reclaim
> > interference from others on the same machine. Why should I kill a
> > process from that particular memcg just because it is the largest one
> > when there is a memory hog/leak outside of this memcg?
> 
> Sure, it's always possible to come up with a config for which this
> isn't the optimal behavior. But this is about picking a default that
> makes sense to most users, and that type of cgroup usage just isn't
> the common case.

How can you tell, really? Even if cgroup2 is a new interface we still
want as many legacy (v1) users as possible to migrate to the new
hierarchy. I have seen quite different usecases over time and I have a
hard time telling which of them are common enough.

> > From my point of view the safest (in the sense of least surprise)
> > way is to go with opt-in for the new heuristic. I am pretty sure all who
> > would benefit from the new behavior will enable it while others will not
> > regress in an unexpected way.
> 
> This thinking simply needs to be balanced against the need to make an
> unsurprising and consistent final interface.

Sure. And I _think_ we can come up with a clear interface to configure
the oom behavior - e.g. a kernel command line parameter with a default
based on a config option.
 
> The current behavior breaks isolation by letting tasks in different
> cgroups compete with each other during an OOM kill. While you can
> rightfully argue that it's possible for usecases to rely on this, you
> cannot tell me that this is the least-surprising thing we can offer
> users; certainly not new users, but also not many/most existing ones.

I would argue that a global OOM has always been a special case and
people have got used to the "kill the largest task" strategy. I have seen
multiple reports where people complained when this wasn't the case
(e.g. when NUMA policies were involved).

> > We can talk about the way _how_ to control these oom strategies, of
> > course. But I would be really reluctant to change the default which has
> > been used for years and which people have got used to.
> 
> I really doubt there are many cgroup users that rely on that
> particular global OOM behavior.
> 
> We have to agree to disagree, I guess.

Yes, I am afraid so. And I have not heard that this is a feature so many
users have been asking for over a long time that we could simply say
"yeah, everybody wants that, make it the default". As such I do not see a
reason why we should enforce it on all users. It is really trivial to
enable it when it is considered useful.
-- 
Michal Hocko
SUSE Labs


Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

2017-09-11 Thread Michal Hocko
On Thu 07-09-17 11:18:18, Cristopher Lameter wrote:
> On Mon, 4 Sep 2017, Roman Gushchin wrote:
> 
> > To address these issues, cgroup-aware OOM killer is introduced.
> 
> You are missing a major issue here. Processes may have allocation
> constraints to memory nodes, special DMA zones etc etc. OOM conditions on
> such resource constricted allocations need to be dealt with. Killing
> processes that do not allocate with the same restrictions may not do
> anything to improve conditions.

memcg_oom_badness tries to be node aware - very similar to what the
oom_badness does for the regular oom killer. Or do you have anything
else in mind?
-- 
Michal Hocko
SUSE Labs


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Wed 06-09-17 18:40:43, Roman Gushchin wrote:
[...]
> From f6e2339926a07500834d86548f3f116af7335d71 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <g...@fb.com>
> Date: Wed, 6 Sep 2017 17:43:44 +0100
> Subject: [PATCH] mm, oom: first step towards oom_kill_allocating_task
>  deprecation
> 
> The oom_kill_allocating_task sysctl, which causes the OOM killer
> to simply kill the allocating task, is useless. Killing a random

useless is quite strong ;) I would say dubious.

> task is not the best idea.
> 
> Nobody likes it, and hopefully nobody uses it.
> We want to completely deprecate it at some point.
> 
> To make a first step towards deprecation, let's warn potential
> users about deprecation plans.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: David Rientjes <rient...@google.com>
> Cc: Vladimir Davydov <vdavydov....@gmail.com>
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org

Other than that
Acked-by: Michal Hocko <mho...@suse.com>

> ---
>  kernel/sysctl.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 655686d546cb..9158f1980584 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -220,6 +220,17 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
> int write,
>  
>  #endif
>  
> +static int proc_oom_kill_allocating_tasks(struct ctl_table *table, int write,
> +void __user *buffer, size_t *lenp,
> +loff_t *ppos)
> +{
> + pr_warn_once("The oom_kill_allocating_task sysctl will be deprecated.\n"
> +  "If you're using it, please, report to "
> +  "linux...@kvack.kernel.org.\n");
> +
> + return proc_dointvec(table, write, buffer, lenp, ppos);
> +}
> +
>  static struct ctl_table kern_table[];
>  static struct ctl_table vm_table[];
>  static struct ctl_table fs_table[];
> @@ -1235,7 +1246,7 @@ static struct ctl_table vm_table[] = {
>   .data   = &sysctl_oom_kill_allocating_task,
>   .maxlen = sizeof(sysctl_oom_kill_allocating_task),
>       .mode   = 0644,
> - .proc_handler   = proc_dointvec,
> + .proc_handler   = proc_oom_kill_allocating_tasks,
>   },
>   {
>   .procname   = "oom_dump_tasks",
> -- 
> 2.13.5

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Wed 06-09-17 14:41:42, Roman Gushchin wrote:
[...]
> Although, I don't think the whole thing is useful without any way
> to adjust the memcg selection, so we can't postpone it for too long.
> Anyway, if you think it's a way to go forward, let's do it.

I am not really sure we are in a rush here. The whole oom_score_adj
fiasco has shown that most users tend to only care about "never kill this
and that". A finer-tuned oom control sounds useful at first, but apart
from very special usecases it turns out to be very impractical to set
up. At least that is my experience. There are special cases of course
but we should target general use first.

Killing the whole memcg is a really useful feature on its own for proper
container cleanup.
-- 
Michal Hocko
SUSE Labs


Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Wed 06-09-17 13:57:50, Roman Gushchin wrote:
> On Wed, Sep 06, 2017 at 10:31:58AM +0200, Michal Hocko wrote:
> > On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> > > On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
> > [...]
> > > > Hmm. The changelog says "By default, it will look for the biggest leaf
> > > > cgroup, and kill the largest task inside." But you are accumulating
> > > > oom_score up the hierarchy and so parents will have higher score than
> > > > the layer of their children and the larger the sub-hierarchy the more
> > > > biased it will become. Say you have
> > > >         root
> > > >        /    \
> > > >       A      D
> > > >      / \
> > > >     B   C
> > > > 
> > > > B(5), C(15), thus A(20) and D(20). Unless I am missing something we
> > > > are going to go down the A path and then choose C even though D is
> > > > the largest leaf group, right?
> > > 
> > > You're right, the changelog is not accurate, I'll fix it.
> > > The behavior is correct, IMO.
> > 
> > Please explain why. This is really a non-intuitive semantic. Why should
> > larger hierarchies be punished more than shallow ones? I would
> > completely agree if the whole hierarchy would be a killable entity (aka
> > A would be kill-all).
> 
> I think it's a reasonable and clear policy: we're looking for a memcg
> with the smallest oom_priority and largest memory footprint recursively.

But this can get really complex for non-trivial setups. Anything with
deeper and larger hierarchies will get quite complex IMHO.

Btw. do you have any specific usecase for the priority based oom
killer? I remember David was asking for this because it _would_ be
useful, but you didn't have it initially. And I agree with that; I am
just not sure the semantic is thought through very well. I am thinking
whether it would be easier to push this forward without the priority
thing for now and add it later with a clear example of the configuration,
how it should work, and a more thought-through semantic. Would that sound
acceptable? I believe the rest is quite useful to get merged on its own.

> Then we reclaim some memory from it (by killing the biggest process
> or all processes, depending on memcg preferences).
> 
> In general, if there are two memcgs of equal importance (which is defined
> by setting the oom_priority), we're choosing the largest, because there
> is a better chance that it contains a leaking process. The same is true
> right now for processes.

Yes except this is not the case as shown above. We can easily select a
smaller leaf memcg just because it is in a larger hierarchy and that
sounds very dubious to me. Even when all the priorities are the same.

> I agree that for size-based comparison we could use a different policy:
> comparing leaf cgroups regardless of their level. But I don't see a
> clever way to apply oom_priorities in this case. Comparing oom_priority
> on each level is a simple and powerful policy, and it works well
> for delegation.

You are already shaping the semantic around the implementation and that
is a clear sign of a problem.
 
> > [...]
> > > > I do not understand why do we have to handle root cgroup specially here.
> > > > select_victim_memcg already iterates all memcgs in the oom hierarchy
> > > > (including root) so if the root memcg is the largest one then we
> > > > should simply consider it no?
> > > 
> > > We don't have the necessary stats for the root cgroup, so we can't
> > > calculate its oom_score.
> > 
> > We used to charge pages to the root memcg as well so we might resurrect
> > that idea. In any case this is something that could be hidden in
> > memcg_oom_badness rather than special-cased somewhere else.
> 
> In theory I agree, but I do not see a good way to calculate root memcg
> oom_score.

Why can't you emulate that with the largest task in the root? The same
way you actually do in select_victim_root_cgroup_task now?
-- 
Michal Hocko
SUSE Labs


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Tue 05-09-17 20:16:09, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 05:12:51PM +0200, Michal Hocko wrote:
[...]
> > > Then we should probably hide corresponding
> > > cgroup interface (oom_group and oom_priority knobs) by default,
> > > and it feels as unnecessary complication and is overall against
> > > cgroup v2 interface design.
> > 
> > Why. If we care enough, we could simply return EINVAL when those knobs
> > are written while the corresponding strategy is not used.
> 
> It doesn't look like a nice default interface.

I do not have a strong opinion on this. A printk_once could explain why
the knob is ignored and instruct the admin how to enable the feature
completely.
 
> > > > I think we should instead go with
> > > > oom_strategy=[alloc_task,biggest_task,cgroup]
> > > 
> > > It would be a really nice interface, although I've no idea how to implement it:
> > > "alloc_task" is an existing sysctl, which we have to preserve;
> > 
> > I would argue that we should simply deprecate and later drop the sysctl.
> > I _strongly_ suspect nobody is using it. If somebody is, it is not that hard
> > to change the kernel command line rather than select the sysctl.
> 
> I agree. And if so, why do we need a new interface for a useless feature?

Well, I won't be opposed to just deprecating the sysctl and only adding a
"real" kill-allocate strategy if somebody explicitly asks for it.
-- 
Michal Hocko
SUSE Labs


Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
[...]
> > Hmm. The changelog says "By default, it will look for the biggest leaf
> > cgroup, and kill the largest task inside." But you are accumulating
> > oom_score up the hierarchy and so parents will have higher score than
> > the layer of their children and the larger the sub-hierarchy the more
> > biased it will become. Say you have
> >    root
> >    /  \
> >   A    D
> >  / \
> > B   C
> > 
> > B (5), C(15) thus A(20) and D(20). Unless I am missing something we are
> > going to go down the A path and then choose C even though D is the largest
> > leaf group, right?
> 
> You're right, changelog is not accurate, I'll fix it.
> The behavior is correct, IMO.

Please explain why. This is really a non-intuitive semantic. Why should
larger hierarchies be punished more than shallow ones? I would
completely agree if the whole hierarchy were a killable entity (aka
A would be kill-all).
 
[...]
> > I do not understand why we have to handle the root cgroup specially here.
> > select_victim_memcg already iterates all memcgs in the oom hierarchy
> > (including root) so if the root memcg is the largest one then we
> > should simply consider it no?
> 
> We don't have the necessary stats for the root cgroup, so we can't calculate
> its oom_score.

We used to charge pages to the root memcg as well so we might resurrect
that idea. In any case this is something that could be hidden in
memcg_oom_badness rather than special cased somewhere else.

> > You are skipping root there because of
> > memcg_has_children but I suspect this and the whole accumulate up the
> > hierarchy approach just makes the whole thing more complex than necessary. 
> > With
> > "tasks only in leafs" cgroup policy we should only see any pages on LRUs
> > on the global root memcg and leaf cgroups. The same applies to memcg
> > stats. So why cannot we simply do the tree walk, calculate
> > badness/check the priority and select the largest memcg in one go?
> 
> We have to traverse from top to bottom to make a priority-based decision,
> but the size-based oom_score is calculated as the sum of descendant leaf
> cgroup scores.
> 
> For example:
>    root
>    /  \
>   A    D
>  / \
> B   C
> A and D have same priorities, B has larger priority than C.
> 
> In this case we need to calculate the size-based score for A, which requires
> summing the oom_scores of the sub-tree (B and C), even though we don't need
> it for choosing between B and C.
> 
> Maybe I don't see it, but I don't know how to implement it more optimally.

I have to think about the priority based oom killing some more to be
honest. Do we really want to allow setting a priority to non-leaf
memcgs? How are you going to manage consistency across the whole tree? Say
your above example has prio(A) < prio(D) && prio(C) > prio(D). Your current
implementation would kill D, right? Isn't that counter-intuitive
behavior again? If anything, prio(A) should be max(tree_prio(A)). Again,
I could understand comparing priorities only on killable entities.
-- 
Michal Hocko
SUSE Labs


Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
[...]
> > > @@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct *victim)
> > >   struct mm_struct *mm;
> > >   bool can_oom_reap = true;
> > >  
> > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
> > > + return;
> > > +
> > 
> > This will leak a reference to the victim AFACS
> 
> Good catch!
> I didn't fix this after moving reference dropping into __oom_kill_process().
> Fixed.

Btw. didn't you want to check
victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

here as well? Maybe I've missed something, but you can still kill a task
which is OOM-disabled, which I thought we agreed is the wrong thing to
do.
-- 
Michal Hocko
SUSE Labs


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with 
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> > 
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about it the more I agree that it doesn't really make sense to try to
> > fit the new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make life easier and the result easier to understand by
> > users.
> 
> oom_kill_allocating_task is actually a really good example of why
> cgroup-awareness *should* be the new default.
> 
> Before we had the oom killer victim selection, we simply killed the
> faulting/allocating task. While a valid answer to the problem, it's
> not very fair or representative of what the user wants or intends.
> 
> Then we added code to kill the biggest offender instead, which should
> have been the case from the start and was hence made the new default.
> The oom_kill_allocating_task was added on the off-chance that there
> might be setups who, for historical reasons, rely on the old behavior.
> But our default was chosen based on what behavior is fair, expected,
> and most reflective of the user's intentions.

I am not sure this is how things evolved actually. This is way before
my time so my git log interpretation might be imprecise. We have had the
oom_badness heuristic since out_of_memory was introduced, and
oom_kill_allocating_task was added much later because of large boxes
with zillions of tasks (SGI, I suspect) where victim selection took too
long, so David added this heuristic.
 
> The cgroup-awareness in the OOM killer is exactly the same thing. It
> should have been the default from the beginning, because the user
> configures a group of tasks to be an interdependent, terminal unit of
> memory consumption, and it's undesirable for the OOM killer to ignore
> this intention and compare members across these boundaries.

I would agree if that was true in general. I can completely see how the
cgroup awareness is useful in e.g. containerized environments (especially
with kill-all enabled) but memcgs are used in a large variety of
usecases and I cannot really say all of them really demand the new
semantic. Say I have a workload which doesn't want to see reclaim
interference from others on the same machine. Why should I kill a
process from that particular memcg just because it is the largest one
when there is a memory hog/leak outside of this memcg?

From my point of view the safest (in the sense of least surprise) way
is to go with an opt-in for the new heuristic. I am pretty sure all who
would benefit from the new behavior will enable it while others will not
regress in unexpected ways.

We can talk about the way _how_ to control these oom strategies, of
course. But I would be really reluctant to change the default which is
used for years and people got used to it.
-- 
Michal Hocko
SUSE Labs


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-05 Thread Michal Hocko
On Tue 05-09-17 15:30:21, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
[...]
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with 
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> > 
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about it the more I agree that it doesn't really make sense to try to
> > fit the new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make life easier and the result easier to understand by
> > users.
> > 
> > Does that make sense to you?
> 
> Absolutely.
> 
> The only thing: I'm not sure that we have to preserve the existing logic
> as the default option. For most users (except a few very specific usecases),
> it should be at least as good as the existing one.

But this is really an unexpected change. Users might not even know that
they are using cgroup v2 and that memcg is in use.

> Making it opt-in means that the corresponding code will be executed only
> by the few users who care.

Yeah, which is the way we should introduce new features, no?

> Then we should probably hide corresponding
> cgroup interface (oom_group and oom_priority knobs) by default,
> and it feels as unnecessary complication and is overall against
> cgroup v2 interface design.

Why. If we care enough, we could simply return EINVAL when those knobs
are written while the corresponding strategy is not used.

> > I think we should instead go with
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> 
> It would be a really nice interface, although I've no idea how to implement it:
> "alloc_task" is an existing sysctl, which we have to preserve;

I would argue that we should simply deprecate and later drop the sysctl.
I _strongly_ suspect nobody is using it. If somebody is, it is not that hard
to change the kernel command line rather than select the sysctl. The
deprecation process would be
- warn when somebody writes to the sysctl and check both boot
  and sysctl values
[ wait some time ]
- keep the sysctl but return EINVAL
[ wait some time ]
- remove the sysctl

> while "cgroup" depends on cgroup v2.

Which is not a big deal either. Simply fall back to the default if cgroup v2
is not in use. The implementation would have essentially the same effect
because there won't be any kill-all cgroups and so we will select the
largest task.
-- 
Michal Hocko
SUSE Labs


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-05 Thread Michal Hocko
I will go and check patch 2 more deeply but this is something that I
wanted to sort out first.

On Mon 04-09-17 15:21:08, Roman Gushchin wrote:
> Introducing of cgroup-aware OOM killer changes the victim selection
> algorithm used by default: instead of picking the largest process,
> it will pick the largest memcg and then the largest process inside.
> 
> This affects only cgroup v2 users.
> 
> To provide a way to use cgroups v2 if the old OOM victim selection
> algorithm is preferred for some reason, the nogroupoom mount option
> is added.
> 
> If set, the OOM selection is performed in a "traditional" per-process
> way. Both oom_priority and oom_group memcg knobs are ignored.

Why is this an opt out rather than opt-in? IMHO the original oom logic
should be preserved by default and specific workloads should opt in for
the cgroup aware logic. Changing the global behavior depending on
whether cgroup v2 interface is in use is more than unexpected and IMHO
wrong approach to take. I think we should instead go with 
oom_strategy=[alloc_task,biggest_task,cgroup]

we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
biggest_task which is the default. You are adding cgroup and the more I
think about it the more I agree that it doesn't really make sense to try to
fit the new semantic into the existing one (compare tasks to kill-all
memcgs). Just introduce a new strategy and define a new semantic from
scratch. Memcg priority and kill-all are a natural extension of this new
strategy. This will make life easier and the result easier to understand by
users.

Does that make sense to you?

> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 1 +
>  mm/memcontrol.c | 8 ++++++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 28f1a0f84456..07891f1030aa 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -489,6 +489,7 @@
>   Format: 
>   nosocket -- Disable socket memory accounting.
>   nokmem -- Disable kernel memory accounting.
> + nogroupoom -- Disable cgroup-aware OOM killer.
>  
>   checkreqprot[SELINUX] Set initial checkreqprot flag value.
>   Format: { "0" | "1" }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7dd293897ca..6a8235dc41f6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -87,6 +87,9 @@ static bool cgroup_memory_nosocket;
>  /* Kernel memory accounting disabled? */
>  static bool cgroup_memory_nokmem;
>  
> +/* Cgroup-aware OOM  disabled? */
> +static bool cgroup_memory_nogroupoom;
> +
>  /* Whether the swap controller is active */
>  #ifdef CONFIG_MEMCG_SWAP
>  int do_swap_account __read_mostly;
> @@ -2822,6 +2825,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
>   if (mem_cgroup_disabled())
>   return false;
>  
> + if (cgroup_memory_nogroupoom)
> + return false;
> +
>   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>   return false;
>  
> @@ -6188,6 +6194,8 @@ static int __init cgroup_memory(char *s)
>   cgroup_memory_nosocket = true;
>   if (!strcmp(token, "nokmem"))
>   cgroup_memory_nokmem = true;
> + if (!strcmp(token, "nogroupoom"))
> + cgroup_memory_nogroupoom = true;
>   }
>   return 0;
>  }
> -- 
> 2.13.5

-- 
Michal Hocko
SUSE Labs


Re: [v7 1/5] mm, oom: refactor the oom_kill_process() function

2017-09-05 Thread Michal Hocko
On Mon 04-09-17 15:21:04, Roman Gushchin wrote:
> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> the intention of re-using the second half: __oom_kill_process().
> 
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for each task, or to play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Looks good to me
Acked-by: Michal Hocko <mho...@suse.com>

> ---
>  mm/oom_kill.c | 123 
> +++---
>  1 file changed, 65 insertions(+), 58 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 99736e026712..f061b627092c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -804,68 +804,12 @@ static bool task_will_free_mem(struct task_struct *task)
>   return ret;
>  }
>  
> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)
>  {
> - struct task_struct *p = oc->chosen;
> - unsigned int points = oc->chosen_points;
> - struct task_struct *victim = p;
> - struct task_struct *child;
> - struct task_struct *t;
> + struct task_struct *p;
>   struct mm_struct *mm;
> - unsigned int victim_points = 0;
> - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> -   DEFAULT_RATELIMIT_BURST);
>   bool can_oom_reap = true;
>  
> - /*
> -  * If the task is already exiting, don't alarm the sysadmin or kill
> -  * its children or threads, just give it access to memory reserves
> -  * so it can die quickly
> -  */
> - task_lock(p);
> - if (task_will_free_mem(p)) {
> - mark_oom_victim(p);
> - wake_oom_reaper(p);
> - task_unlock(p);
> - put_task_struct(p);
> - return;
> - }
> - task_unlock(p);
> -
> - if (__ratelimit(&oom_rs))
> - dump_header(oc, p);
> -
> - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> - message, task_pid_nr(p), p->comm, points);
> -
> - /*
> -  * If any of p's children has a different mm and is eligible for kill,
> -  * the one with the highest oom_badness() score is sacrificed for its
> -  * parent.  This attempts to lose the minimal amount of work done while
> -  * still freeing memory.
> -  */
> - read_lock(&tasklist_lock);
> - for_each_thread(p, t) {
> - list_for_each_entry(child, &t->children, sibling) {
> - unsigned int child_points;
> -
> - if (process_shares_mm(child, p->mm))
> - continue;
> - /*
> -  * oom_badness() returns 0 if the thread is unkillable
> -  */
> - child_points = oom_badness(child,
> - oc->memcg, oc->nodemask, oc->totalpages);
> - if (child_points > victim_points) {
> - put_task_struct(victim);
> - victim = child;
> - victim_points = child_points;
> - get_task_struct(victim);
> - }
> - }
> - }
> - read_unlock(&tasklist_lock);
> -
>   p = find_lock_task_mm(victim);
>   if (!p) {
>   put_task_struct(victim);
> @@ -939,6 +883,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  }
>  #undef K
>  
> +static void oom_kill_process(struct oom_control *oc, const char *message)
> +{
> + struct task_struct *

Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-25 Thread Michal Hocko
On Fri 25-08-17 11:39:51, Roman Gushchin wrote:
> On Fri, Aug 25, 2017 at 10:14:03AM +0200, Michal Hocko wrote:
> > On Thu 24-08-17 15:58:01, Roman Gushchin wrote:
> > > On Thu, Aug 24, 2017 at 04:13:37PM +0200, Michal Hocko wrote:
> > > > On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
> > [...]
> > > > > Both ways are not ideal, and sum of the processes is not ideal too.
> > > > > Especially, if you take oom_score_adj into account. Will you respect 
> > > > > it?
> > > > 
> > > > Yes, and I do not see any reason why we shouldn't.
> > > 
> > > It makes things even more complicated.
> > > Right now a task's oom_score can be in the (~ -total_memory, ~ +2*total_memory)
> > > range, and if you start summing it, it can be multiplied by the number
> > > of tasks... Weird.
> > 
> > oom_score_adj is just a normalized bias, so if tasks inside the oom memcg
> > use it, the whole memcg will get the accumulated bias from all such tasks,
> > so it is not completely off. I agree that the more tasks use the bias the
> > more biased the whole memcg will be. This might or might not be a problem.
> > As you are trying to reimplement the existing oom killer implementation,
> > I do not think we can simply ignore an API which people are used to.
> > 
> > If this was a configurable oom policy then I could see how ignoring
> > oom_score_adj is acceptable because it would be an explicit opt-in.
> >
> > > It also will be different in case of system and memcg-wide OOM.
> > 
> > Why, we do honor oom_score_adj for the memcg OOM now and in fact the
> > kernel memcg OOM killer shouldn't be very much different from the global
> > one except for the tasks scope.
> 
> Assume you have two tasks (2Gb and 1Gb) in a cgroup with a limit of 3Gb.
> The second task has oom_score_adj +100. Total memory is 64Gb, for example.
> 
> In case of a memcg-wide OOM the first task will be selected;
> in case of a system-wide OOM, the second.
> 
> Personally I don't like this, but it looks like we have to respect
> oom_score_adj set to -1000, I'll alter my patch.

I cannot say I love how oom_score_adj works but it's been like
that for a long time and people do rely on it. So we cannot simply
change it under people's feet.
 
> > > > > I've started actually with such approach, but then found it weird.
> > > > > 
> > > > > > Besides that you have
> > > > > > to check each task for over-killing anyway. So I do not see any
> > > > > > performance merits here.
> > > > > 
> > > > > It's an implementation detail, and we can hopefully get rid of it at 
> > > > > some point.
> > > > 
> > > > Well, we might do some estimations and ignore oom scopes but that
> > > > sounds really complicated and error prone. Until we have anything like
> > > > that, I would start from tasks and build up what is necessary to make a
> > > > decision at the higher level.
> > > 
> > > Seriously speaking, do you have an example, when summing per-process
> > > oom_score will work better?
> > 
> > The primary reason I am pushing for this is to have the common iterator
> > code path (which we have since Vladimir has unified memcg and global oom
> > paths) and only parametrize the value calculation and victim selection.
> 
> I agree, but I'm not sure that we can (and have to) totally unify the way
> oom_score is calculated for processes and cgroups.
> 
> But I'd like to see a unified oom_priority approach. This will allow
> defining an OOM killing order in a clear way, and using size-based tiebreaking
> for items of the same priority. Root-cgroup processes will be compared with
> other memory consumers by oom_priority first and oom_score afterwards.

This again changes the existing semantic so I really think we should be
careful and this all should be opt-in.
-- 
Michal Hocko
SUSE Labs


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-25 Thread Michal Hocko
On Thu 24-08-17 15:58:01, Roman Gushchin wrote:
> On Thu, Aug 24, 2017 at 04:13:37PM +0200, Michal Hocko wrote:
> > On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
[...]
> > > Both ways are not ideal, and sum of the processes is not ideal too.
> > > Especially, if you take oom_score_adj into account. Will you respect it?
> > 
> > Yes, and I do not see any reason why we shouldn't.
> 
> It makes things even more complicated.
> Right now a task's oom_score can be in the (~ -total_memory, ~ +2*total_memory)
> range, and if you start summing it, it can be multiplied by the number
> of tasks... Weird.

oom_score_adj is just a normalized bias, so if tasks inside the oom memcg
use it, the whole memcg will get the accumulated bias from all such tasks,
so it is not completely off. I agree that the more tasks use the bias the
more biased the whole memcg will be. This might or might not be a problem.
As you are trying to reimplement the existing oom killer implementation,
I do not think we can simply ignore an API which people are used to.

If this was a configurable oom policy then I could see how ignoring
oom_score_adj is acceptable because it would be an explicit opt-in.

> It also will be different in case of system and memcg-wide OOM.

Why, we do honor oom_score_adj for the memcg OOM now and in fact the
kernel memcg OOM killer shouldn't be very much different from the global
one except for the tasks scope.

> > > I've started actually with such approach, but then found it weird.
> > > 
> > > > Besides that you have
> > > > to check each task for over-killing anyway. So I do not see any
> > > > performance merits here.
> > > 
> > > It's an implementation detail, and we can hopefully get rid of it at some 
> > > point.
> > 
> > Well, we might do some estimations and ignore oom scopes but that
> > sounds really complicated and error prone. Until we have anything like
> > that, I would start from tasks and build up what is necessary to make a
> > decision at the higher level.
> 
> Seriously speaking, do you have an example, when summing per-process
> oom_score will work better?

The primary reason I am pushing for this is to have the common iterator
code path (which we have since Vladimir has unified memcg and global oom
paths) and only parametrize the value calculation and victim selection.

> Especially, if we're talking about customizing oom_score calculation,
> it makes no sense to me. How will you sum process timestamps?

Well, I meant you could sum oom_badness for your particular
implementation. If we need some other policy then this wouldn't work and
that's why I've said that I would like to preserve the current common
code and only parametrize value calculation and victim selection...
-- 
Michal Hocko
SUSE Labs


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Michal Hocko
On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
> On Thu, Aug 24, 2017 at 02:58:11PM +0200, Michal Hocko wrote:
> > On Thu 24-08-17 13:28:46, Roman Gushchin wrote:
> > > Hi Michal!
> > > 
> > There is nothing like a "better victim". We are pretty much in a
> > catastrophic situation when we try to survive by killing userspace.
> 
> Not necessary, it can be a cgroup OOM.

memcg OOM is no different. The catastrophe is scoped to the specific
hierarchy, but tasks in that hierarchy still fail to make further
progress.

> > We try to kill the largest because that assumes that we return the
> > most memory from it. Now I do understand that you want to treat the
> > memcg as a single killable entity but I find it really questionable
> > to do a per-memcg metric and then do not treat it like that and kill
> > only a single task. Just imagine a single memcg with zillions of tasks,
> > each very small, and you select it as the largest while killing a single
> > small task doesn't help to get us out of the OOM.
> 
> I don't think it's different from a non-containerized state: if you
> have a zillion small tasks in the system, you'll meet the same issues.

Yes this is possible but usually you are comparing apples to apples so
you will kill the largest offender and then go on. To be honest I really
do hate how we try to kill a child rather than the selected victim
for the same reason.

> > > > I guess I have asked already and we haven't reached any consensus. I do
> > > > not like how you treat memcgs and tasks differently. Why cannot we have
> > > > a memcg score a sum of all its tasks?
> > > 
> > > It sounds like a more expensive way to get almost the same result with
> > > less accuracy. Why is it better?
> > 
> > because then you are comparing apples to apples?
> 
> Well, I can say that I compare some number of pages against some other number
> of pages. And the relation between a page and a memcg is more obvious than
> the relation between a page and a process.

But you are comparing different accounting systems.
 
> Both ways are not ideal, and sum of the processes is not ideal too.
> Especially, if you take oom_score_adj into account. Will you respect it?

Yes, and I do not see any reason why we shouldn't.

> I've started actually with such approach, but then found it weird.
> 
> > Besides that you have
> > to check each task for over-killing anyway. So I do not see any
> > performance merits here.
> 
> It's an implementation detail, and we can hopefully get rid of it at some 
> point.

Well, we might do some estimations and ignore oom scopes but that
sounds really complicated and error prone. Until we have anything like
that, I would start from tasks and build up what is necessary to make a
decision at the higher level.
 
> > > > How do you want to compare memcg score with tasks score?
> > > 
> > > I have to do it for tasks in root cgroups, but it shouldn't be a common 
> > > case.
> > 
> > How come? I can easily imagine a setup where only some memcgs which
> > really do need a kill-all semantic while all others can live with single
> > task killed perfectly fine.
> 
> I mean, taking a unified cgroup hierarchy into account, there should not
> be a lot of tasks in the root cgroup, if any.

Is that really the case? I would assume that memory controller would be
enabled only in those subtrees which really use the functionality and
the rest will be sitting in the root memcg. It might be the case if you
are running only containers but I am not really sure this is true in
general.
-- 
Michal Hocko
SUSE Labs


Re: [v6 3/4] mm, oom: introduce oom_priority for memory cgroups

2017-08-24 Thread Michal Hocko
On Thu 24-08-17 13:51:13, Roman Gushchin wrote:
> On Thu, Aug 24, 2017 at 02:10:54PM +0200, Michal Hocko wrote:
> > On Wed 23-08-17 17:52:00, Roman Gushchin wrote:
> > > Introduce a per-memory-cgroup oom_priority setting: an integer number
> > > within the [-1, 1] range, which defines the order in which
> > > the OOM killer selects victim memory cgroups.
> > 
> > Why do we need a range here?
> 
> No specific reason, both [INT_MIN, INT_MAX] and [-1, 1] will
> work equally.

Then do not enforce a range because this just reduces possible usecases
(e.g. timestamp one...).

> We should be able to predefine an OOM killing order for
> any reasonable amount of cgroups.
> 
> > 
> > > OOM killer prefers memory cgroups with larger priority if they are
> > > populated with eligible tasks.
> > 
> > So this is basically orthogonal to the score based selection and the
> > real size is only the tiebreaker for same priorities? Could you describe
the usecase? Because to me this sounds like a separate oom killer
> > strategy. I can imagine somebody might be interested (e.g. always kill
> > the oldest memcgs...) but an explicit range wouldn't fly with such a
> > usecase very well.
> 
> The usecase: you have a machine with several containerized workloads
> of different importance, and some system-level stuff, also in (memory)
> cgroups.
> In case of global memory shortage, some workloads should be killed in
> a first order, others should be killed only if there is no other option.
> Several workloads can have equal importance. Size-based tiebreaking
> is very useful to catch memory leakers amongst them.

OK, please document that in the changelog.

> > That brings me back to my original suggestion. Wouldn't a "register an
> > oom strategy" approach be much better than blending things together and
> > then having to wrap our heads around different combinations of tunables?
> 
> Well, I believe that 90% of this patchset is still relevant;

agreed and didn't say otherwise.

> the only
> thing you might want to customize is replacing size-based tiebreaking with
> something else (like timestamp-based tiebreaking, mentioned by David earlier).

> What about tunables, there are two, and they are completely orthogonal:
> 1) oom_priority allows defining the order in which cgroups will be OOMed
> 2) oom_kill_all defines if all or just one task should be killed
> 
> So, I don't think it's too complex an interface.
> 
> Again, I'm not against the oom strategy approach, it just looks like a much
> bigger project, and I do not see a big need.

Well, I was thinking that our current oom victim selection code is
quite extensible already. Your patches will teach it the
kill-the-whole-group semantic, which is already very useful. Now we can
talk about the selection criteria, and this is something that should be
replaceable, because even the current discussion has shown that different
people might and will have different requirements. Can we structure the
code in such a way that a new comparison algorithm would be simple to add
without reworking the whole selection logic?

> Do you have an example, which can't be effectively handled by an approach
> I'm suggesting?

No, I do not have any that would be _explicitly_ requested, but I do
envision that new requirements will emerge. The most probable one would be to
kill the youngest container because that would imply the least amount of
work wasted.
-- 
Michal Hocko
SUSE Labs


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Michal Hocko
On Thu 24-08-17 13:28:46, Roman Gushchin wrote:
> Hi Michal!
> 
> On Thu, Aug 24, 2017 at 01:47:06PM +0200, Michal Hocko wrote:
> > This doesn't apply on top of mmotm cleanly. You are missing
> > http://lkml.kernel.org/r/20170807113839.16695-3-mho...@kernel.org
> 
> I'll rebase. Thanks!
> 
> > 
> > On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
> > > Traditionally, the OOM killer is operating on a process level.
> > > Under oom conditions, it finds a process with the highest oom score
> > > and kills it.
> > > 
> > > This behavior doesn't suit a system with many running
> > > containers well:
> > > 
> > > 1) There is no fairness between containers. A small container with
> > > few large processes will be chosen over a large one with a huge
> > > number of small processes.
> > > 
> > > 2) Containers often do not expect that some random process inside
> > > will be killed. In many cases a much safer behavior is to kill
> > > all tasks in the container. Traditionally, this was implemented
> > > in userspace, but doing it in the kernel has some advantages,
> > > especially in a case of a system-wide OOM.
> > > 
> > > 3) Per-process oom_score_adj affects global OOM, so it's a breach
> > > in the isolation.
> > 
> > Please explain more. I guess you mean that an untrusted memcg could hide
> > itself from the global OOM killer by reducing the oom scores? Well you
need CAP_SYS_RESOURCE to reduce the current oom_score{_adj} as David has
> > already pointed out. I also agree that we absolutely must not kill an
> > oom disabled task. I am pretty sure somebody is using OOM_SCORE_ADJ_MIN
> > as a protection from an untrusted SIGKILL and inconsistent state as a
> > result. Those applications simply shouldn't behave differently in the
> > global and container contexts.
> 
> The main point of the kill_all option is to clean up the victim cgroup
> _completely_. If some tasks can survive, that means userspace should
> take care of them, look at the cgroup after oom, and kill the survivors
> manually.
> 
> If you want to rely on OOM_SCORE_ADJ_MIN, don't set kill_all.
> I really don't get the usecase for this "kill all, except this and that".

OOM_SCORE_ADJ_MIN has become a de-facto contract. You cannot simply
expect that somebody would alter a specific workload for a container
just to be safe against an unexpected SIGKILL. kill-all might be set up in
the memcg hierarchy which is out of your control.

> Also, it's really confusing to respect the -1000 value, and completely ignore
> -999.
> 
> I believe that any complex userspace OOM handling should use memory.high
> and handle memory shortage before an actual OOM.
> 
> > 
> > If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the kill.
> > 
> > > To address these issues, cgroup-aware OOM killer is introduced.
> > > 
> > > Under OOM conditions, it tries to find the biggest memory consumer,
> > > and free memory by killing the corresponding task(s). The difference
> > > from the "traditional" OOM killer is that it can treat memory cgroups
> > > as memory consumers as well as single processes.
> > > 
> > > By default, it will look for the biggest leaf cgroup, and kill
> > > the largest task inside.
> > 
> > Why? I believe that the semantic should be as simple as kill the largest
> > oom killable entity. And the entity is either a process or a memcg which
> > is marked that way.
> 
> So, you still need to compare memcgroups and processes.
> 
> In my case, it's more like an exception (only processes from root memcg,
> and only if there are no eligible cgroups with lower oom_priority).
> You suggest to rely on this comparison.
> 
> > Why should we mix things and select a memcg to kill
> > a process inside it? More on that below.
> 
> To have some sort of "fairness" in a containerized environment.
> Say, 1 cgroup with 1 big task, another cgroup with many smaller tasks.
> It's not necessarily true that the first one is a better victim.

There is nothing like a "better victim". We are pretty much in a
catastrophic situation when we try to survive by killing a userspace.
We try to kill the largest because that assumes that we return the
most memory from it. Now I do understand that you want to treat the
memcg as a single killable entity but I find it really questionable
to do a per-memcg metric and then do not treat it like that and kill
only a single task. Just imagine a single memcg with zillions of tasks,
each very small, and you select it as the largest while a small task
itself 

Re: [v6 3/4] mm, oom: introduce oom_priority for memory cgroups

2017-08-24 Thread Michal Hocko
On Wed 23-08-17 17:52:00, Roman Gushchin wrote:
> Introduce a per-memory-cgroup oom_priority setting: an integer number
> within the [-1, 1] range, which defines the order in which
> the OOM killer selects victim memory cgroups.

Why do we need a range here?

> OOM killer prefers memory cgroups with larger priority if they are
> populated with eligible tasks.

So this is basically orthogonal to the score based selection and the
real size is only the tiebreaker for same priorities? Could you describe
the usecase? Because to me this sounds like a separate oom killer
strategy. I can imagine somebody might be interested (e.g. always kill
the oldest memcgs...) but an explicit range wouldn't fly with such a
usecase very well.

That brings me back to my original suggestion. Wouldn't a "register an
oom strategy" approach be much better than blending things together and
then having to wrap our heads around different combinations of tunables?

[...]
> @@ -2760,7 +2761,12 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>   if (iter->oom_score == 0)
>   continue;
>  
> - if (iter->oom_score > score) {
> + if (iter->oom_priority > prio) {
> + memcg = iter;
> + prio = iter->oom_priority;
> + score = iter->oom_score;
> + } else if (iter->oom_priority == prio &&
> +iter->oom_score > score) {
>   memcg = iter;
>   score = iter->oom_score;
>   }

Just a minor thing. Why do we even have to calculate oom_score when we
use it only as a tiebreaker?
-- 
Michal Hocko
SUSE Labs


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Michal Hocko
This doesn't apply on top of mmotm cleanly. You are missing
http://lkml.kernel.org/r/20170807113839.16695-3-mho...@kernel.org

On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit a system with many running
> containers well:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with a huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> in the isolation.

Please explain more. I guess you mean that an untrusted memcg could hide
itself from the global OOM killer by reducing the oom scores? Well you
need CAP_SYS_RESOURCE to reduce the current oom_score{_adj} as David has
already pointed out. I also agree that we absolutely must not kill an
oom disabled task. I am pretty sure somebody is using OOM_SCORE_ADJ_MIN
as a protection from an untrusted SIGKILL and inconsistent state as a
result. Those applications simply shouldn't behave differently in the
global and container contexts.

If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the kill.

> To address these issues, cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing the corresponding task(s). The difference
> from the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
> 
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.

Why? I believe that the semantic should be as simple as kill the largest
oom killable entity. And the entity is either a process or a memcg which
is marked that way. Why should we mix things and select a memcg to kill
a process inside it? More on that below.

> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it is
> selected as an OOM victim, all belonging tasks will be killed.
> 
> Tasks in the root cgroup are treated as independent memory consumers,
> and are compared with other memory consumers (e.g. leaf cgroups).
> The root cgroup doesn't support the oom_kill_all_tasks feature.

If anything, you wouldn't have to treat the root memcg specially. It
will be like any other memcg which doesn't have oom_kill_all_tasks...
 
[...]

> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +   const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> + pgdat = NODE_DATA(nid);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;

I guess I have asked already and we haven't reached any consensus. I do
not like how you treat memcgs and tasks differently. Why can't a memcg
score be the sum of all its tasks' scores? How do you want to compare a
memcg score with a task score? This just smells like the outcome of the
weird semantic of selecting the largest group that I have mentioned
above.

This is a rather fundamental concern and I believe we should find a
consensus on it before going any further. I believe that users shouldn't
see any difference in the OOM behavior when memcg v2 is used and there
is no kill-all memcg. If there is such a memcg then we should treat only
those specially. But you might have really strong usecases which haven't
been presented or I've missed their importance.

-- 
Michal Hocko
SUSE Labs


Re: [v6 1/4] mm, oom: refactor the oom_kill_process() function

2017-08-24 Thread Michal Hocko
This patch fails to apply on top of the mmotm tree. It seems the only
reason is the missing
http://lkml.kernel.org/r/20170810075019.28998-2-mho...@kernel.org

On Wed 23-08-17 17:51:57, Roman Gushchin wrote:
> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> an intention to re-use the second half: __oom_kill_process().

Yes this makes some sense even without further changes.

> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for each task, nor play
> with task selection (considering the task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

I do agree with the patch there is just one thing to fix up.

> ---
>  mm/oom_kill.c | 123 +++---
>  1 file changed, 65 insertions(+), 58 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 53b44425ef35..5c29a3dd591b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -817,67 +817,12 @@ static bool task_will_free_mem(struct task_struct *task)
>   return ret;
>  }
>  
> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)
>  {
[...]
>   p = find_lock_task_mm(victim);
>   if (!p) {
>   put_task_struct(victim);

The context doesn't tell us, but there is a return right after this.

p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
return;
} else if (victim != p) {
get_task_struct(p);
put_task_struct(victim);
victim = p;
}

So we return with the reference dropped. Moreover, we can change
the victim and drop the reference on the old one...

> +static void oom_kill_process(struct oom_control *oc, const char *message)
> +{
[...]
> + __oom_kill_process(victim);
> + put_task_struct(victim);

while we drop it here again and won't drop the changed one. If we race
with an exiting task and there is no mm then we double drop as well.
So I think that __oom_kill_process should really drop the reference in
all cases and oom_kill_process shouldn't care. Or, if you absolutely
need a guarantee that the victim won't go away after __oom_kill_process,
then you need to return the real victim and let the caller deal with
put_task_struct.
-- 
Michal Hocko
SUSE Labs


Re: [v4 2/4] mm, oom: cgroup-aware OOM killer

2017-08-03 Thread Michal Hocko
On Thu 03-08-17 13:47:51, Roman Gushchin wrote:
> On Wed, Aug 02, 2017 at 09:29:01AM +0200, Michal Hocko wrote:
> > On Tue 01-08-17 19:13:52, Roman Gushchin wrote:
> > > On Tue, Aug 01, 2017 at 07:03:03PM +0200, Michal Hocko wrote:
> > > > On Tue 01-08-17 16:25:48, Roman Gushchin wrote:
> > > > > On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > I would reap out the oom_kill_process into a separate patch.
> > > > > 
> > > > > It was a separate patch, I've merged it based on Vladimir's feedback.
> > > > > No problems, I can divide it back.
> > > > 
> > > > It would make the review slightly easier
> > > > > 
> > > > > > > -static void oom_kill_process(struct oom_control *oc, const char 
> > > > > > > *message)
> > > > > > > +static void __oom_kill_process(struct task_struct *victim)
> > > > > > 
> > > > > > To the rest of the patch. I have to say I do not quite like how it 
> > > > > > is
> > > > > > implemented. I was hoping for something much simpler which would 
> > > > > > hook
> > > > > > into oom_evaluate_task. If a task belongs to a memcg with kill-all 
> > > > > > flag
> > > > > > then we would update the cumulative memcg badness (more 
> > > > > > specifically the
> > > > > > badness of the topmost parent with kill-all flag). Memcg will then
> > > > > > compete with existing self contained tasks (oom_badness will have to
> > > > > > tell whether points belong to a task or a memcg to allow the caller 
> > > > > > to
> > > > > > deal with it). But it shouldn't be much more complex than that.
> > > > > 
> > > > > I'm not sure, it will be any simpler. Basically I'm doing the same:
> > > > > the difference is that you want to iterate over tasks and for each
> > > > > task traverse the memcg tree, update per-cgroup oom score and find
> > > > > the corresponding memcg(s) with the kill-all flag. I'm doing the 
> > > > > opposite:
> > > > > traverse the cgroup tree, and for each leaf cgroup iterate over 
> > > > > processes.
> > > > 
> > > > Yeah but this doesn't fit very well to the existing scheme so we would
> > > > need two different schemes which is not ideal from maint. point of view.
> > > > We also do not have to duplicate all the tricky checks we already do in
> > > > oom_evaluate_task. So I would prefer if we could try to hook there and
> > > > do the special handling there.
> > > 
> > > I hope, that iterating over all tasks just to check if there are
> > > in-flight OOM victims might be optimized at some point.
> > > That means, we would be able to choose a victim much cheaper.
> > > It's not easy, but it feels like the right direction to go.
> > 
> > You would have to count per each oom domain and that sounds quite
> > unfeasible to me.
> 
> It's hard, but traversing the whole cgroup tree from bottom to top
> for each task is just not scalable.

We are talking about the oom path which is a slow path. Besides that
memcg hierarchies will not be very deep usually (we are not talking
about hundreds). 

> This is exactly why I've choosen a compromise right now: let's
> iterate over all tasks, but do it by iterating over the cgroup tree.
> 
> >  
> > > Also, adding new tricks to the oom_evaluate_task() will make the code
> > > even more hairy. Some of the existing tricks are useless for memcg 
> > > selection.
> > 
> > Not sure what you mean but oom_evaluate_task has been usable for both
> > global and memcg oom paths so far. I do not see any reason why this
> > shouldn't hold for a different oom killing strategy.
> 
> Yes, but in both cases we've evaluated tasks, not cgroups.
> 
> > 
> > > > > Also, please note, that even without the kill-all flag the decision 
> > > > > is made
> > > > > on per-cgroup level (except tasks in the root cgroup).
> > > > 
> > > > Yeah and I am not sure this is a reasonable behavior. Why should we
> > > > consider memcgs which are not kill-all as a single entity?
> > > 
> > > I think, it's reasonable to choose a cgroup/container to blow off based on
> > > the cgroup oom_priority/size (including hierarchical settings),

Re: [v4 2/4] mm, oom: cgroup-aware OOM killer

2017-08-02 Thread Michal Hocko
On Tue 01-08-17 19:13:52, Roman Gushchin wrote:
> On Tue, Aug 01, 2017 at 07:03:03PM +0200, Michal Hocko wrote:
> > On Tue 01-08-17 16:25:48, Roman Gushchin wrote:
> > > On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
> > [...]
> > > > I would reap out the oom_kill_process into a separate patch.
> > > 
> > > It was a separate patch, I've merged it based on Vladimir's feedback.
> > > No problems, I can divide it back.
> > 
> > It would make the review slightly easier
> > > 
> > > > > -static void oom_kill_process(struct oom_control *oc, const char 
> > > > > *message)
> > > > > +static void __oom_kill_process(struct task_struct *victim)
> > > > 
> > > > To the rest of the patch. I have to say I do not quite like how it is
> > > > implemented. I was hoping for something much simpler which would hook
> > > > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > > > then we would update the cumulative memcg badness (more specifically the
> > > > badness of the topmost parent with kill-all flag). Memcg will then
> > > > compete with existing self contained tasks (oom_badness will have to
> > > > tell whether points belong to a task or a memcg to allow the caller to
> > > > deal with it). But it shouldn't be much more complex than that.
> > > 
> > > I'm not sure, it will be any simpler. Basically I'm doing the same:
> > > the difference is that you want to iterate over tasks and for each
> > > task traverse the memcg tree, update per-cgroup oom score and find
> > > the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> > > traverse the cgroup tree, and for each leaf cgroup iterate over processes.
> > 
> > Yeah but this doesn't fit very well to the existing scheme so we would
> > need two different schemes which is not ideal from maint. point of view.
> > We also do not have to duplicate all the tricky checks we already do in
> > oom_evaluate_task. So I would prefer if we could try to hook there and
> > do the special handling there.
> 
> I hope, that iterating over all tasks just to check if there are
> in-flight OOM victims might be optimized at some point.
> That means, we would be able to choose a victim much cheaper.
> It's not easy, but it feels like the right direction to go.

You would have to count per each oom domain and that sounds quite
unfeasible to me.
 
> Also, adding new tricks to the oom_evaluate_task() will make the code
> even more hairy. Some of the existing tricks are useless for memcg selection.

Not sure what you mean but oom_evaluate_task has been usable for both
global and memcg oom paths so far. I do not see any reason why this
shouldn't hold for a different oom killing strategy.

> > > Also, please note, that even without the kill-all flag the decision is 
> > > made
> > > on per-cgroup level (except tasks in the root cgroup).
> > 
> > Yeah and I am not sure this is a reasonable behavior. Why should we
> > consider memcgs which are not kill-all as a single entity?
> 
> I think, it's reasonable to choose a cgroup/container to blow off based on
> the cgroup oom_priority/size (including hierarchical settings), and then
> kill one biggest or all tasks depending on cgroup settings.

But that doesn't mean you have to treat even !kill-all memcgs like a
single entity. In fact, we should compare killable entities, which are
either a task or the whole memcg if configured that way.
-- 
Michal Hocko
SUSE Labs


Re: [v4 2/4] mm, oom: cgroup-aware OOM killer

2017-08-01 Thread Michal Hocko
On Tue 01-08-17 16:25:48, Roman Gushchin wrote:
> On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
[...]
> > I would reap out the oom_kill_process into a separate patch.
> 
> It was a separate patch, I've merged it based on Vladimir's feedback.
> No problems, I can divide it back.

It would make the review slightly easier
> 
> > > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > > +static void __oom_kill_process(struct task_struct *victim)
> > 
> > To the rest of the patch. I have to say I do not quite like how it is
> > implemented. I was hoping for something much simpler which would hook
> > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > then we would update the cumulative memcg badness (more specifically the
> > badness of the topmost parent with kill-all flag). Memcg will then
> > compete with existing self contained tasks (oom_badness will have to
> > tell whether points belong to a task or a memcg to allow the caller to
> > deal with it). But it shouldn't be much more complex than that.
> 
> I'm not sure, it will be any simpler. Basically I'm doing the same:
> the difference is that you want to iterate over tasks and for each
> task traverse the memcg tree, update per-cgroup oom score and find
> the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> traverse the cgroup tree, and for each leaf cgroup iterate over processes.

Yeah but this doesn't fit very well to the existing scheme so we would
need two different schemes which is not ideal from maint. point of view.
We also do not have to duplicate all the tricky checks we already do in
oom_evaluate_task. So I would prefer if we could try to hook there and
do the special handling there.

> Also, please note, that even without the kill-all flag the decision is made
> on per-cgroup level (except tasks in the root cgroup).

Yeah and I am not sure this is a reasonable behavior. Why should we
consider memcgs which are not kill-all as a single entity?

-- 
Michal Hocko
SUSE Labs


Re: [v4 2/4] mm, oom: cgroup-aware OOM killer

2017-08-01 Thread Michal Hocko
On Wed 26-07-17 14:27:16, Roman Gushchin wrote:
[...]
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +   const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;

I am wondering why you are diverging from the global oom_badness
behavior here. Doing per-NUMA accounting sounds like a better idea,
but then you just end up mixing this with non-NUMA numbers and the
whole thing is harder to understand without great advantages.

> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter, *parent;
> +
> + for_each_mem_cgroup_tree(iter, root) {
> + if (memcg_has_children(iter)) {
> + iter->oom_score = 0;
> + continue;
> + }
> +
> + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> + if (iter->oom_score == -1) {
> + oc->chosen_memcg = (void *)-1UL;
> + mem_cgroup_iter_break(root, iter);
> + return;
> + }
> +
> + if (!iter->oom_score)
> + continue;
> +
> + for (parent = parent_mem_cgroup(iter); parent && parent != root;
> +  parent = parent_mem_cgroup(parent))
> + parent->oom_score += iter->oom_score;
> + }
> +
> + for (;;) {
> + struct cgroup_subsys_state *css;
> + struct mem_cgroup *memcg = NULL;
> + long score = LONG_MIN;
> +
> + css_for_each_child(css, &root->css) {
> + struct mem_cgroup *iter = mem_cgroup_from_css(css);
> +
> + if (iter->oom_score > score) {
> + memcg = iter;
> + score = iter->oom_score;
> + }
> + }
> +
> + if (!memcg) {
> + if (oc->memcg && root == oc->memcg) {
> + oc->chosen_memcg = oc->memcg;
> + css_get(&oc->chosen_memcg->css);
> + oc->chosen_points = oc->memcg->oom_score;
> + }
> + break;
> + }
> +
> + if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) {
> + oc->chosen_memcg = memcg;
> + css_get(&oc->chosen_memcg->css);
> + oc->chosen_points = score;
> + break;
> + }
> +
> + root = memcg;
> + }
> +}

This and the rest of the victim selection code is really hairy and hard
to follow.

I would reap out the oom_kill_process into a separate patch.

> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)

To the rest of the patch. I have to say I do not quite like how it is
implemented. I was hoping for something much simpler which would hook
into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
then we would update the cumulative memcg badness (more specifically the
badness of the topmost parent with kill-all flag). Memcg will then
compete with existing self contained tasks (oom_badness will have to
tell whether points belong to a task or a memcg to allow the caller to
deal with it). But it shouldn't be much more complex than that.

Or is there something that I am missing and that would prevent such a
simple approach?
-- 
Michal Hocko
SUSE Labs


Re: [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage

2017-07-26 Thread Michal Hocko
On Wed 26-07-17 16:24:34, Michal Hocko wrote:
[...]
> Or if you prefer I can post it separately?

I've just tried to rebase relevant parts on top of the current mmotm
tree and it needs some non-trivial updates. Would you mind if I post
those patches with you on CC? I really think that we shouldn't invent a
new throttling just to replace it later.

I will comment on the rest of the series later. I have glanced through
it and have to digest it some more before replying.
-- 
Michal Hocko
SUSE Labs


Re: [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage

2017-07-26 Thread Michal Hocko
On Wed 26-07-17 15:06:07, Roman Gushchin wrote:
> On Wed, Jul 26, 2017 at 03:56:22PM +0200, Michal Hocko wrote:
> > On Wed 26-07-17 14:27:15, Roman Gushchin wrote:
> > [...]
> > > @@ -656,13 +658,24 @@ static void mark_oom_victim(struct task_struct *tsk)
> > >  	struct mm_struct *mm = tsk->mm;
> > >  
> > >  	WARN_ON(oom_killer_disabled);
> > > -	/* OOM killer might race with memcg OOM */
> > > -	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > +
> > > +	if (!cmpxchg(&tif_memdie_owner, NULL, current)) {
> > > +		struct task_struct *t;
> > > +
> > > +		rcu_read_lock();
> > > +		for_each_thread(current, t)
> > > +			set_tsk_thread_flag(t, TIF_MEMDIE);
> > > +		rcu_read_unlock();
> > > +	}
> > 
> > I would really much rather see us limit the amount of memory reserves oom
> > victims can consume rather than build on top of the current hackish
> > approach of limiting the number of tasks because the fundamental problem
> > is still there (a heavy multithreaded process can still deplete the
> > reserves completely).
> > 
> > Is there really any reason to not go with the existing patch I've
> > pointed to the last time around? You didn't seem to have any objections
> > back then.
> 
> Hi Michal!
> 
> I had this patch in mind and mentioned in the commit log, that TIF_MEMDIE
> as an memory reserves access indicator will probably be eliminated later.
> 
> But that patch is not upstream yet, and it's directly related to the theme.
> The proposed refactoring of TIF_MEMDIE usage is not against your approach,
> and will not make harder to go this way further.

So what is the reason to invent another tricky way to limit access to
memory reserves? The patch is not upstream but nothing prevents you from
picking it up and posting it along with your series. Or if you prefer I
can post it separately?

> I'm slightly concerned about an idea to give TIF_MEMDIE to all tasks
> in case we're killing a really large cgroup.

Why? There shouldn't be any issue if the access is limited.
-- 
Michal Hocko
SUSE Labs


Re: [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage

2017-07-26 Thread Michal Hocko
On Wed 26-07-17 14:27:15, Roman Gushchin wrote:
[...]
> @@ -656,13 +658,24 @@ static void mark_oom_victim(struct task_struct *tsk)
>  	struct mm_struct *mm = tsk->mm;
>  
>  	WARN_ON(oom_killer_disabled);
> -	/* OOM killer might race with memcg OOM */
> -	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> +
> +	if (!cmpxchg(&tif_memdie_owner, NULL, current)) {
> +		struct task_struct *t;
> +
> +		rcu_read_lock();
> +		for_each_thread(current, t)
> +			set_tsk_thread_flag(t, TIF_MEMDIE);
> +		rcu_read_unlock();
> +	}

I would really much rather see us limit the amount of memory reserves oom
victims can consume rather than build on top of the current hackish
approach of limiting the number of tasks because the fundamental problem
is still there (a heavy multithreaded process can still deplete the
reserves completely).
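The alternative being argued for here, partial rather than unlimited reserve access, boils down to arithmetic along these lines. The halving factor and the function name are illustrative stand-ins, not the kernel's actual watermark code.

```c
/*
 * Toy watermark check: instead of ALLOC_NO_WATERMARKS-style unlimited
 * access for every OOM victim, a victim may only dip partway into the
 * reserves (here: down to half of the min watermark). A depleting
 * multithreaded victim therefore cannot consume the reserves completely.
 */
static int watermark_ok(long free_pages, long min_wmark, int oom_victim)
{
	long min = min_wmark;

	if (oom_victim)
		min -= min / 2;	/* partial, bounded reserve access */
	return free_pages > min;
}
```

With this shape of check, how many tasks carry the victim flag no longer matters for reserve exhaustion, which is the point being made above.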

Is there really any reason to not go with the existing patch I've
pointed to the last time around? You didn't seem to have any objections
back then.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v5 00/38] powerpc: Memory Protection Keys

2017-07-13 Thread Michal Hocko
On Thu 13-07-17 08:53:52, Benjamin Herrenschmidt wrote:
> On Wed, 2017-07-12 at 09:23 +0200, Michal Hocko wrote:
> > 
> > > 
> > > Ideally the MMU looks at the PTE for keys, in order to enforce
> > > protection. This is the case with x86 and is the case with power9 Radix
> > > page table. Hence the keys have to be programmed into the PTE.
> > 
> > But x86 doesn't update ptes for PKEYs, that would be just too expensive.
> > You could use standard mprotect to do the same...
> 
> What do you mean ? x86 ends up in mprotect_fixup -> change_protection()
> which will update the PTEs just the same as we do.
> 
> Changing the key for a page is a form of mprotect. Changing the access
> permissions for keys is different; for us it's a special register
> (AMR).
> 
> I don't understand why you think we are doing any differently than x86
> here.

That was a misunderstanding on my side, as explained in the other reply.

> > > However with HPT on power, these keys do not necessarily have to be
> > > programmed into the PTE. We could bypass the Linux Page Table Entry (PTE)
> > > and instead just program them into the Hash Page Table (HPTE), since
> > > the MMU does not refer to the PTE but to the HPTE. The last version
> > > of the patchset attempted to do that. It worked as follows:
> > > 
> > > a) when an address range is requested to be associated with a key by the
> > >    application through the key_mprotect() system call, the kernel
> > >    stores that key in the vmas corresponding to that address
> > >    range.
> > > 
> > > b) Whenever there is a hash page fault for that address, the fault
> > >handler reads the key from the VMA and programs the key into the
> > >HPTE. __hash_page() is the function that does that.
> > 
> > What causes the fault here?
> 
> The hardware. With the hash MMU, the HW walks a hash table which is
> effectively a large in-memory TLB extension. When a page isn't found
> there, a "hash fault" is generated allowing Linux to populate that
> hash table with the content of the corresponding PTE.

Thanks for the clarification.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v5 00/38] powerpc: Memory Protection Keys

2017-07-12 Thread Michal Hocko
On Wed 12-07-17 09:23:37, Michal Hocko wrote:
> On Tue 11-07-17 12:32:57, Ram Pai wrote:
[...]
> > Ideally the MMU looks at the PTE for keys, in order to enforce
> > protection. This is the case with x86 and is the case with power9 Radix
> > page table. Hence the keys have to be programmed into the PTE.
> 
> But x86 doesn't update ptes for PKEYs, that would be just too expensive.
> You could use standard mprotect to do the same...

OK, this seems to be a misunderstanding and confusion on my end.
do_mprotect_pkey does mprotect_fixup even for the pkey path which is
quite surprising to me. I guess my misunderstanding comes from
Documentation/x86/protection-keys.txt
"
Memory Protection Keys provides a mechanism for enforcing page-based
protections, but without requiring modification of the page tables
when an application changes protection domains.  It works by
dedicating 4 previously ignored bits in each page table entry to a
"protection key", giving 16 possible keys.
"
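The quoted documentation implies that on x86 a permission change is just a register update: sixteen keys, two bits per key in the PKRU register. A sketch of that bit layout follows; the two rights constants match the usual userspace pkeys convention, but the helper names are made up for illustration.

```c
#include <stdint.h>

/* Per-key rights bits, as in the userspace pkeys convention. */
#define PKEY_DISABLE_ACCESS	0x1
#define PKEY_DISABLE_WRITE	0x2

/* Set the two rights bits for @pkey in a PKRU-style register value. */
static uint32_t pkru_set_perms(uint32_t pkru, int pkey, uint32_t rights)
{
	uint32_t shift = 2 * pkey;

	return (pkru & ~(0x3u << shift)) | (rights << shift);
}

/* Read back the two rights bits for @pkey. */
static uint32_t pkru_get_perms(uint32_t pkru, int pkey)
{
	return (pkru >> (2 * pkey)) & 0x3;
}
```

Because protection domains are switched by rewriting this one register, no page-table walk or TLB work is needed when an application changes key permissions, which is the cheapness being discussed above.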

So please disregard my previous comments about page tables and sorry
about the confusion.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v5 00/38] powerpc: Memory Protection Keys

2017-07-12 Thread Michal Hocko
On Tue 11-07-17 12:32:57, Ram Pai wrote:
> On Tue, Jul 11, 2017 at 04:52:46PM +0200, Michal Hocko wrote:
> > On Wed 05-07-17 14:21:37, Ram Pai wrote:
> > > Memory protection keys enable an application to protect its
> > > address space from inadvertent access or corruption by
> > > itself.
> > > 
> > > The overall idea:
> > > 
> > >  A process allocates a key and associates it with
> > >  an address range within its address space.
> > >  The process then can dynamically set read/write
> > >  permissions on the key without involving the
> > >  kernel. Any code that violates the permissions
> > >  of the address space, as defined by its associated
> > >  key, will receive a segmentation fault.
> > > 
> > > This patch series enables the feature on PPC64 HPTE
> > > platform.
> > > 
> > > ISA3.0 section 5.7.13 describes the detailed specifications.
> > 
> > Could you describe the high-level design of this feature in the cover
> > letter?
> 
> Yes it can be hard to understand without the big picture.  I will
> provide the high level design and the rationale behind the patch split
> towards the end.  Also I will have it in the cover letter for my next
> revision of the patchset.

Thanks!
 
> > I have tried to get some idea from the patchset but it was
> > really far from trivial. Patches are not very well split up (many
> > helpers are added without their users, etc.).
> 
> I see your point. Earlier, I had the patches split in such a way that the
> users of the helpers were in the same patch as that of the helper.
> But then comments from others led to the current split.

It is not my call here, obviously. I cannot review arch specific parts
due to lack of familiarity but it is a general good practice to include
helpers along with their users to make the usage clear. Also, as much as
I like small patches because they are easier to review, having very many
of them can lead to a harder review in the end because you easily lose
a higher level overview.

> > > Testing:
> > >   This patch series has passed all the protection key
> > >   tests available in  the selftests directory.
> > >   The tests are updated to work on both x86 and powerpc.
> > > 
> > > version v5:
> > >   (1) reverted back to the old design -- store the 
> > >   key in the pte, instead of bypassing it.
> > >   The v4 design slowed down the hash page path.
> > 
> > This surprised me a lot but I couldn't find the respective code. Why do
> > you need to store anything in the pte? My understanding of PKEYs is that
> > the setup and teardown should be very cheap and so no page tables have
> > to updated. Or do I just misunderstand what you wrote here?
> 
> Ideally the MMU looks at the PTE for keys, in order to enforce
> protection. This is the case with x86 and is the case with power9 Radix
> page table. Hence the keys have to be programmed into the PTE.

But x86 doesn't update ptes for PKEYs, that would be just too expensive.
You could use standard mprotect to do the same...
 
> However with HPT on power, these keys do not necessarily have to be
> programmed into the PTE. We could bypass the Linux Page Table Entry (PTE)
> and instead just program them into the Hash Page Table (HPTE), since
> the MMU does not refer to the PTE but to the HPTE. The last version
> of the patchset attempted to do that. It worked as follows:
> 
> a) when an address range is requested to be associated with a key by the
>    application through the key_mprotect() system call, the kernel
>    stores that key in the vmas corresponding to that address
>    range.
> 
> b) Whenever there is a hash page fault for that address, the fault
>handler reads the key from the VMA and programs the key into the
>HPTE. __hash_page() is the function that does that.

What causes the fault here?

> c) Once the hpte is programmed, the MMU can sense key violations and
>generate key-faults.
> 
> The problem is with step (b). This step is really a very critical
> path which is performance sensitive. We don't want to add any delays.
> However if we want to access the key from the vma, we will have to
> hold the vma semaphore, and that is a big NO-NO. As a result, this
> design had to be dropped.
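The fallback described here, keeping the key in the PTE so the fault path never has to consult the VMA, amounts to bit packing along these lines. The bit positions and macro names below are invented for the sketch; the real ppc64 PTE layout differs.

```c
#include <stdint.h>

/* Illustrative: reserve 5 software bits high in the PTE for the key. */
#define PTE_PKEY_SHIFT	57
#define PTE_PKEY_MASK	(0x1fULL << PTE_PKEY_SHIFT)

/* Store a 5-bit protection key in the PTE, preserving all other bits. */
static uint64_t pte_set_pkey(uint64_t pte, unsigned int pkey)
{
	return (pte & ~PTE_PKEY_MASK) |
	       ((uint64_t)(pkey & 0x1f) << PTE_PKEY_SHIFT);
}

/* Read the key straight from the PTE: no VMA lookup, no semaphore. */
static unsigned int pte_get_pkey(uint64_t pte)
{
	return (pte & PTE_PKEY_MASK) >> PTE_PKEY_SHIFT;
}
```

With the key recoverable from the PTE alone, the hash fault handler can program the HPTE without touching the vma semaphore, which is exactly the constraint that killed the previous design.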
> 
> 
> 
> I reverted back to the old design i.e the design in v4 version. In this
> version we do the following:
> 
> a) when an address range is requested to be associated with a key by the
>    application through the key_mprotect() system call, the kernel
>    stores that key in the vmas corresponding to that address
>    range.

Re: [RFC v5 00/38] powerpc: Memory Protection Keys

2017-07-11 Thread Michal Hocko
On Wed 05-07-17 14:21:37, Ram Pai wrote:
> Memory protection keys enable an application to protect its
> address space from inadvertent access or corruption by
> itself.
> 
> The overall idea:
> 
>  A process allocates a key and associates it with
>  an address range within its address space.
>  The process then can dynamically set read/write
>  permissions on the key without involving the
>  kernel. Any code that violates the permissions
>  of the address space, as defined by its associated
>  key, will receive a segmentation fault.
> 
> This patch series enables the feature on PPC64 HPTE
> platform.
> 
> ISA3.0 section 5.7.13 describes the detailed specifications.

Could you describe the high-level design of this feature in the cover
letter? I have tried to get some idea from the patchset but it was
really far from trivial. Patches are not very well split up (many
helpers are added without their users, etc.).

> 
> Testing:
>   This patch series has passed all the protection key
>   tests available in  the selftests directory.
>   The tests are updated to work on both x86 and powerpc.
> 
> version v5:
>   (1) reverted back to the old design -- store the 
>   key in the pte, instead of bypassing it.
>   The v4 design slowed down the hash page path.

This surprised me a lot but I couldn't find the respective code. Why do
you need to store anything in the pte? My understanding of PKEYs is that
the setup and teardown should be very cheap and so no page tables have
to updated. Or do I just misunderstand what you wrote here?
-- 
Michal Hocko
SUSE Labs


Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE

2017-06-30 Thread Michal Hocko
On Thu 29-06-17 14:45:13, Roman Gushchin wrote:
> On Thu, Jun 29, 2017 at 10:53:57AM +0200, Michal Hocko wrote:
> > On Wed 21-06-17 22:19:15, Roman Gushchin wrote:
> > > We want to limit the number of tasks which have access
> > > to the memory reserves. To ensure progress it's enough
> > > to have one such process at a time.
> > > 
> > > If we need to kill the whole cgroup, let's give access to the
> > > memory reserves only to the first process in the list, which is
> > > (usually) the biggest process.
> > > This will give us good chances that all other processes will be able
> > > to quit without access to the memory reserves.
> > 
> > I don't like this, to be honest. Is there any reason not to go with the
> > reduced memory reserves access for oom victims I was suggesting earlier [1]?
> > 
> > [1] 
> > http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mho...@kernel.org
> 
> I've nothing against your approach. What's the state of this patchset?
> Do you plan to bring it upstream?

Just the specific patch I have linked should be sufficient for what you
need here. The patchset had some issues which I didn't have time to fix
and as such the need for the above patch was not a high priority as
well.
-- 
Michal Hocko
SUSE Labs


Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-29 Thread Michal Hocko
On Wed 21-06-17 22:19:11, Roman Gushchin wrote:
> The oom killer should avoid unnecessary kills. To prevent them, during
> the task list traversal we check for tasks which were previously
> selected as oom victims. If there is such a task, a new victim
> is not selected.
> 
> This approach is sub-optimal (we're doing costly iteration over the task
> list every time) and will not work for the cgroup-aware oom killer.
> 
> We already have the oom_victims counter, which can be effectively used
> for this purpose.

A global counter will not work properly, I am afraid. a) you should
consider the oom domain and do not block oom on unrelated domains and b)
you have no guarantee that the oom victim will terminate reasonably.
That is why we have MMF_OOM_SKIP check in oom_evaluate_task.

I think you should have something similar for your memcg victim selection.
If you see a memcg in the oom hierarchy with oom victims which are alive
and not MMF_OOM_SKIP, you should abort the scanning.
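The suggested abort condition can be modeled as a tiny predicate over the victims seen so far. The types below are simplified stand-ins, not kernel API; the real check would run per task inside the memcg scan.

```c
#include <stddef.h>

struct victim {
	int alive;		/* task has not yet exited */
	int mmf_oom_skip;	/* address space already reaped */
};

enum scan { SCAN_CONTINUE, SCAN_ABORT };

/*
 * If any previously selected victim is still alive and has not been
 * flagged MMF_OOM_SKIP, abort: it may still release memory, so picking
 * a new victim now would risk an unnecessary kill.
 */
static enum scan evaluate_victims(const struct victim *v, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (v[i].alive && !v[i].mmf_oom_skip)
			return SCAN_ABORT;
	return SCAN_CONTINUE;
}
```

Unlike a global counter, this check is naturally scoped to the oom domain being scanned and self-heals once the oom reaper sets MMF_OOM_SKIP on a stuck victim.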
-- 
Michal Hocko
SUSE Labs


Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE

2017-06-29 Thread Michal Hocko
On Wed 21-06-17 22:19:15, Roman Gushchin wrote:
> We want to limit the number of tasks which have access
> to the memory reserves. To ensure progress it's enough
> to have one such process at a time.
> 
> If we need to kill the whole cgroup, let's give access to the
> memory reserves only to the first process in the list, which is
> (usually) the biggest process.
> This will give us good chances that all other processes will be able
> to quit without access to the memory reserves.

I don't like this, to be honest. Is there any reason not to go with the
reduced memory reserves access for oom victims I was suggesting earlier [1]?

[1] 
http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mho...@kernel.org
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: Allow slab_nomerge to be set at build time

2017-06-26 Thread Michal Hocko
On Fri 23-06-17 12:20:25, Kees Cook wrote:
> On Fri, Jun 23, 2017 at 7:06 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Tue 20-06-17 16:09:11, Kees Cook wrote:
> >> Some hardened environments want to build kernels with slab_nomerge
> >> already set (so that they do not depend on remembering to set the kernel
> >> command line option). This is desired to reduce the risk of kernel heap
> >> overflows being able to overwrite objects from merged caches and changes
> >> the requirements for cache layout control, increasing the difficulty of
> >> these attacks. By keeping caches unmerged, these kinds of exploits can
> >> usually only damage objects in the same cache (though the risk to metadata
> >> exploitation is unchanged).
> >
> > Do we really want to have a dedicated config for each hardening specific
> > kernel command line? I believe we have quite a lot of config options
> > already. Can we rather have a CONFIG_HARDENED_CMD_OPTIONS and cover all
> > those defaults there instead?
> 
> There's not been a lot of success with grouped Kconfigs in the past
> (e.g. CONFIG_EXPERIMENTAL), but one thing that has been suggested is a
> defconfig-like make target that would collect all the things together.

Which wouldn't reduce the number of config options, would it? I don't
know, but is there any usecase when somebody wants to have a hardened
kernel and still wants to have different defaults than you are
suggesting?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: Allow slab_nomerge to be set at build time

2017-06-23 Thread Michal Hocko
On Tue 20-06-17 16:09:11, Kees Cook wrote:
> Some hardened environments want to build kernels with slab_nomerge
> already set (so that they do not depend on remembering to set the kernel
> command line option). This is desired to reduce the risk of kernel heap
> overflows being able to overwrite objects from merged caches and changes
> the requirements for cache layout control, increasing the difficulty of
> these attacks. By keeping caches unmerged, these kinds of exploits can
> usually only damage objects in the same cache (though the risk to metadata
> exploitation is unchanged).

Do we really want to have a dedicated config for each hardening specific
kernel command line? I believe we have quite a lot of config options
already. Can we rather have a CONFIG_HARDENED_CMD_OPTIONS and cover all
those defaults there instead?
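For reference, the merge decision being toggled here reduces to a predicate of roughly this shape. It is heavily simplified: the kernel's find_mergeable() also checks constructors, flags such as SLAB_TYPESAFE_BY_RCU, and the debug options mentioned in the patch text.

```c
#include <stddef.h>

/* Minimal stand-in for a slab cache descriptor. */
struct cache {
	size_t size;
	size_t align;
	unsigned int flags;
};

/*
 * Two caches may share backing storage only when merging is allowed at
 * all (slab_nomerge clear) and their layout-relevant properties match.
 * With slab_nomerge set, a heap overflow in one cache cannot reach
 * objects that would otherwise have been co-located from another cache.
 */
static int caches_mergeable(int slab_nomerge,
			    const struct cache *a, const struct cache *b)
{
	if (slab_nomerge)
		return 0;
	return a->size == b->size && a->align == b->align &&
	       a->flags == b->flags;
}
```

The proposed CONFIG_SLAB_MERGE_DEFAULT merely flips the initial value of the `slab_nomerge` input to this decision at build time instead of via the command line.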

> Cc: Daniel Micay <danielmi...@gmail.com>
> Cc: David Windsor <d...@nullcore.net>
> Cc: Eric Biggers <ebigge...@gmail.com>
> Signed-off-by: Kees Cook <keesc...@chromium.org>
> ---
> v2: split out of slab whitelisting series
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 10 ++++++++--
>  init/Kconfig                                    | 14 ++++++++++++++
>  mm/slab_common.c                                |  5 ++---
>  3 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 7737ab5d04b2..94d8b8195cb8 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3715,8 +3715,14 @@
>  	slab_nomerge	[MM]
>  			Disable merging of slabs with similar size. May be
>  			necessary if there is some reason to distinguish
> -			allocs to different slabs. Debug options disable
> -			merging on their own.
> +			allocs to different slabs, especially in hardened
> +			environments where the risk of heap overflows and
> +			layout control by attackers can usually be
> +			frustrated by disabling merging. This will reduce
> +			most of the exposure of a heap attack to a single
> +			cache (risks via metadata attacks are mostly
> +			unchanged). Debug options disable merging on their
> +			own.
>  			For more information see Documentation/vm/slub.txt.
>  
>   slab_max_order= [MM, SLAB]
> diff --git a/init/Kconfig b/init/Kconfig
> index 1d3475fc9496..ce813acf2f4f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1891,6 +1891,20 @@ config SLOB
>  
>  endchoice
>  
> +config SLAB_MERGE_DEFAULT
> +	bool "Allow slab caches to be merged"
> +	default y
> +	help
> +	  For reduced kernel memory fragmentation, slab caches can be
> +	  merged when they share the same size and other characteristics.
> +	  This carries a risk of kernel heap overflows being able to
> +	  overwrite objects from merged caches (and more easily control
> +	  cache layout), which makes such heap attacks easier to exploit
> +	  by attackers. By keeping caches unmerged, these kinds of exploits
> +	  can usually only damage objects in the same cache. To disable
> +	  merging at runtime, "slab_nomerge" can be passed on the kernel
> +	  command line.
> +
>  config SLAB_FREELIST_RANDOM
>  	default n
>  	depends on SLAB || SLUB
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 01a0fe2eb332..904a83be82de 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -47,13 +47,12 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
>  
>  /*
>   * Merge control. If this is set then no merging of slab caches will occur.
> - * (Could be removed. This was introduced to pacify the merge skeptics.)
>   */
> -static int slab_nomerge;
> +static bool slab_nomerge = !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT);
>  
>  static int __init setup_slab_nomerge(char *str)
>  {
> -	slab_nomerge = 1;
> +	slab_nomerge = true;
>  	return 1;
>  }
>  
> -- 
> 2.7.4
> 
> 
> -- 
> Kees Cook
> Pixel Security
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-06-05 Thread Michal Hocko
On Fri 02-06-17 16:18:52, Roman Gushchin wrote:
> On Fri, Jun 02, 2017 at 10:43:33AM +0200, Michal Hocko wrote:
> > On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> > > On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > > > > > +   /*
> > > > > >  * If current has a pending SIGKILL or is exiting, then 
> > > > > > automatically
> > > > > >  * select it.  The goal is to allow it to allocate so that it 
> > > > > > may
> > > > > >  * quickly exit and free its memory.
> > > > > > 
> > > > > > Please note that I haven't explored how much of the infrastructure
> > > > > > needed for the OOM decision making is available to modules. But we 
> > > > > > can
> > > > > > export a lot of what we currently have in oom_kill.c. I admit it 
> > > > > > might
> > > > > > turn out that this is simply not feasible but I would like this to 
> > > > > > be at
> > > > > > least explored before we go and implement yet another hardcoded way 
> > > > > > to
> > > > > > handle (see how I didn't use policy ;)) OOM situation.
> > > > > 
> > > > > ;)
> > > > > 
> > > > > My doubt here is mainly that we'll see many (or any) real-life cases
> > > > > materialize that cannot be handled with cgroups and scoring. These are
> > > > > powerful building blocks on which userspace can implement all kinds of
> > > > > policy and sorting algorithms.
> > > > > 
> > > > > So this seems like a lot of churn and complicated code to handle one
> > > > > extension. An extension that implements basic functionality.
> > > > 
> > > > Well, as I've said I didn't get to explore this path so I have only a
> > > > very vague idea what we would have to export to implement e.g. the
> > > > proposed oom killing strategy suggested in this thread. Unfortunatelly I
> > > > do not have much time for that. I do not want to block a useful work
> > > > which you have a usecase for but I would be really happy if we could
> > > > consider longer term plans before diving into a "hardcoded"
> > > > implementation. We didn't do that previously and we are left with
> > > > oom_kill_allocating_task and similar one off things.
> > > 
> > > As I understand it, killing the allocating task was simply the default
> > > before the OOM killer and was added as a compat knob. I really doubt
> > > anybody is using it at this point, and we could probably delete it.
> > 
> > I might misremember but my recollection is that SGI simply had too
> > large machines with too many processes and so the task selection was
> > very expensive.
> 
> The cgroup-aware OOM killer can be much better in the case of a large
> number of processes, as we don't have to iterate over all processes
> locking each mm, and can select an appropriate cgroup based mostly on
> lockless counters. Of course, it depends on the concrete setup, but it
> can be much more efficient under the right circumstances.

Yes, I agree with that.

> > > I appreciate your concern of being too short-sighted here, but the
> > > fact that I cannot point to more usecases isn't for lack of trying. I
> > > simply don't see the endless possibilities of usecases that you do.
> > > 
> > > It's unlikely for more types of memory domains to pop up besides MMs
> > > and cgroups. (I mentioned vmas, but that just seems esoteric. And we
> > > have panic_on_oom for whole-system death. What else could there be?)
> > > 
> > > And as I pointed out, there is no real evidence that the current
> > > system for configuring preferences isn't sufficient in practice.
> > > 
> > > That's my thoughts on exploring. I'm not sure what else to do before
> > > it feels like running off into fairly contrived hypotheticals.
> > 
> > Yes, I do not want hypotheticals to block an otherwise useful feature,
> > of course. But I haven't heard a strong argument why a module based
> > approach would be a more maintenance burden longterm. From a very quick
> > glance over patches Roman has posted yesterday it seems that a large
> > part of the existing oom infrastructure can be reused reasonably.
> 
> I have nothing against a module based approach, but I don't think that a
> module should implement anything other than the oom score calculation.

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-06-02 Thread Michal Hocko
On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> > > Everything the user would want to dynamically program in the kernel,
> > > say with bpf, they could do in userspace and then update the scores
> > > for each group and task periodically.
> > 
> > I am rather skeptical about dynamic scores. oom_{score_}adj has turned
> > to mere oom disable/enable knobs from my experience.
> 
> That doesn't necessarily have to be a deficiency with the scoring
> system. I suspect that most people simply don't care as long as
> the picks for OOM victims aren't entirely stupid.
> 
> For example, we have a lot of machines that run one class of job. If
> we run OOM there isn't much preference we'd need to express; just kill
> one job - the biggest, whatever - and move on. (The biggest makes
> sense because if all jobs are basically equal it's as good as any
> other victim, but if one has a runaway bug it goes for that.)
> 
> Where we have more than one job class, it actually is mostly one hipri
> and one lopri, in which case setting a hard limit on the lopri or the
> -1000 OOM score trick is enough.
> 
> How many systems run more than two clearly distinguishable classes of
> workloads concurrently?

What about those which run different containers on a large physical
machine?

> I'm sure they exist. I'm just saying it doesn't surprise me that
> elaborate OOM scoring isn't all that wide-spread.
> 
> > > The only limitation is that you have to recalculate and update the
> > > scoring tree every once in a while, whereas a bpf program could
> > > evaluate things just-in-time. But for that to matter in practice, OOM
> > > kills would have to be a fairly hot path.
> > 
> > I am not really sure how to reliably implement "kill the memcg with the
> > largest process" strategy. And who knows how many others strategies will
> > pop out.
> 
> That seems fairly contrived.
> 
> What does it mean to divide memory into subdomains, but when you run
> out of physical memory you kill based on biggest task?

Well, the biggest task might be the runaway one and so killing it first
before you kill other innocent ones makes some sense to me.

> Sure, it frees memory and gets the system going again, so it's as good
> as any answer to overcommit gone wrong, I guess. But is that something
> you'd intentionally want to express from a userspace perspective?
> 
[...]
> > > > Maybe. But that requires somebody to tweak the scoring which can be hard
> > > > from trivial.
> > > 
> > > Why is sorting and picking in userspace harder than sorting and
> > > picking in the kernel?
> > 
> > Because the userspace score based approach would be much more racy
> > especially in the busy system. This could lead to unexpected behavior
> > when OOM killer would kill a different than a run-away memcgs.
> 
> How would it be easier to weigh priority against runaway detection
> inside the kernel?

You have better chances to catch such a process at the time of the OOM
because you do the check at the time of the OOM rather than sometimes
back in time when your monitor was able to run and check all the
existing processes (which alone can be rather time consuming so you do
not want to do that very often).

> > > > +   /*
> > > >  * If current has a pending SIGKILL or is exiting, then automatically
> > > >  * select it.  The goal is to allow it to allocate so that it may
> > > >  * quickly exit and free its memory.
> > > > 
> > > > Please note that I haven't explored how much of the infrastructure
> > > > needed for the OOM decision making is available to modules. But we can
> > > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > > turn out that this is simply not feasible but I would like this to be at
> > > > least explored before we go and implement yet another hardcoded way to
> > > > handle (see how I didn't use policy ;)) OOM situation.
> > > 
> > > ;)
> > > 
> > > My doubt here is mainly that we'll see many (or any) real-life cases
> > > materialize that cannot be handled with cgroups and scoring. These are
> > > powerful building blocks on which userspace can implement all kinds of
> > > policy and sorting algorithms.
> > > 
> > > So this seems like a lot of churn and complicated code to handle one
> > > extension. An extensi

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-25 Thread Michal Hocko
On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
[...]
> > > How to react to an OOM is definitely a policy, which depends
> > > on the workload. Nothing is changing here from how it's working now,
> > > except that now the kernel will choose a victim cgroup, and kill the
> > > victim cgroup rather than a process.
> > 
> > There is a _big_ difference. The current implementation just tries
> > to recover from the OOM situation without caring much about the
> > consequences for the workload. This is the last resort and a service for
> > the _system_ to get back to a sane state. You are trying to make it more
> > clever and workload aware and that is inevitably going to depend on the
> > specific workload. I really do think we cannot simply hardcode any
> > policy into the kernel for this purpose and that is why I would like to
> > see a discussion about how to do that in a more extensible way. This
> > might be harder to implement now but I believe it will turn out
> > better long term.
> 
> And that's where I still maintain that this isn't really a policy
> change. Because what this code does ISN'T more clever, and the OOM
> killer STILL IS a last-resort thing.

The thing I wanted to point out is that what and how much to kill
definitely depends on the usecase. We currently kill all tasks which
share the mm struct because that is the smallest unit that can unpin
user memory. And that makes a lot of sense to me as a general default.
I would call any attempt to guess which tasks belong to the same
workload/job "more clever".

> We don't need any elaborate
> just-in-time evaluation of what each entity is worth. We just want to
> kill the biggest job, not the biggest MM. Just like you wouldn't want
> just the biggest VMA unmapped and freed, since it leaves your process
> incoherent, killing one process leaves a job incoherent.
> 
> I understand that making it fully configurable is a tempting thought,
> because you'd offload all responsibility to userspace.

It is not only tempting, it is also the only place where a more advanced
OOM semantic can be defined sanely IMHO.

> But on the
> other hand, this was brought up years ago and nothing has happened
> since. And to me this is evidence that nobody really cares all that
> much. Because it's still a rather rare event, and there isn't much you
> cannot accomplish with periodic score adjustments.

Yes and there were no attempts since then which suggests that people
didn't care all that much. Maybe things have changed now that containers
got much more popular.

> > > > And both kinds of workloads (services/applications and individual
> > > > processes run by users) can co-exist on the same host - consider the
> > > > default systemd setup, for instance.
> > > > 
> > > > IMHO it would be better to give users a choice regarding what they
> > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > cgroup or one of its descendants. For example, we could introduce a
> > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > 
> > > The last thing we want to do is to compare processes with cgroups.
> > > I agree that we can have some option to disable the cgroup-aware OOM
> > > altogether, mostly for backward compatibility. But I don't think it should be a
> > > per-cgroup configuration option, which we will support forever.
> > 
> > I can clearly see a demand for "this is definitely more important
> > container than others so do not kill" usecases. I can also see demand
> > for "do not kill this container running for X days". And more are likely
> > to pop out.
> 
> That can all be done with scoring.

Maybe. But that requires somebody to tweak the scoring which can be far
from trivial.
 
> In fact, we HAD the oom killer consider a target's cputime/runtime
> before, and David replaced it all with simple scoring in a63d83f427fb
> ("oom: badness heuristic rewrite").

Yes, that is correct and I agree that this was definitely a step in the
right direction because time-based heuristics tend to behave very
unpredictably in general workloads.

> This was 10 years ago, and nobody has missed anything critical enough
> to implement something beyond scoring. So I don't see

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-23 Thread Michal Hocko
On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> > Hello Roman,
> 
> Hi Vladimir!
> 
> > 
> > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> > ...
> > > +5-2-4. Cgroup-aware OOM Killer
> > > +
> > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > +It means that it treats memory cgroups as memory consumers
> > > +rather than individual processes. Under OOM conditions it tries
> > > +to find an eligible leaf memory cgroup, and kill all processes
> > > +in this cgroup. If it's not possible (e.g. all processes belong
> > > +to the root cgroup), it falls back to the traditional per-process
> > > +behaviour.
> > 
> > I agree that the current OOM victim selection algorithm is totally
> > unfair in a system using containers and it has been crying for rework
> > for the last few years now, so it's great to see this finally coming.
> > 
> > However, I don't reckon that killing a whole leaf cgroup is always the
> > best practice. It does make sense when cgroups are used for
> > containerizing services or applications, because a service is unlikely
> > to remain operational after one of its processes is gone, but one can
> > also use cgroups to containerize processes started by a user. Kicking a
> > user out because one of her processes has gone mad doesn't sound right to me.
> 
> I agree that it's not always the best practice if you're not allowed
> to change the cgroup configuration (e.g. create new cgroups).
> IMHO, this case is mostly covered by using the v1 cgroup interface,
> which remains unchanged.

But there are features which are v2 only and users might really want to
use them. So I really do not buy this v2-only argument.

> If you do have control over cgroups, you can put processes into
> separate cgroups, and obtain control over OOM victim selection and killing.

Usually you do not have that control because there is a global daemon
doing the placement for you.

> > Another example when the policy you're suggesting fails in my opinion is
> > in case a service (cgroup) consists of sub-services (sub-cgroups) that
> > run processes. The main service may stop working normally if one of its
> > sub-services is killed. So it might make sense to kill not just an
> > individual process or a leaf cgroup, but the whole main service with all
> > its sub-services.
> 
> I agree, although I do not pretend for solving all possible
> userspace problems caused by an OOM.
> 
> How to react to an OOM is definitely a policy, which depends
> on the workload. Nothing is changing here from how it's working now,
> except that now the kernel will choose a victim cgroup, and kill the
> victim cgroup rather than a process.

There is a _big_ difference. The current implementation just tries
to recover from the OOM situation without caring much about the
consequences for the workload. This is the last resort and a service for
the _system_ to get back to a sane state. You are trying to make it more
clever and workload aware and that is inevitably going to depend on the
specific workload. I really do think we cannot simply hardcode any
policy into the kernel for this purpose and that is why I would like to
see a discussion about how to do that in a more extensible way. This
might be harder to implement now but I believe it will turn out
better long term.

> > And both kinds of workloads (services/applications and individual
> > processes run by users) can co-exist on the same host - consider the
> > default systemd setup, for instance.
> > 
> > IMHO it would be better to give users a choice regarding what they
> > really want for a particular cgroup in case of OOM - killing the whole
> > cgroup or one of its descendants. For example, we could introduce a
> > per-cgroup flag that would tell the kernel whether the cgroup can
> > tolerate killing a descendant or not. If it can, the kernel will pick
> > the fattest sub-cgroup or process and check it. If it cannot, it will
> > kill the whole cgroup and all its processes and sub-cgroups.
> 
> The last thing we want to do is to compare processes with cgroups.
> I agree that we can have some option to disable the cgroup-aware OOM altogether,
> mostly for backward compatibility. But I don't think it should be a
> per-cgroup configuration option, which we will support forever.

I can clearly see a demand for "this is definitely more important
container than others so do not kill" usecases. I can also see demand
for "do not kill this container running for X days". And more are likely
to pop out.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mm: per-cgroup memory reclaim stats

2017-05-19 Thread Michal Hocko
On Thu 11-05-17 20:16:23, Roman Gushchin wrote:
> Track the following reclaim counters for every memory cgroup:
> PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and
> PGLAZYFREED.
> 
> These values are exposed using the memory.stats interface of cgroup v2.
> 
> The meaning of each value is the same as for global counters,
> available using /proc/vmstat.
> 
> Also, for consistency, rename mem_cgroup_count_vm_event() to
> count_memcg_event_mm().
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Suggested-by: Johannes Weiner <han...@cmpxchg.org>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Li Zefan <lize...@huawei.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: Michal Hocko <mho...@suse.com>

Thanks!

> ---
>  Documentation/cgroup-v2.txt | 28 ++
>  fs/dax.c|  2 +-
>  fs/ncpfs/mmap.c |  2 +-
>  include/linux/memcontrol.h  | 48 ++---
>  mm/filemap.c|  2 +-
>  mm/memcontrol.c | 10 ++
>  mm/memory.c |  4 ++--
>  mm/shmem.c  |  3 +--
>  mm/swap.c   |  1 +
>  mm/vmscan.c | 30 +---
>  10 files changed, 113 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index e50b95c..5804355 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -918,6 +918,34 @@ PAGE_SIZE multiple when read back.
>  
>   Number of major page faults incurred
>  
> +   pgrefill
> +
> + Amount of scanned pages (in an active LRU list)
> +
> +   pgscan
> +
> + Amount of scanned pages (in an inactive LRU list)
> +
> +   pgsteal
> +
> + Amount of reclaimed pages
> +
> +   pgactivate
> +
> + Amount of pages moved to the active LRU list
> +
> +   pgdeactivate
> +
> + Amount of pages moved to the inactive LRU list
> +
> +   pglazyfree
> +
> + Amount of pages postponed to be freed under memory pressure
> +
> +   pglazyfreed
> +
> + Amount of reclaimed lazyfree pages
> +
>memory.swap.current
>  
>   A read-only single value file which exists on non-root
> diff --git a/fs/dax.c b/fs/dax.c
> index 66d7906..9aac521d 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1230,7 +1230,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>   case IOMAP_MAPPED:
>   if (iomap.flags & IOMAP_F_NEW) {
>   count_vm_event(PGMAJFAULT);
> - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
> + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
>   major = VM_FAULT_MAJOR;
>   }
>   error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
> diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
> index 0c3905e..6719c0b 100644
> --- a/fs/ncpfs/mmap.c
> +++ b/fs/ncpfs/mmap.c
> @@ -89,7 +89,7 @@ static int ncp_file_mmap_fault(struct vm_fault *vmf)
>* -- nyc
>*/
>   count_vm_event(PGMAJFAULT);
> - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
> + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
>   return VM_FAULT_MAJOR;
>  }
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 899949b..b2a5b1c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -357,6 +357,17 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
>  }
>  struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
>  
> +static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec)
> +{
> + struct mem_cgroup_per_node *mz;
> +
> + if (mem_cgroup_disabled())
> + return NULL;
> +
> + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> + return mz->memcg;
> +}
> +
>  /**
>   * parent_mem_cgroup - find the accounting parent of a memcg
>   * @memcg: memcg whose parent to find
> @@ -546,8 +557,23 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>   gfp_t gfp_mask,
>   unsigned long *total_scan

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-19 Thread Michal Hocko
On Thu 18-05-17 14:11:17, Johannes Weiner wrote:
> On Thu, May 18, 2017 at 07:30:04PM +0200, Michal Hocko wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> > > Traditionally, the OOM killer is operating on a process level.
> > > Under oom conditions, it finds a process with the highest oom score
> > > and kills it.
> > > 
> > > This behavior doesn't suit systems with many running
> > > containers well. There are two main issues:
> > > 
> > > 1) There is no fairness between containers. A small container with
> > > a few large processes will be chosen over a large one with a huge
> > > number of small processes.
> > > 
> > > 2) Containers often do not expect that some random process inside
> > > will be killed. So, in general, a much safer behavior is
> > > to kill the whole cgroup. Traditionally, this was implemented
> > > in userspace, but doing it in the kernel has some advantages,
> > > especially in the case of a system-wide OOM.
> > > 
> > > To address these issues, a cgroup-aware OOM killer is introduced.
> > > Under OOM conditions, it looks for a memcg with the highest oom score,
> > > and kills all processes inside.
> > > 
> > > Memcg oom score is calculated as the total size of the active and
> > > inactive anon LRU lists, the unevictable LRU list, and swap.
> > > 
> > > For a cgroup-wide OOM, only cgroups belonging to the subtree of
> > > the OOMing cgroup are considered.
> > 
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf like
> > mechanisms. Could you explore that path?
> 
> OOM policy is an orthogonal discussion, though.
> 
> The OOM killer's job is to pick a memory consumer to kill. Per default
> the unit of the memory consumer is a process, but cgroups allow
> grouping processes into compound consumers. Extending the OOM killer
> to respect the new definition of "consumer" is not a new policy.

I do not want to play word games here but picking a task or more tasks
is a policy from my POV but that is not all that important. My primary
point is that this new "implementation" is most probably not what people
who use memory cgroups outside of containers want. Why? Mostly because
they do not care that only a part of the memcg is still alive pretty
much like the current global OOM behavior when a single task (or its
children) is gone all of a sudden. Why should I kill the whole user
slice just because one of its processes went wild?
 
> I don't think it's reasonable to ask the person who's trying to make
> the OOM killer support group-consumers to design a dynamic OOM policy
> framework instead.
> 
> All we want is the OOM policy, whatever it is, applied to cgroups.

And I am not dismissing this usecase. I believe it is valid but not
universally applicable when memory cgroups are deployed. That is why
I think that we need a way to define those policies in some sane way.
Our current oom policies are basically random -
/proc/sys/vm/oom_kill_allocating_task resp. /proc/sys/vm/panic_on_oom.

I am not really sure we want another hardcoded one e.g.
/proc/sys/vm/oom_kill_container because even that might turn out not to be a
great fit for different container usecases. Do we want to kill the
largest container or the one with the largest memory hog? Should some
containers have a higher priority over others? I am pretty sure more
criterion would pop up with more usecases.

That's why I think that the current OOM killer implementation should
stay as a last resort and be process oriented and we should think about
a way to override it for particular usecases. The exact mechanism is not
completely clear to me to be honest.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-18 Thread Michal Hocko
On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit systems with many running
> containers well. There are two main issues:
> 
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. So, in general, a much safer behavior is
> to kill the whole cgroup. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
> 
> To address these issues, a cgroup-aware OOM killer is introduced.
> Under OOM conditions, it looks for a memcg with the highest oom score,
> and kills all processes inside.
> 
> Memcg oom score is calculated as the total size of the active and
> inactive anon LRU lists, the unevictable LRU list, and swap.
> 
> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> the OOMing cgroup are considered.

While this might make sense for some workloads/setups it is not a
generally acceptable policy IMHO. We have discussed that different OOM
policies might be interesting few years back at LSFMM but there was no
real consensus on how to do that. One possibility was to allow bpf like
mechanisms. Could you explore that path?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, docs: update memory.stat description with workingset* entries

2017-05-16 Thread Michal Hocko
On Thu 11-05-17 20:18:13, Roman Gushchin wrote:
> Commit 4b4cea91691d ("mm: vmscan: fix IO/refault regression in
> cache workingset transition") introduced three new entries in the memory
> stat file:
>  - workingset_refault,
>  - workingset_activate,
>  - workingset_nodereclaim.
> 
> This commit adds a corresponding description to the cgroup v2 docs.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Li Zefan <lize...@huawei.com>
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org

Missed this one. Sorry about that
Acked-by: Michal Hocko <mho...@suse.com>

> ---
>  Documentation/cgroup-v2.txt | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index e50b95c..dc5e2dc 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -918,6 +918,18 @@ PAGE_SIZE multiple when read back.
>  
>   Number of major page faults incurred
>  
> +   workingset_refault
> +
> + Number of refaults of previously evicted pages
> +
> +   workingset_activate
> +
> + Number of refaulted pages that were immediately activated
> +
> +   workingset_nodereclaim
> +
> +     Number of times a shadow node has been reclaimed
> +
>memory.swap.current
>  
>   A read-only single value file which exists on non-root
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc//totmaps)

2016-09-30 Thread Michal Hocko
[CC Mike and Mel as they have seen some accounting oddities
 when doing performance testing. They can share details but
 essentially the system time just gets too high]

For your reference the email thread started
http://lkml.kernel.org/r/20160823143330.gl23...@dhcp22.suse.cz

I suspect this is mainly for short lived processes - like kernel compile
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.45
System time (seconds): 0.82
Percent of CPU this job got: 111%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.47
System time (seconds): 1.55
Percent of CPU this job got: 107%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

This is quite unexpected for a cache hot compile. I would expect most of
the time being spent in userspace.

$ perf report | grep kernel.vmlinux
 2.01%  as[kernel.vmlinux] [k] page_fault
 0.59%  cc1   [kernel.vmlinux] [k] page_fault
 0.15%  git   [kernel.vmlinux] [k] page_fault
 0.12%  bash  [kernel.vmlinux] [k] page_fault
 0.11%  sh[kernel.vmlinux] [k] page_fault
 0.08%  gcc   [kernel.vmlinux] [k] page_fault
 0.06%  make  [kernel.vmlinux] [k] page_fault
 0.04%  rm[kernel.vmlinux] [k] page_fault
 0.03%  ld[kernel.vmlinux] [k] page_fault
 0.02%  bash  [kernel.vmlinux] [k] entry_SYSCALL_64
 0.01%  git   [kernel.vmlinux] [k] entry_SYSCALL_64
 0.01%  cat   [kernel.vmlinux] [k] page_fault
 0.01%  collect2  [kernel.vmlinux] [k] page_fault
 0.00%  sh[kernel.vmlinux] [k] entry_SYSCALL_64
 0.00%  rm[kernel.vmlinux] [k] entry_SYSCALL_64
 0.00%  grep  [kernel.vmlinux] [k] page_fault

doesn't show anything unexpected.

Original Rik's reply follows:

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
[...]
> > OK, so it seems I found it. I was quite lucky because
> > account_user_time
> > is not all that popular function and there were basically no changes
> > besides Riks ff9a9b4c4334 ("sched, time: Switch
> > VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression.
> > Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> > 
> > And just to give Rik more context. While debugging overhead of the
> > /proc//smaps I am getting a misleading output from /usr/bin/time
> > -v
> > (source for ./max_mmap is [1])
> > 
> > root@test1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@test1:~# ./max_map 
> > pid:2990 maps:65515
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> > User time (seconds): 0.00
> > System time (seconds): 0.45
> > Percent of CPU this job got: 98%
> > 
> 
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> > User time (seconds): 0.18
> > System time (seconds): 0.29
> > Percent of CPU this job got: 97%
> 
> The patch in question makes user and system
> time accounting essentially tick-based. If
> jiffies changes while the task is in user
> mode, time gets accounted as user time, if
> jiffies changes while the task is in system
> mode, time gets accounted as system time.
> 
> If you get "unlucky", with a job like the
> above, it is possible all time gets accounted
> to system time.
> 
> This would be true both with the system running
> with a periodic timer tick (before and after my
> patch is applied), and in nohz_idle mode (after
> my patch).
> 
> However, it does seem quite unlikely that you
> get zero user time, since you have 125 timer
> ticks in half a second. Furthermore, you do not
> even have NO_HZ_FULL enabled...
> 
> Does the workload consistently get zero user
> time?
> 
> If so, we need to dig further to see under
> what precise circumstances that happens.
> 
> On my laptop, with kernel 4.6.3-300.fc24.x86_6

Re: [PATCH v5 0/3] mm, proc: Implement /proc//totmaps

2016-09-19 Thread Michal Hocko
On Mon 19-09-16 11:16:31, Robert Foss wrote:
> On 2016-09-14 05:12 AM, Michal Hocko wrote:
> > On Tue 13-09-16 13:27:39, Sonny Rao wrote:
[...]
> > > Given that smaps
> > > doesn't provide this in a straightforward way, what do you think is
> > > the right way to provide this information?
> > 
> > I would be tempted to sneak it into /proc//statm because that looks
> > like a proper place but getting this information is not for free
> > performance wise, so I am not really sure whether something that relies on
> > this file would see unexpected stalls. Maybe this could be worked around by
> > some caching... I would suggest checking who is actually using this file
> > (top/ps etc...)
> 
> What would this caching look like? Can any information be re-used between
> vma walks?

Yes, basically return the same value if called again within one HZ
interval or something similar. But that assumes that statm latency
really matters and that it is called often enough.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 0/3] mm, proc: Implement /proc//totmaps

2016-09-14 Thread Michal Hocko
On Tue 13-09-16 13:27:39, Sonny Rao wrote:
> On Tue, Sep 13, 2016 at 12:12 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Mon 12-09-16 10:28:53, Sonny Rao wrote:
> >> On Mon, Sep 12, 2016 at 10:15 AM, Michal Hocko <mho...@kernel.org> wrote:
> >> > On Mon 12-09-16 08:31:36, Sonny Rao wrote:
> > [...]
> >> >> but how about the other fields like Swap, Private_Dirty and
> >> >> Private_Shared?
> >> >
> >> > Private_Shared can be pretty confusing as well without the whole context
> >> > as well see my other emails in the original thread (just to remind
> >> > shmem/tmpfs makes all this really confusing).
> >>
> >> But this is exactly the issue -- RSS can be just as confusing if
> >> you don't know something about the application.
> >
> > I agree that rss can be confusing but we will not make the situation any
> > better if we add yet another confusing metric.
> >
> >> I think the issue is
> >> how common that situation is, and you seem to believe that it's so
> >> uncommon that it's actually better to keep the information more
> >> difficult to get for those of us who know something about our systems.
> >>
> >> That's fine, I guess we just have to disagree here, thanks for looking at
> >> this.
> >
> > I think you should just step back and think more about what exactly
> > you expect from the counter(s). I believe what you want is an
> > estimate of a freeable memory when the particular process dies or is
> > killed. That would mean resident single mapped private anonymous memory
> > + unlinked single mapped shareable mappings + single mapped swapped out
> > memory. Maybe I've missed something but it should be something along
> > those lines. Definitely something that the current smaps infrastructure
> > doesn't give you, though.
> 
> Yes your description of what we want is pretty good.  Having a
> reasonable lower bound on the estimate is fine, though we probably
> want to break out swapped out memory separately.

Why would you want to separate that?

> Given that smaps
> doesn't provide this in a straightforward way, what do you think is
> the right way to provide this information?

I would be tempted to sneak it into /proc//statm because that looks
like a proper place but getting this information is not for free
performance wise, so I am not really sure whether something that relies on
this file would see unexpected stalls. Maybe this could be worked around by
some caching... I would suggest checking who is actually using this file
(top/ps etc...)

If this would be unacceptable then a new file could be considered.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 0/3] mm, proc: Implement /proc//totmaps

2016-09-13 Thread Michal Hocko
On Mon 12-09-16 10:28:53, Sonny Rao wrote:
> On Mon, Sep 12, 2016 at 10:15 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Mon 12-09-16 08:31:36, Sonny Rao wrote:
[...]
> >> but how about the other fields like Swap, Private_Dirty and
> >> Private_Shared?
> >
> > Private_Shared can be pretty confusing as well without the whole context
> > as well see my other emails in the original thread (just to remind
> > shmem/tmpfs makes all this really confusing).
> 
> But this is exactly the issue -- RSS can be just as confusing if
> you don't know something about the application.

I agree that rss can be confusing but we will not make the situation any
better if we add yet another confusing metric.

> I think the issue is
> how common that situation is, and you seem to believe that it's so
> uncommon that it's actually better to keep the information more
> difficult to get for those of us who know something about our systems.
> 
> That's fine, I guess we just have to disagree here, thanks for looking at this.

I think you should just step back and think more about what exactly
you expect from the counter(s). I believe what you want is an
estimate of a freeable memory when the particular process dies or is
killed. That would mean resident single mapped private anonymous memory
+ unlinked single mapped shareable mappings + single mapped swapped out
memory. Maybe I've missed something but it should be something along
those lines. Definitely something that the current smaps infrastructure
doesn't give you, though.
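For illustration, a crude stand-in for such an estimate can be assembled from the counters smaps exports today. This is a sketch only, not the quantity described above: Private_* cannot tell unlinked shareable mappings apart from ordinary file-backed ones, and tmpfs pages counted as private will not actually be freed on exit.

```shell
# Rough lower bound on memory released when a process exits, summed from
# existing smaps counters. Over-counts private tmpfs pages (not freed on
# unmap) and cannot see unlinked shareable mappings, as noted above.
pid=${1:-self}
awk '/^Private_Clean:|^Private_Dirty:|^Swap:/ { sum += $2 }
     END { printf "~%d kB potentially freeable\n", sum }' "/proc/$pid/smaps"
```

Run without an argument it inspects the invoking shell itself via /proc/self.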
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 0/3] mm, proc: Implement /proc//totmaps

2016-09-12 Thread Michal Hocko
On Mon 12-09-16 08:31:36, Sonny Rao wrote:
> On Mon, Sep 12, 2016 at 5:02 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Mon 05-09-16 16:14:06, robert.f...@collabora.com wrote:
> >> From: Robert Foss <robert.f...@collabora.com>
> >>
> >> This series provides the /proc/PID/totmaps feature, which
> >> summarizes the information provided by /proc/PID/smaps for
> >> improved performance and usability reasons.
> >>
> >> A use case is to speed up monitoring of memory consumption in
> >> environments where RSS isn't precise.
> >>
> >> For example Chrome tends to have many processes which have hundreds of VMAs
> >> with a substantial amount of shared memory, and the error of using
> >> RSS rather than PSS tends to be very large when looking at overall
> >> memory consumption.  PSS isn't kept as a single number that's exported
> >> like RSS, so to calculate PSS means having to parse a very large smaps
> >> file.
> >>
> >> This process is slow and has to be repeated for many processes, and we
> >> found that just the act of doing the parsing was taking up a
> >> significant amount of CPU time, so this patch is an attempt to make
> >> that process cheaper.
> >
> > I still maintain my concerns about a single pss value. It might work in
> > a very specific situations where the consumer knows what is shared but
> > other than that the value can be more misleading than helpful. So a NACK
> > from me until I am shown that this is usable in general and still
> > helpful.
> 
> I know you think Pss isn't useful in general (though I'll point out
> two other independent people said they found it useful)

sure, and one of them admitted that the value is useful because they
_know_ the resource. The other was too vague for me to understand
all the details. Please try to understand that once you provide a user
API it will be carved in stone. If the interface is poor and ambiguous
it will bite us later. One very specific usecase doesn't justify
something that might be really misleading for 90% of cases.

> but how about the other fields like Swap, Private_Dirty and
> Private_Shared?

Private_Shared can be pretty confusing without the whole context as
well; see my other emails in the original thread (just as a reminder:
shmem/tmpfs makes all this really confusing).

-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-30 Thread Michal Hocko
On Wed 24-08-16 12:14:06, Marcin Jabrzyk wrote:
[...]
> Sorry to hijack the thread, but I've found it recently
> and I guess it's the best place to present our point.
> We are working at our custom OS based on Linux and we also suffered much
> by /proc//smaps file. As in Chrome we tried to improve our internal
> application memory management polices (Low Memory Killer) using data
> provided by smaps but we failed due to very long time needed for reading
> and parsing properly the file.

I was already questioning Pss and also Private_* for any memory killer
purpose earlier in the thread because cumulative numbers for all
mappings can be really meaningless. Especially when you do not know
about which resource is shared and by whom. Maybe you can describe how
you are using those cumulative numbers for your decisions and prove me
wrong but I simply haven't heard any sound arguments so far. Everything
was just "we know what we are doing in our environment so we know those
resources and therefore those numbers make sense to us". But with all due
respect this is not a reason to add a user visible API into the kernel.

-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-29 Thread Michal Hocko
[Sorry for a late reply, I was busy with other stuff]

On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mho...@kernel.org> wrote:
[...]
> But what about the private_clean and private_dirty?  Surely
> those are more generally useful for calculating a lower bound on
> process memory usage without additional knowledge?

I guess private_clean can be used as a reasonable estimate.
private_dirty less so because it may refer to e.g. tmpfs which is not
mapped by any other process, so no memory would be freed after unmap
without removing the file.

> At the end of the day all of these metrics are approximations, and it
> comes down to how far off the various approximations are and what
> trade offs we are willing to make.
> RSS is the cheapest but the most coarse.

I agree on this part definitely. I also understand that what we provide
currently is quite confusing and not really helpful. But I am afraid
that the accounting is far from trivial to make right for all the
possible cases.

> PSS (with the correct context) and Private data plus swap are much
> better but also more expensive due to the PT walk.

Maybe we can be more clever and do some form of caching. I haven't
thought that through to see how hard that could be. I mean we could
cache some data per mm_struct and invalidate them only after the current
value would get too much out of sync.

> As far as I know, to get anything but RSS we have to go through smaps
> or use memcg.  Swap seems to be available in /proc//status.
> 
> I looked at the "shared" value in /proc//statm but it doesn't
> seem to correlate well with the shared value in smaps -- not sure why?

task_statm() only approximates it as get_mm_counter(mm, MM_FILEPAGES) +
get_mm_counter(mm, MM_SHMEMPAGES), i.e. all the pages accounted to the mm.
If they are not shared by anybody else they would be considered private
by smaps.
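The mismatch can be seen directly; a small sketch comparing the two files for the current shell (field positions per proc(5); statm values are in pages, so they are scaled to kB here):

```shell
# Compare statm's "shared" counter (field 3, in pages) with the smaps
# Shared_* sum for the current shell. statm counts every resident
# file/shmem page as shared, while smaps calls a page private when no
# other process currently maps it -- so the two need not agree.
page_kb=$(( $(getconf PAGESIZE) / 1024 ))
statm_shared_kb=$(( $(awk '{ print $3 }' /proc/self/statm) * page_kb ))
smaps_shared_kb=$(awk '/^Shared_(Clean|Dirty):/ { s += $2 } END { print s + 0 }' /proc/self/smaps)
echo "statm shared: ${statm_shared_kb} kB, smaps shared: ${smaps_shared_kb} kB"
```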

> It might be useful to show the magnitude of difference of using RSS vs
> PSS/Private in the case of the Chrome renderer processes.  On the
> system I was looking at there were about 40 of these processes, but I
> picked a few to give an idea:
> 
> localhost ~ # cat /proc/21550/totmaps
> Rss:   98972 kB
> Pss:   54717 kB
> Shared_Clean:  19020 kB
> Shared_Dirty:  26352 kB
> Private_Clean: 0 kB
> Private_Dirty: 53600 kB
> Referenced:92184 kB
> Anonymous: 46524 kB
> AnonHugePages: 24576 kB
> Swap:  13148 kB
> 
> 
> RSS is 80% higher than PSS and 84% higher than private data
> 
> localhost ~ # cat /proc/21470/totmaps
> Rss:  118420 kB
> Pss:   70938 kB
> Shared_Clean:  22212 kB
> Shared_Dirty:  26520 kB
> Private_Clean: 0 kB
> Private_Dirty: 69688 kB
> Referenced:   111500 kB
> Anonymous: 79928 kB
> AnonHugePages: 24576 kB
> Swap:  12964 kB
> 
> RSS is 66% higher than PSS and 69% higher than private data
> 
> localhost ~ # cat /proc/21435/totmaps
> Rss:   97156 kB
> Pss:   50044 kB
> Shared_Clean:  21920 kB
> Shared_Dirty:  26400 kB
> Private_Clean: 0 kB
> Private_Dirty: 48836 kB
> Referenced:90012 kB
> Anonymous: 75228 kB
> AnonHugePages: 24576 kB
> Swap:  13064 kB
> 
> RSS is 94% higher than PSS and 98% higher than private data.
> 
> It looks like there's a set of about 40MB of shared pages which cause
> the difference in this case.
> Swap was roughly even on these but I don't think it's always going to be true.

OK, I see that those processes differ in the way how they are using
memory but I am not really sure what kind of conclusion you can draw
from that.
-- 
Michal Hocko
SUSE Labs


utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc//totmaps)

2016-08-23 Thread Michal Hocko
On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > [...]
> > > > I have no idea why those numbers are so different on my laptop
> > > > yet. It surely looks suspicious. I will try to debug this further
> > > > tomorrow.
> > > 
> > > Hmm, so I've tried to use my version of awk on other machine and vice
> > > versa and it didn't make any difference. So this is independent of the
> > > awk version it seems. So I've tried to strace /usr/bin/time and
> > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, 
> > > ru_stime={0, 688438}, ...}) = 9128
> > > 
> > > so the kernel indeed reports 0 user time for some reason. Note I
> > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> > > modifications). The other machine which reports non-0 utime is 3.12
> > > SLES kernel. Maybe I am hitting some accounting bug. At first I was
> > > suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> > > my and the other machine but then I've noticed that the tests I was
> > > doing in kvm have this disabled too.. so it must be something else.
> > 
> > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
> > in both kernels.
> 
> and one more thing. It is not like utime accounting would be completely
> broken and always report 0. Other commands report non-0 values even on
> 4.6 kernels. I will try to bisect this down later today.

OK, so it seems I found it. I was quite lucky because account_user_time
is not all that popular function and there were basically no changes
besides Riks ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN
to jiffy granularity") and that seems to cause the regression. Reverting
the commit on top of the current mmotm seems to fix the issue for me.

And just to give Rik more context. While debugging overhead of the
/proc//smaps I am getting a misleading output from /usr/bin/time -v
(source for ./max_mmap is [1])

root@test1:~# uname -r
4.5.0-rc6-bisect1-00025-gff9a9b4c4334
root@test1:~# ./max_map 
pid:2990 maps:65515
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
rss:263368 pss:262203
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
User time (seconds): 0.00
System time (seconds): 0.45
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.46
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1796
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 83
Voluntary context switches: 6
Involuntary context switches: 6
Swaps: 0
File system inputs: 248
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

See the User time being 0 (as you can see above in the quoted text it
is not a rounding error in userspace or something similar because wait4
really returns 0). Now with the revert
root@test1:~# uname -r
4.5.0-rc6-revert-00026-g7fc86f968bf5
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
rss:263316 pss:262199
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
User time (seconds): 0.18
System time (seconds): 0.29
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.50
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1760
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 79
Voluntary context switches: 5
Involuntary context switches: 7
Swaps: 0
File system inputs: 248
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

So it looks like the whole user time is accounted as the system time.
My config is attached and yes I do have CONFIG_VIRT_CPU_ACCOUNTING_GEN
enabled. Could you have a look please?

[1] http://lkml.kernel.org/r/20160817082200.ga10...@dhcp22.suse.cz
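A minimal way to reproduce the symptom without relying on /usr/bin/time at all is to read the kernel's own per-task counters from /proc/self/stat (a sketch; it assumes the comm field contains no spaces so that naive field splitting works):

```shell
# Burn user-space CPU in this shell, then read the kernel's counters:
# fields 14 (utime) and 15 (stime) of /proc/self/stat, in clock ticks.
# On a kernel with the regression described above, utime stays 0 and
# the whole busy loop is accounted to stime instead.
i=0
while [ "$i" -lt 300000 ]; do i=$((i+1)); done
read -r _ _ _ _ _ _ _ _ _ _ _ _ _ utime stime _ < /proc/self/stat
echo "utime=${utime} stime=${stime} (clock ticks)"
```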
-- 
Michal Hocko
SUSE Labs


.config.gz
Description: application/gzip


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-23 Thread Michal Hocko
On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > [...]
> > > I have no idea why those numbers are so different on my laptop
> > > yet. It surely looks suspicious. I will try to debug this further
> > > tomorrow.
> > 
> > Hmm, so I've tried to use my version of awk on other machine and vice
> > versa and it didn't make any difference. So this is independent of the
> > awk version it seems. So I've tried to strace /usr/bin/time and
> > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, 
> > ru_stime={0, 688438}, ...}) = 9128
> > 
> > so the kernel indeed reports 0 user time for some reason. Note I
> > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> > modifications). The other machine which reports non-0 utime is 3.12
> > SLES kernel. Maybe I am hitting some accounting bug. At first I was
> > suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> > my and the other machine but then I've noticed that the tests I was
> > doing in kvm have this disabled too.. so it must be something else.
> 
> 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
> in both kernels.

and one more thing. It is not like utime accounting would be completely
broken and always report 0. Other commands report non-0 values even on
4.6 kernels. I will try to bisect this down later today.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> [...]
> > I have no idea why those numbers are so different on my laptop
> > yet. It surely looks suspicious. I will try to debug this further
> > tomorrow.
> 
> Hmm, so I've tried to use my version of awk on other machine and vice
> versa and it didn't make any difference. So this is independent of the
> awk version it seems. So I've tried to strace /usr/bin/time and
> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, 
> ru_stime={0, 688438}, ...}) = 9128
> 
> so the kernel indeed reports 0 user time for some reason. Note I
> was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> modifications). The other machine which reports non-0 utime is 3.12
> SLES kernel. Maybe I am hitting some accounting bug. At first I was
> suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> my and the other machine but then I've noticed that the tests I was
> doing in kvm have this disabled too.. so it must be something else.

4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
in both kernels.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Michal Hocko
On Fri 19-08-16 10:57:48, Sonny Rao wrote:
> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Thu 18-08-16 23:43:39, Sonny Rao wrote:
> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mho...@kernel.org> wrote:
> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mho...@kernel.org> 
> >> >> wrote:
> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> >> > [...]
> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> >> >> than let the kernel's OOM killer activate and need to gather this
> >> >> >> information and we'd like to be able to get this information to make
> >> >> >> the decision much faster than 400ms
> >> >> >
> >> >> > Global OOM handling in userspace is really dubious if you ask me. I
> >> >> > understand you want something better than SIGKILL and in fact this is
> >> >> > already possible with memory cgroup controller (btw. memcg will give
> >> >> > you a cheap access to rss, amount of shared, swapped out memory as
> >> >> > well). Anyway if you are getting close to the OOM your system will 
> >> >> > most
> >> >> > probably be really busy and chances are that also reading your new 
> >> >> > file
> >> >> > will take much more time. I am also not quite sure how is pss useful 
> >> >> > for
> >> >> > oom decisions.
> >> >>
> >> >> I mentioned it before, but based on experience RSS just isn't good
> >> >> enough -- there's too much sharing going on in our use case to make
> >> >> the correct decision based on RSS.  If RSS were good enough, simply
> >> >> put, this patch wouldn't exist.
> >> >
> >> > But that doesn't answer my question, I am afraid. So how exactly do you
> >> > use pss for oom decisions?
> >>
> >> We use PSS to calculate the memory used by a process among all the
> >> processes in the system, in the case of Chrome this tells us how much
> >> each renderer process (which is roughly tied to a particular "tab" in
> >> Chrome) is using and how much it has swapped out, so we know what the
> >> worst offenders are -- I'm not sure what's unclear about that?
> >
> > So let me ask more specifically. How can you make any decision based on
> > the pss when you do not know _what_ is the shared resource. In other
> > words if you select a task to terminate based on the pss then you have to
> > kill others who share the same resource otherwise you do not release
> > that shared resource. Not to mention that such a shared resource might
> > be on tmpfs/shmem and it won't get released even after all processes
> > which map it are gone.
> 
> Ok I see why you're confused now, sorry.
> 
> In our case that we do know what is being shared in general because
> the sharing is mostly between those processes that we're looking at
> and not other random processes or tmpfs, so PSS gives us useful data
> in the context of these processes which are sharing the data
> especially for monitoring between the set of these renderer processes.

OK, I see and agree that pss might be useful when you _know_ what is
shared. But this sounds quite specific to a particular workload. How
many users are in a similar situation? In other words, if we present
a single number without the context, how useful will it be in
general? Is it possible that presenting such a number could be even
misleading for somebody who doesn't have an idea which resources are
shared? These are all questions which should be answered before we
actually add this number (be it a new/existing proc file or a syscall).
I still believe that the number without wider context is just not all
that useful.

> We also use the private clean and private dirty and swap fields to
> make a few metrics for the processes and charge each process for it's
> private, shared, and swap data. Private clean and dirty are used for
> estimating a lower bound on how much memory would be freed.

I can imagine that this kind of information might be useful and
presented in /proc//statm. The question is whether some of the
existing consumers would see a performance impact due to the page table
walk. Anyway even these counters might get quite tricky because even
shareable resources are considered private if the process is the only
one to map them (so again this might be a file on tmpfs...).

> Swap and
> PSS also give us some indication of additional memory which might get
> freed up.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-22 Thread Michal Hocko
On Mon 22-08-16 09:07:45, Minchan Kim wrote:
[...]
> #!/bin/sh
> ./smap_test &
> pid=$!
> 
> for i in $(seq 25)
> do
> awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
>  /proc/$pid/smaps
> done
> kill $pid
> 
> root@bbox:/home/barrios/test/smap# time ./s.sh 
> pid:21973
> 
> real0m17.812s
> user0m12.612s
> sys 0m5.187s

retested on the bare metal (x86_64 - 2CPUs)
Command being timed: "sh s.sh"
User time (seconds): 0.00
System time (seconds): 18.08
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29

multiple runs are quite consistent in those numbers. I am running with
$ awk --version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

> > like a problem we are not able to address. And I would even argue that
> > we want to address it in a generic way as much as possible.
> 
> Sure. What solution do you have in mind as the generic way?

either optimize seq_printf or replace it with something faster.

-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Michal Hocko
On Fri 19-08-16 11:26:34, Minchan Kim wrote:
> Hi Michal,
> 
> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mho...@kernel.org> wrote:
> > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > [...]
> > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> > > >> than let the kernel's OOM killer activate and need to gather this
> > > >> information and we'd like to be able to get this information to make
> > > >> the decision much faster than 400ms
> > > >
> > > > Global OOM handling in userspace is really dubious if you ask me. I
> > > > understand you want something better than SIGKILL and in fact this is
> > > > already possible with memory cgroup controller (btw. memcg will give
> > > > you a cheap access to rss, amount of shared, swapped out memory as
> > > > well). Anyway if you are getting close to the OOM your system will most
> > > > probably be really busy and chances are that also reading your new file
> > > > will take much more time. I am also not quite sure how is pss useful for
> > > > oom decisions.
> > > 
> > > I mentioned it before, but based on experience RSS just isn't good
> > > enough -- there's too much sharing going on in our use case to make
> > > the correct decision based on RSS.  If RSS were good enough, simply
> > > put, this patch wouldn't exist.
> > 
> > But that doesn't answer my question, I am afraid. So how exactly do you
> > use pss for oom decisions?
> 
> My case is not for OOM decision but I agree it would be great if we can get
> *fast* smap summary information.
> 
> PSS is a really great tool to figure out how processes consume memory,
> more exactly than RSS. We have been using it for per-process memory
> monitoring. Although it is not used for OOM decisions, it would be
> great if it were sped up, because we don't want to spend much CPU
> time on mere monitoring.
> 
> For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
> Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
> hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
> either. Even, we don't need address range for just monitoring when we don't
> investigate in detail.
> 
> Although they are not severe overhead, why emit the useless
> information? It even bloats day by day. :( With that, userspace tools
> must spend more time parsing, which is pointless.

So far it doesn't really seem that the parsing is the biggest problem.
The major cycles killer is the output formatting and that doesn't sound
like a problem we are not able to address. And I would even argue that
we want to address it in a generic way as much as possible.

> Having said that, I'm not fan of creating new stat knob for that, either.
> How about appending summary information in the end of smap?
> So, monitoring users can just open the file and lseek to the (end - 1) and
> read the summary only.

That might confuse existing parsers. Besides, we already have
/proc//statm which gives cumulative numbers. I am not sure
how often it is used and whether the pte walk is too expensive for
existing users but that should be explored and evaluated before a new
file is created.

The /proc filesystem became a dump of everything people found
interesting just because it was too easy to allow those additions. Do
not repeat those mistakes, please!
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-19 Thread Michal Hocko
On Thu 18-08-16 23:43:39, Sonny Rao wrote:
> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mho...@kernel.org> wrote:
> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > [...]
> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> >> than let the kernel's OOM killer activate and need to gather this
> >> >> information and we'd like to be able to get this information to make
> >> >> the decision much faster than 400ms
> >> >
> >> > Global OOM handling in userspace is really dubious if you ask me. I
> >> > understand you want something better than SIGKILL and in fact this is
> >> > already possible with memory cgroup controller (btw. memcg will give
> >> > you a cheap access to rss, amount of shared, swapped out memory as
> >> > well). Anyway if you are getting close to the OOM your system will most
> >> > probably be really busy and chances are that also reading your new file
> >> > will take much more time. I am also not quite sure how is pss useful for
> >> > oom decisions.
> >>
> >> I mentioned it before, but based on experience RSS just isn't good
> >> enough -- there's too much sharing going on in our use case to make
> >> the correct decision based on RSS.  If RSS were good enough, simply
> >> put, this patch wouldn't exist.
> >
> > But that doesn't answer my question, I am afraid. So how exactly do you
> > use pss for oom decisions?
> 
> We use PSS to calculate the memory used by a process among all the
> processes in the system, in the case of Chrome this tells us how much
> each renderer process (which is roughly tied to a particular "tab" in
> Chrome) is using and how much it has swapped out, so we know what the
> worst offenders are -- I'm not sure what's unclear about that?

So let me ask more specifically. How can you make any decision based on
the pss when you do not know _what_ is the shared resource. In other
words if you select a task to terminate based on the pss then you have to
kill others who share the same resource otherwise you do not release
that shared resource. Not to mention that such a shared resource might
be on tmpfs/shmem and it won't get released even after all processes
which map it are gone.

I am sorry for being dense but it is still not clear to me how the
single pss number can be used for oom or, in general, any serious
decisions. The counter might be useful of course for debugging purposes
or to have a general overview but then arguing about 40 vs 20ms sounds a
bit strange to me.
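The argument can be made concrete with toy numbers (hypothetical figures, chosen only to illustrate the point): two processes share one mapping and each has some private memory; PSS splits the shared region evenly between them, yet killing either process frees only its private portion.

```shell
# Toy PSS arithmetic for two processes sharing a 100 MB region.
# PSS charges each mapper 1/N of every shared page, so the "biggest
# PSS" task is not necessarily the one whose death frees the most.
shared=102400                 # kB, mapped by both A and B
a_priv=30720; b_priv=10240    # kB, private to each process
a_pss=$(( a_priv + shared / 2 ))
b_pss=$(( b_priv + shared / 2 ))
echo "A: pss=${a_pss} kB, freed if killed=${a_priv} kB"
echo "B: pss=${b_pss} kB, freed if killed=${b_priv} kB"
```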

> Chrome tends to use a lot of shared memory so we found PSS to be
> better than RSS, and I can give you examples of the  RSS and PSS on
> real systems to illustrate the magnitude of the difference between
> those two numbers if that would be useful.
> 
> >
> >> So even with memcg I think we'd have the same problem?
> >
> > memcg will give you instant anon, shared counters for all processes in
> > the memcg.
> >
> 
> We want to be able to get per-process granularity quickly.  I'm not
> sure if memcg provides that exactly?

It will give you that information if you do process-per-memcg but that
doesn't sound ideal. I thought those 20-something processes you were
talking about are treated together but it seems I misunderstood.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc//totmaps

2016-08-18 Thread Michal Hocko
On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
[...]
> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> than let the kernel's OOM killer activate and need to gather this
> >> information and we'd like to be able to get this information to make
> >> the decision much faster than 400ms
> >
> > Global OOM handling in userspace is really dubious if you ask me. I
> > understand you want something better than SIGKILL and in fact this is
> > already possible with memory cgroup controller (btw. memcg will give
> > you a cheap access to rss, amount of shared, swapped out memory as
> > well). Anyway if you are getting close to the OOM your system will most
> > probably be really busy and chances are that also reading your new file
> > will take much more time. I am also not quite sure how is pss useful for
> > oom decisions.
> 
> I mentioned it before, but based on experience RSS just isn't good
> enough -- there's too much sharing going on in our use case to make
> the correct decision based on RSS.  If RSS were good enough, simply
> put, this patch wouldn't exist.

But that doesn't answer my question, I am afraid. So how exactly do you
use pss for oom decisions?

> So even with memcg I think we'd have the same problem?

memcg will give you instant anon, shared counters for all processes in
the memcg.

> > Don't take me wrong, /proc//totmaps might be suitable for your
> > specific usecase but so far I haven't heard any sound argument for it to
> > be generally usable. It is true that smaps is unnecessarily costly but
> > at least I can see some room for improvements. A simple patch I've
> > posted cut the formatting overhead by 7%. Maybe we can do more.
> 
> It seems like a general problem that if you want these values the
> existing kernel interface can be very expensive, so it would be
> generally usable by any application which wants a per process PSS,
> private data, dirty data or swap value.

yes this is really unfortunate. And if at all possible we should address
that. Precise values require the expensive rmap walk. We can introduce
some caching to help that. But so far it seems the biggest overhead is
to simply format the output and that should be addressed before any new
proc file is added.

> I mentioned two use cases, but I guess I don't understand the comment
> about why it's not usable by other use cases.

I might be wrong here but a use of pss is quite limited and I do not
remember anybody asking for large optimizations in that area. I still do
not understand your use cases properly so I am quite skeptical about a
general usefulness of a new file.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-18 Thread Michal Hocko
On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Wed 17-08-16 11:31:25, Jann Horn wrote:
[...]
> >> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
> >> time spent on evaluating format strings. The new interface
> >> wouldn't have to spend that much time on format strings because there
> >> isn't so much text to format.
> >
> > well, this is true of course but I would much rather try to reduce the
> > overhead of smaps file than add a new file. The following should help
> > already. I've measured ~7% systime cut down. I guess there is still some
> > room for improvements but I have to say I'm far from being convinced about
> > a new proc file just because we suck at dumping information to the
> > userspace.
> > If this was something like /proc/<pid>/stat which is
> > essentially read all the time then it would be a different question but
> > is the rss, pss going to be all that often? If yes why?
> 
> If the question is why do we need to read RSS, PSS, Private_*, Swap
> and the other fields so often?
> 
> I have two use cases so far involving monitoring per-process memory
> usage, and we usually need to read stats for about 25 processes.
> 
> Here's a timing example on a fairly recent ARM system, a 4 core RK3288
> running at 1.8Ghz
> 
> localhost ~ # time cat /proc/25946/smaps > /dev/null
> 
> real    0m0.036s
> user    0m0.020s
> sys 0m0.020s
> 
> localhost ~ # time cat /proc/25946/totmaps > /dev/null
> 
> real    0m0.027s
> user    0m0.010s
> sys 0m0.010s
> localhost ~ #
> 
> I'll ignore the user time for now, and we see about 20 ms of system
> time with smaps and 10 ms with totmaps, with 20 similar processes it
> would be 400 milliseconds of cpu time for the kernel to get this
> information from smaps vs 200 milliseconds with totmaps.  Even totmaps
> is still pretty slow, but much better than smaps.
> 
> Use cases:
> 1) Basic task monitoring -- like "top" that shows memory consumption
> including PSS, Private, Swap
> 1 second update means about 40% of one CPU is spent in the kernel
> gathering the data with smaps

I would argue that even 20% is way too much for such a monitoring. What
is the value of doing it so often that 20 vs 40ms really matters?

> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> than let the kernel's OOM killer activate and need to gather this
> information and we'd like to be able to get this information to make
> the decision much faster than 400ms

Global OOM handling in userspace is really dubious if you ask me. I
understand you want something better than SIGKILL and in fact this is
already possible with memory cgroup controller (btw. memcg will give
you a cheap access to rss, amount of shared, swapped out memory as
well). Anyway if you are getting close to the OOM your system will most
probably be really busy and chances are that also reading your new file
will take much more time. I am also not quite sure how is pss useful for
oom decisions.

Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
specific usecase but so far I haven't heard any sound argument for it to
be generally usable. It is true that smaps is unnecessarily costly but
at least I can see some room for improvements. A simple patch I've
posted cut the formatting overhead by 7%. Maybe we can do more.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-17 Thread Michal Hocko
On Wed 17-08-16 11:31:25, Jann Horn wrote:
> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
> > On Tue 16-08-16 12:46:51, Robert Foss wrote:
> > [...]
> > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> > > /proc/5025/smaps }"
> > > [...]
> > >   Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' 
> > > /proc/5025/smaps
> > > }"
> > >   User time (seconds): 0.37
> > >   System time (seconds): 0.45
> > >   Percent of CPU this job got: 92%
> > >   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
> > 
> > This is really unexpected. Where is the user time spent? Anyway, rather
> > than measuring some random processes I've tried to measure something
> > resembling the worst case. So I've created a simple program to mmap as
> > much as possible:
> > 
> > #include <sys/mman.h>
> > #include <stdio.h>
> > #include <unistd.h>
> > int main()
> > {
> > while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
> > MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
> > ;
> > 
> > printf("pid:%d\n", getpid());
> > pause();
> > return 0;
> > }
> 
> Ah, nice, that's a reasonable test program. :)
> 
> 
> > So with a reasonable user space the parsing is really not all that time
> > consuming wrt. smaps handling. That being said I am still very skeptical
> > about a dedicated proc file which accomplishes what userspace can do
> > in a trivial way.
> 
> Now, since your numbers showed that all the time is spent in the kernel,
> also create this test program to just read that file over and over again:
> 
> $ cat justreadloop.c
> #include <stdio.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <err.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> 
> char buf[100];
> 
> int main(int argc, char **argv) {
>   printf("pid:%d\n", getpid());
>   while (1) {
> int fd = open(argv[1], O_RDONLY);
> if (fd < 0) continue;
> if (read(fd, buf, sizeof(buf)) < 0)
>   err(1, "read");
> close(fd);
>   }
> }
> $ gcc -Wall -o justreadloop justreadloop.c
> $ 
> 
> Now launch your test:
> 
> $ ./mapstuff 
> pid:29397
> 
> point justreadloop at it:
> 
> $ ./justreadloop /proc/29397/smaps
> pid:32567
> 
> ... and then check the performance stats of justreadloop:
> 
> # perf top -p 32567
> 
> This is what I see:
> 
> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
> Overhead  Shared Object Symbol
>   30,43%  [kernel]  [k] format_decode
>9,12%  [kernel]  [k] number
>7,66%  [kernel]  [k] vsnprintf
>7,06%  [kernel]  [k] __lock_acquire
>3,23%  [kernel]  [k] lock_release
>2,85%  [kernel]  [k] debug_lockdep_rcu_enabled
>2,25%  [kernel]  [k] skip_atoi
>2,13%  [kernel]  [k] lock_acquire
>2,05%  [kernel]  [k] show_smap

This is a lot! I would expect the rmap walk to consume more but it even
doesn't show up in the top consumers.
 
> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
> time spent on evaluating format strings. The new interface
> wouldn't have to spend that much time on format strings because there
> isn't so much text to format.

well, this is true of course but I would much rather try to reduce the
overhead of smaps file than add a new file. The following should help
already. I've measured ~7% systime cut down. I guess there is still some
room for improvements but I have to say I'm far from being convinced about
a new proc file just because we suck at dumping information to the
userspace. If this was something like /proc/<pid>/stat which is
essentially read all the time then it would be a different question but
is the rss, pss going to be all that often? If yes why? These are the
questions which should be answered before we even start considering the
implementation.
---
From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Wed, 17 Aug 2016 14:00:13 +0200
Subject: [PATCH] proc, smaps: reduce printing overhead

seq_printf (used by show_smap) can be pretty expensive when dumping a
lot of numbers.  Say we would like to get Rss and Pss from a particular
process.  In order to measure a pathological case let's generate as many
mappings as possible:

$ cat max_mmap.c
int main()
{
while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_ANON|MAP

Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-17 Thread Michal Hocko
On Tue 16-08-16 12:46:51, Robert Foss wrote:
[...]
> $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> /proc/5025/smaps }"
> [...]
>   Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
> }"
>   User time (seconds): 0.37
>   System time (seconds): 0.45
>   Percent of CPU this job got: 92%
>   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89

This is really unexpected. Where is the user time spent? Anyway, rather
than measuring some random processes I've tried to measure something
resembling the worst case. So I've created a simple program to mmap as
much as possible:

#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
;

printf("pid:%d\n", getpid());
pause();
return 0;
}

so depending on /proc/sys/vm/max_map_count you will get the maximum
possible mmaps. I am using a default so 65k mappings. Then I have
retried your 25x file parsing:
$ cat s.sh
#!/bin/sh

pid=$1
for i in $(seq 25)
do
awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", 
rss, pss}' /proc/$pid/smaps
done

But I am getting different results from you:
$ awk '/^[0-9a-f]/{print}' /proc/14808/smaps | wc -l
65532
[...]
Command being timed: "sh s.sh 14808"
User time (seconds): 0.00
System time (seconds): 20.10
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.20

The results are stable when I try multiple times, in fact there
shouldn't be any reason for them not to be. Then I went on to increase
max_map_count to 250k and that behaves consistently:
$ awk '/^[0-9a-f]/{print}' /proc/16093/smaps | wc -l 
250002
[...]
Command being timed: "sh s.sh 16093"
User time (seconds): 0.00
System time (seconds): 77.93
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.09

So with a reasonable user space the parsing is really not all that time
consuming wrt. smaps handling. That being said I am still very skeptical
about a dedicated proc file which accomplishes what userspace can do
in a trivial way.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-16 Thread Michal Hocko
On Mon 15-08-16 12:25:10, Robert Foss wrote:
> 
> 
> On 2016-08-15 09:42 AM, Michal Hocko wrote:
[...]
> > The use case is to speed up monitoring of
> > memory consumption in environments where RSS isn't precise.
> >
> > For example Chrome tends to have many processes which have hundreds of VMAs
> > with a substantial amount of shared memory, and the error of using
> > RSS rather than PSS tends to be very large when looking at overall
> > memory consumption.  PSS isn't kept as a single number that's exported
> > like RSS, so to calculate PSS means having to parse a very large smaps
> > file.
> >
> > This process is slow and has to be repeated for many processes, and we
> > found that the just act of doing the parsing was taking up a
> > significant amount of CPU time, so this patch is an attempt to make
> > that process cheaper.

Well, this is slow because it requires the pte walk otherwise you cannot
know how many ptes map the particular shared page. Your patch
(totmaps_proc_show) does the very same page table walk because in fact
it is unavoidable. So what exactly is the difference except for the
userspace parsing which is quite trivial e.g. my currently running Firefox
has
$ awk '/^[0-9a-f]/{print}' /proc/4950/smaps | wc -l
984

quite some VMAs, yet parsing it spends basically all the time in the kernel...

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d 
pss:%d\n", rss, pss}' /proc/4950/smaps 
rss:1112288 pss:1096435
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf 
"rss:%d pss:%d\n", rss, pss} /proc/4950/smaps"
User time (seconds): 0.00
System time (seconds): 0.02
Percent of CPU this job got: 91%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02

So I am not really sure I see the performance benefit.
-- 
Michal Hocko
SUSE Labs


Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

2016-08-15 Thread Michal Hocko
On Mon 15-08-16 09:00:04, Robert Foss wrote:
> 
> 
> On 2016-08-14 05:04 AM, Michal Hocko wrote:
> > On Fri 12-08-16 18:04:19, robert.f...@collabora.com wrote:
> > > From: Robert Foss <robert.f...@collabora.com>
> > > 
> > > This series implements /proc/PID/totmaps, a tool for retrieving summarized
> > > information about the mappings of a process.
> > 
> > The changelog is absolutely missing the usecase. Why do we need this?
> > Why existing interfaces are not sufficient?
> 
> You are absolutely right, more information is in 1/3.

Patch 1 is silent about the use case as well. It is usually recommended
to describe the motivation for the change in the cover letter.

> But the gist of it is that it provides a faster and more convenient way of
> accessing the information in /proc/PID/smaps.

I am sorry to insist but this is far from a description I was hoping
for. Why do we need a more convenient API? Please note that this is a
userspace API which we will have to maintain for ever. We have made many
mistakes in the past where exporting some information made sense at the
time while it turned out being a mistake only later on. So let's make
sure we will not fall into the same trap again.

So please make sure you describe the use case, why the current API is
insufficient and why it cannot be tweaked to provide the information you
are looking for.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "mm: rename _count, field of the struct page, to _refcount"

2016-06-16 Thread Michal Hocko
On Thu 16-06-16 13:22:27, Vitaly Kuznetsov wrote:
> Michal Hocko <mho...@kernel.org> writes:
> 
> > On Thu 16-06-16 12:30:16, Vitaly Kuznetsov wrote:
> >> Christoph Hellwig <h...@infradead.org> writes:
> >> 
> >> > On Thu, Jun 16, 2016 at 11:22:46AM +0200, Vitaly Kuznetsov wrote:
> >> >> _count -> _refcount rename in commit 0139aa7b7fa12 ("mm: rename _count,
> >> >> field of the struct page, to _refcount") broke kdump. makedumpfile(8) 
> >> >> does
> >> >> stuff like READ_MEMBER_OFFSET("page._count", page._count) and fails. 
> >> >> While
> >> >> it is definitely possible to fix this particular tool I'm not sure about
> >> >> other tools which might be doing the same.
> >> >> 
> >> >> I suggest we remember the "we don't break userspace" rule and revert for
> >> >> 4.7 while it's not too late.
> >> >
> >> > Err, sorry - this is not "userspace".  It's crazy crap digging into
> >> > kernel internal structure.
> >> >
> >> > The rename was absolutely useful, so fix up your stinking pike in kdump.
> >> 
> >> Ok, sure, I'll send a patch to it. I was worried about other tools out
> >> there which e.g. inspect /proc/vmcore. As it is something we support
> >> some conservatism around it is justified.
> >
> > struct page layout as some others that such a tool might depend on has
> > changes several times in the past so I fail to see how is it any
> > different this time.
> 
> IMO this time the change doesn't give us any advantage, it was just a
> rename.

Which would catch all the pending users who are not using the
appropriate API. This is IMHO very useful as the sole purpose of the
change is to catch _all_ users. So the reason is pretty much technical.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "mm: rename _count, field of the struct page, to _refcount"

2016-06-16 Thread Michal Hocko
On Thu 16-06-16 12:30:16, Vitaly Kuznetsov wrote:
> Christoph Hellwig <h...@infradead.org> writes:
> 
> > On Thu, Jun 16, 2016 at 11:22:46AM +0200, Vitaly Kuznetsov wrote:
> >> _count -> _refcount rename in commit 0139aa7b7fa12 ("mm: rename _count,
> >> field of the struct page, to _refcount") broke kdump. makedumpfile(8) does
> >> stuff like READ_MEMBER_OFFSET("page._count", page._count) and fails. While
> >> it is definitely possible to fix this particular tool I'm not sure about
> >> other tools which might be doing the same.
> >> 
> >> I suggest we remember the "we don't break userspace" rule and revert for
> >> 4.7 while it's not too late.
> >
> > Err, sorry - this is not "userspace".  It's crazy crap digging into
> > kernel internal structure.
> >
> > The rename was absolutely useful, so fix up your stinking pike in kdump.
> 
> Ok, sure, I'll send a patch to it. I was worried about other tools out
> there which e.g. inspect /proc/vmcore. As it is something we support
> some conservatism around it is justified.

struct page layout, as some others that such a tool might depend on, has
changed several times in the past, so I fail to see how it is any
different this time. struct page is nothing the userspace should depend
on.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0037/1529] Fix typo

2016-05-23 Thread Michal Hocko
This and a few others have just landed in my inbox. Not only does the
changelog not mention the point of such a typo fix - does it help
useful grep patterns when checking the documentation? A 1529-patch series
just sounds crazy. So let me ask, what is the whole point of this patch
bomb? Maybe the cover letter has mentioned that but this has either been
sent without threading or the threading was broken.

On Sat 21-05-16 13:40:37, Andrea Gelmini wrote:
> Signed-off-by: Andrea Gelmini <andrea.gelm...@gelma.net>
> ---
>  Documentation/filesystems/proc.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index e8d0075..66d077e 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -463,8 +463,8 @@ accessed.
>  "Anonymous" shows the amount of memory that does not belong to any file.  
> Even
>  a mapping associated with a file may contain anonymous pages: when 
> MAP_PRIVATE
>  and a page is modified, the file page is replaced by a private anonymous 
> copy.
> -"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
> -"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
> +"AnonHugePages" shows the amount of memory backed by transparent hugepage.
> +"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
>  hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
>  reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
>  "Swap" shows how much would-be-anonymous memory is also used, but out on 
> swap.
> -- 
> 2.8.2.534.g1f66975
> 

-- 
Michal Hocko
SUSE Labs

