[patch -mm 4/4] mm, memcg: add hierarchical usage oom policy

2018-01-16 Thread David Rientjes
r. This allows administrators, for example, to require users in their own top-level mem cgroup subtree to be accounted for with hierarchical usage. In other words, they can no longer evade the oom killer by using other controllers or subcontainers. Signed-off-by: David Rientjes --- Documentation/cgr

[patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-16 Thread David Rientjes
writing "cgroup" to the root mem cgroup's memory.oom_policy). The "all" oom policy cannot be enabled on the root mem cgroup. Signed-off-by: David Rientjes --- Documentation/cgroup-v2.txt | 51 ++--- include/linux/memcon
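Taken together, patches 3/4 and 4/4 describe a per-memcg file interface. A minimal configuration sketch, using only the policy names that appear in these descriptions; the memory.oom_policy tunable was proposed in this patchset and never merged upstream, so the paths and cgroup names below are illustrative only:

```shell
# Hedged sketch of the proposed (unmerged) memory.oom_policy interface.

# Compare mem cgroups rather than individual processes on system oom:
echo cgroup > /sys/fs/cgroup/memory.oom_policy

# Kill every process attached to the selected memcg; per patch 3/4, "all"
# cannot be enabled on the root mem cgroup ("userA" here is hypothetical):
echo all > /sys/fs/cgroup/userA/memory.oom_policy
```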

[patch -mm 0/4] mm, memcg: introduce oom policies

2018-01-16 Thread David Rientjes
There are three significant concerns about the cgroup aware oom killer as it is implemented in -mm: (1) allows users to evade the oom killer by creating subcontainers or using other controllers since scoring is done per cgroup and not hierarchically, (2) does not allow the user to inf

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-16 Thread David Rientjes
On Mon, 15 Jan 2018, Michal Hocko wrote: > > No, this isn't how kernel features get introduced. We don't design a new > > kernel feature with its own API for a highly specialized usecase and then > > claim we'll fix the problems later. Users will work around the > > constraints of the new fea

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-16 Thread David Rientjes
On Mon, 15 Jan 2018, Johannes Weiner wrote: > > It's quite trivial to allow the root mem cgroup to be compared exactly the > > same as another cgroup. Please see > > https://marc.info/?l=linux-kernel&m=151579459920305. > > This only says "that will be fixed" and doesn't address why I care. >

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-14 Thread David Rientjes
On Sat, 13 Jan 2018, Johannes Weiner wrote: > You don't have any control and no accounting of the stuff situated > inside the root cgroup, so it doesn't make sense to leave anything in > there while also using sophisticated containerization mechanisms like > this group oom setting. > > In fact, t

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-12 Thread David Rientjes
. Cgroup v2 is a very clean interface and I think it's the responsibility of every controller to maintain that. We should not fall into a cgroup v1 mentality which became very difficult to make extensible. Let's make a feature that is generally useful, complete, and empowers th

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-11 Thread David Rientjes
On Thu, 11 Jan 2018, Michal Hocko wrote: > > > I find this problem quite minor, because I haven't seen any practical > > > problems > > > caused by accounting of the root cgroup memory. > > > If it's a serious problem for you, it can be solved without switching to > > > the > > > hierarchical ac

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-10 Thread David Rientjes
On Wed, 10 Jan 2018, Roman Gushchin wrote: > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups > > > > The patchset uses two different heuristics to compare root and leaf mem > > cgroups and scores them based on number of pages. For the root mem > > cgroup, it totals the /p

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-09 Thread David Rientjes
On Thu, 30 Nov 2017, Andrew Morton wrote: > > This patchset makes the OOM killer cgroup-aware. > > Thanks, I'll grab these. > > There has been controversy over this patchset, to say the least. I > can't say that I followed it closely! Could those who still have > reservations please summarise

[patch -mm] mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks fix fix

2018-01-09 Thread David Rientjes
ment about invalidate_range() always being called under the ptl spinlock. Signed-off-by: David Rientjes --- include/linux/mmu_notifier.h | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/inc

Re: [patch v2 2/2] mm, oom: avoid reaping only for mm's with blockable invalidate callbacks

2017-12-15 Thread David Rientjes
On Fri, 15 Dec 2017, Michal Hocko wrote: > > This uses the new annotation to determine if an mm has mmu notifiers with > > blockable invalidate range callbacks to avoid oom reaping. Otherwise, the > > callbacks are used around unmap_page_range(). > > Do you have any example where this helped? KV

[patch v2 2/2] mm, oom: avoid reaping only for mm's with blockable invalidate callbacks

2017-12-14 Thread David Rientjes
This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping. Otherwise, the callbacks are used around unmap_page_range(). Signed-off-by: David Rientjes --- mm/oom_kill.c | 21 +++-- 1 file changed, 11

[patch v2 1/2] mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks

2017-12-14 Thread David Rientjes
tch adds a "flags" field to mmu notifier ops that can set a bit to indicate that these callbacks do not block. The implementation is steered toward an expensive slowpath, such as after the oom reaper has grabbed mm->mmap_sem of a still alive oom victim. Signed-off-by: David Rientjes ---

Re: [patch 1/2] mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks

2017-12-14 Thread David Rientjes
On Wed, 13 Dec 2017, Christian König wrote: > > > > --- a/drivers/misc/sgi-gru/grutlbpurge.c > > > > +++ b/drivers/misc/sgi-gru/grutlbpurge.c > > > > @@ -298,6 +298,7 @@ struct gru_mm_struct > > > > *gru_register_mmu_notifier(void) > > > > return ERR_PTR(-ENOMEM); > > > >

Re: [PATCH 0/8] make some functions return bool

2017-12-12 Thread David Rientjes
On Tue, 12 Dec 2017, Randy Dunlap wrote: > Sure, but I didn't keep the patch emails. > > Acked-by: Randy Dunlap > You may have noticed changing functions like is_file_lru() to bool when it is used to index into an array or as part of an arithmetic operation for ZVC stats. I'm not sure why y

Re: [patch 1/2] mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks

2017-12-12 Thread David Rientjes
On Tue, 12 Dec 2017, Dimitri Sivanich wrote: > > --- a/drivers/misc/sgi-gru/grutlbpurge.c > > +++ b/drivers/misc/sgi-gru/grutlbpurge.c > > @@ -298,6 +298,7 @@ struct gru_mm_struct *gru_register_mmu_notifier(void) > > return ERR_PTR(-ENOMEM); > > STAT(gms_alloc); > >

Re: [PATCH 0/8] make some functions return bool

2017-12-11 Thread David Rientjes
On Mon, 11 Dec 2017, Yaowei Bai wrote: > This patchset makes some *_is_* like functions return bool because > these functions only use true or false as their return values. > > No functional changes. > I think the concern about this type of patchset in the past is that it is unnecessary churn

Re: [patch 1/2] mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks

2017-12-11 Thread David Rientjes
On Mon, 11 Dec 2017, Paolo Bonzini wrote: > > Commit 4d4bbd8526a8 ("mm, oom_reaper: skip mm structs with mmu notifiers") > > prevented the oom reaper from unmapping private anonymous memory with the > > oom reaper when the oom victim mm had mmu notifiers registered. > > > > The rationale is that

[patch 1/2] mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks

2017-12-11 Thread David Rientjes
tch adds a "flags" field for mmu notifiers that can set a bit to indicate that these callbacks do block. The implementation is steered toward an expensive slowpath, such as after the oom reaper has grabbed mm->mmap_sem of a still alive oom victim. Signed-off-by: David Rientjes --- arch/po

[patch 2/2] mm, oom: avoid reaping only for mm's with blockable invalidate callbacks

2017-12-11 Thread David Rientjes
This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping. Otherwise, the callbacks are used around unmap_page_range(). Signed-off-by: David Rientjes --- mm/oom_kill.c | 21 +++-- 1 file changed, 11

Re: [PATCH v2] mm: terminate shrink_slab loop if signal is pending

2017-12-08 Thread David Rientjes
On Thu, 7 Dec 2017, Suren Baghdasaryan wrote: > Slab shrinkers can be quite time consuming and when signal > is pending they can delay handling of the signal. If fatal > signal is pending there is no point in shrinking that process > since it will be killed anyway. This change checks for pending >

Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap

2017-12-08 Thread David Rientjes
On Thu, 7 Dec 2017, David Rientjes wrote: > I'm backporting and testing the following patch against Linus's tree. To > clarify an earlier point, we don't actually have any change from upstream > code that allows for free_pgtables() before the > set_bit(MMF_OOM_

Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap

2017-12-07 Thread David Rientjes
On Thu, 7 Dec 2017, Michal Hocko wrote: > yes. I will fold the following in if this turned out to really address > David's issue. But I suspect this will be the case considering the NULL > pmd in the report which would suggest racing with free_pgtable... > I'm backporting and testing the followi

Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap

2017-12-07 Thread David Rientjes
On Thu, 7 Dec 2017, Michal Hocko wrote: > Very well spotted! It could be any task in fact (e.g. somebody reading > from /proc/ file which requires mm_struct). > > oom_reaper    oom_victim    task > mmget_not_zero >

Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap

2017-12-05 Thread David Rientjes
On Wed, 6 Dec 2017, Tetsuo Handa wrote: > > > One way to solve the issue is to have two mm flags: one to indicate the > > > mm > > > is entering unmap_vmas(): set the flag, do down_write(&mm->mmap_sem); > > > up_write(&mm->mmap_sem), then unmap_vmas(). The oom reaper needs this > > > flag cle

Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap

2017-12-05 Thread David Rientjes
On Tue, 5 Dec 2017, David Rientjes wrote: > One way to solve the issue is to have two mm flags: one to indicate the mm > is entering unmap_vmas(): set the flag, do down_write(&mm->mmap_sem); > up_write(&mm->mmap_sem), then unmap_vmas(). The oom reaper needs this > fl

Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap

2017-12-05 Thread David Rientjes
Hi, I'd like to understand the synchronization between the oom_reaper's unmap_page_range() and exit_mmap(). The latter does not hold mm->mmap_sem: it's supposed to be the last thread operating on the mm before it is destroyed. If unmap_page_range() races with unmap_vmas(), we trivially call

Re: [PATCH v3 1/3] mm/mempolicy: remove redundant check in get_nodes

2017-11-20 Thread David Rientjes
On Fri, 17 Nov 2017, Yisheng Xie wrote: > We have already checked whether maxnode is a page worth of bits, by: > maxnode > PAGE_SIZE*BITS_PER_BYTE > > So no need to check it once more. > > Acked-by: Vlastimil Babka > Signed-off-by: Yisheng Xie Acked-by: David Rientjes

Re: [PATCH v2] mm: show total hugetlb memory consumption in /proc/meminfo

2017-11-17 Thread David Rientjes
4161536 kB > DirectMap1G: 6291456 kB > > Also, this patch updates corresponding docs to reflect > Hugetlb entry meaning and difference between Hugetlb and > HugePages_Total * Hugepagesize. > > Signed-off-by: Roman Gushchin > Cc: Andrew Morton > Cc: Michal Hocko >

Re: [PATCH] mm: show total hugetlb memory consumption in /proc/meminfo

2017-11-15 Thread David Rientjes
On Wed, 15 Nov 2017, Michal Hocko wrote: > > > > if (!hugepages_supported()) > > > > return; > > > > seq_printf(m, > > > > @@ -2987,6 +2989,11 @@ void hugetlb_report_meminfo(struct seq_file *m) > > > > h->resv_huge_pages, > > > >

Re: [PATCH] mm: show total hugetlb memory consumption in /proc/meminfo

2017-11-14 Thread David Rientjes
> Hugepagesize: 2048 kB > > Hugetlb: 4194304 kB > > DirectMap4k: 32632 kB > > DirectMap2M: 4161536 kB > > DirectMap1G: 6291456 kB > > > > Signed-off-by: Roman Gushchin > > Cc: Andrew Morton > > Cc: Michal Hocko > > C

Re: [PATCH 1/3] mm, compaction: extend pageblock_skip_persistent() to all compound pages

2017-11-03 Thread David Rientjes
ge is not synchronously split like it was prior to the thp refcounting patchset, however. Acked-by: David Rientjes

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-11-01 Thread David Rientjes
On Wed, 1 Nov 2017, Michal Hocko wrote: > > memory.oom_score_adj would never need to be permanently tuned, just as > > /proc/pid/oom_score_adj need never be permanently tuned. My response was > > an answer to Roman's concern that "v8 has it's own limitations," but I > > haven't seen a concrete

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-31 Thread David Rientjes
On Tue, 31 Oct 2017, Michal Hocko wrote: > > I'm not ignoring them, I have stated that we need the ability to protect > > important cgroups on the system without oom disabling all attached > > processes. If that is implemented as a memory.oom_score_adj with the same > > semantics as /proc/pid/

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-30 Thread David Rientjes
On Fri, 27 Oct 2017, Roman Gushchin wrote: > The thing is that the hierarchical approach (as in v8), which are you pushing, > has it's own limitations, which we've discussed in details earlier. There are > reasons why v12 is different, and we can't really simple go back. I mean if > there are bett

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-26 Thread David Rientjes
On Thu, 26 Oct 2017, Johannes Weiner wrote: > > The nack is for three reasons: > > > > (1) unfair comparison of root mem cgroup usage to bias against that mem > > cgroup from oom kill in system oom conditions, > > > > (2) the ability of users to completely evade the oom killer by attachi

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-25 Thread David Rientjes
On Mon, 23 Oct 2017, Michal Hocko wrote: > On Sun 22-10-17 17:24:51, David Rientjes wrote: > > On Thu, 19 Oct 2017, Johannes Weiner wrote: > > > > > David would have really liked for this patchset to include knobs to > > > influence how the algorithm pic

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-22 Thread David Rientjes
On Thu, 19 Oct 2017, Johannes Weiner wrote: > David would have really liked for this patchset to include knobs to > influence how the algorithm picks cgroup victims. The rest of us > agreed that this is beyond the scope of these patches, that the > patches don't need it to be useful, and that ther

Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when unreclaimable slabs > user memory

2017-10-17 Thread David Rientjes
On Wed, 18 Oct 2017, Yang Shi wrote: > > Yes, this should catch occurrences of "huge unreclaimable slabs", right? > > Yes, it sounds so. Although single "huge" unreclaimable slab might not result > in excessive slabs use in a whole, but this would help to filter out "small" > unreclaimable slab.

[patch] mm, slab: only set __GFP_RECLAIMABLE once

2017-10-17 Thread David Rientjes
SLAB_RECLAIM_ACCOUNT is a permanent attribute of a slab cache. Set __GFP_RECLAIMABLE as part of its ->allocflags rather than check the cachep flag on every page allocation. Signed-off-by: David Rientjes --- mm/slab.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a

Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when unreclaimable slabs > user memory

2017-10-17 Thread David Rientjes
On Wed, 18 Oct 2017, Yang Shi wrote: > > > > Please simply dump statistics for all slab caches where the memory > > > > footprint is greater than 5% of system memory. > > > > > > Unconditionally? User controlable? > > > > Unconditionally, it's a single line of output per slab cache and there > >

Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when unreclaimable slabs > user memory

2017-10-17 Thread David Rientjes
On Tue, 17 Oct 2017, Michal Hocko wrote: > On Mon 16-10-17 17:15:31, David Rientjes wrote: > > Please simply dump statistics for all slab caches where the memory > > footprint is greater than 5% of system memory. > > Unconditionally? User controlable? Unconditionally,

Re: [PATCH 2/3] mm: slabinfo: dump CONFIG_SLABINFO

2017-10-16 Thread David Rientjes
nfo/?l=linux-kernel&m=150695909709711&w=2 > > Signed-off-by: Yang Shi Acked-by: David Rientjes Cool!

Re: [PATCH 3/3] mm: oom: show unreclaimable slab info when unreclaimable slabs > user memory

2017-10-16 Thread David Rientjes
On Wed, 11 Oct 2017, Yang Shi wrote: > @@ -161,6 +162,25 @@ static bool oom_unkillable_task(struct task_struct *p, > return false; > } > > +/* > + * Print out unreclaimble slabs info when unreclaimable slabs amount is > greater > + * than all user memory (LRU pages) > + */ > +static bool

[patch] mm, compaction: properly initialize alloc_flags in compact_control

2017-10-16 Thread David Rientjes
The same is true for compact_node() when explicitly triggering full node compaction. Properly initialize cc.alloc_flags on the stack. Signed-off-by: David Rientjes --- mm/compaction.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c --

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-13 Thread David Rientjes
On Fri, 13 Oct 2017, Roman Gushchin wrote: > > Think about it in a different way: we currently compare per-process usage > > and userspace has /proc/pid/oom_score_adj to adjust that usage depending > > on priorities of that process and still oom kill if there's a memory leak. > > Your heuristi

Re: [RFC 1/2] cgroup, kthread: do not allow moving kthreads out of the root cgroup

2017-10-12 Thread David Rientjes
On Thu, 12 Oct 2017, Peter Zijlstra wrote: > > Attaching kernel threads to a non-root cgroup is generally a bad > > idea. Kernel threads are generally performing the work required > > to keep the system working and healthy, and applying various > > resource limits may affect system stability and p

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-12 Thread David Rientjes
On Wed, 11 Oct 2017, Roman Gushchin wrote: > > But let's move the discussion forward to fix it. To avoid necessarily > > accounting memory to the root mem cgroup, have we considered if it is even > > necessary to address the root mem cgroup? For the users who opt-in to > > this heuristic, wou

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread David Rientjes
ocesses to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes > > I consider this NACK rather dubious. Evading the heuristic as you > describe requir

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread David Rientjes
ompletely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes > > So, if we'll sum the oo

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-10 Thread David Rientjes
mmit, both of these possibilities exist in the wild and the problem is only a result of the implementation detail of this patchset. For these reasons: unfair comparison of root mem cgroup usage to bias against that mem cgroup from oom kill in system oom conditions, the ability of users to completely

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-09 Thread David Rientjes
On Thu, 5 Oct 2017, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior doesn't suit well the system with many running > containers: > > 1) There is no fairn

Re: [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-09 Thread David Rientjes
hich will use this function to iterate over tasks belonging > to the root memcg. > > Signed-off-by: Roman Gushchin Acked-by: David Rientjes

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes
On Thu, 5 Oct 2017, Roman Gushchin wrote: > > This patchset exists because overcommit is real, exactly the same as > > overcommit within memcg hierarchies is real. 99% of the time we don't run > > into global oom because people aren't using their limits so it just works > > out. 1% of the tim

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes
On Thu, 5 Oct 2017, Johannes Weiner wrote: > > It is, because it can quite clearly be a DoS and was prevented with > > Roman's earlier design of iterating usage up the hierarchy and comparing > > siblings based on that criteria. I know exactly why he chose that > > implementation detail early o

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes
On Wed, 4 Oct 2017, Johannes Weiner wrote: > > By only considering leaf memcgs, does this penalize users if their memcg > > becomes oc->chosen_memcg purely because it has aggregated all of its > > processes to be members of that memcg, which would otherwise be the > > standard behavior? > > >

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread David Rientjes
On Wed, 4 Oct 2017, Roman Gushchin wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b4de17a78dc1..79f30c281185 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct > mem_cgroup *memcg) > return ret; > }

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread David Rientjes
On Wed, 4 Oct 2017, Roman Gushchin wrote: > > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct > > > *victim) > > > struct mm_struct *mm; > > > bool can_oom_reap = true; > > > > > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD) || > > > + victim->s

Re: [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-04 Thread David Rientjes
On Wed, 4 Oct 2017, Roman Gushchin wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index d5f3a62887cf..b4de17a78dc1 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct > mem_cgroup *dead_memcg) > * value, the fun

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-26 Thread David Rientjes
On Tue, 26 Sep 2017, Michal Hocko wrote: > > No, I agree that we shouldn't compare sibling memory cgroups based on > > different criteria depending on whether group_oom is set or not. > > > > I think it would be better to compare siblings based on the same criteria > > independent of group_oom

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-25 Thread David Rientjes
On Mon, 25 Sep 2017, Johannes Weiner wrote: > > True but we want to have the semantic reasonably understandable. And it > > is quite hard to explain that the oom killer hasn't selected the largest > > memcg just because it happened to be in a deeper hierarchy which has > > been configured to cover

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-23 Thread David Rientjes
On Fri, 22 Sep 2017, Tejun Heo wrote: > > If you have this low priority maintenance job charging memory to the high > > priority hierarchy, you're already misconfigured unless you adjust > > /proc/pid/oom_score_adj because it will oom kill any larger process than > > itself in today's kernels a

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-22 Thread David Rientjes
On Thu, 21 Sep 2017, Johannes Weiner wrote: > > The issue is that if you opt-in to the new feature, then you are forced to > > change /proc/pid/oom_score_adj of all processes attached to a cgroup that > > you do not want oom killed based on size to be oom disabled. > > You're assuming that most

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-22 Thread David Rientjes
On Fri, 22 Sep 2017, Tejun Heo wrote: > > It doesn't have anything to do with my particular usecase, but rather the > > ability of userspace to influence the decisions of the kernel. Previous > > to this patchset, when selection is done based on process size, userspace > > has full control ove

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-21 Thread David Rientjes
On Thu, 21 Sep 2017, Johannes Weiner wrote: > That's a ridiculous nak. > > The fact that this patch series doesn't solve your particular problem > is not a technical argument to *reject* somebody else's work to solve > a different problem. It's not a regression when behavior is completely > uncha

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-21 Thread David Rientjes
On Mon, 18 Sep 2017, Roman Gushchin wrote: > > As said in other email. We can make priorities hierarchical (in the same > > sense as hard limit or others) so that children cannot override their > > parent. > > You mean they can set the knob to any value, but parent's value is enforced, > if it's

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-21 Thread David Rientjes
On Wed, 20 Sep 2017, Roman Gushchin wrote: > > It's actually much more complex because in our environment we'd need an > > "activity manager" with CAP_SYS_RESOURCE to control oom priorities of user > > subcontainers when today it need only be concerned with top-level memory > > cgroups. Users

Re: [PATCH 2/2] mm: oom: show unreclaimable slab info when kernel panic

2017-09-21 Thread David Rientjes
On Thu, 21 Sep 2017, Yang Shi wrote: > Kernel may panic when oom happens without killable process sometimes it > is caused by huge unreclaimable slabs used by kernel. > > Although kdump could help debug such problem, however, kdump is not > available on all architectures and it might be malfuncti

Re: [PATCH 2/2] mm: oom: show unreclaimable slab info when kernel panic

2017-09-20 Thread David Rientjes
On Thu, 21 Sep 2017, Yang Shi wrote: > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 99736e0..173c423 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -43,6 +43,7 @@ > > #include > #include "internal.h" > +#include "slab.h" > > #define CREATE_TRACE_POINTS > #include > @@ -427

Re: [PATCH 1/2] tools: slabinfo: add "-U" option to show unreclaimable slabs only

2017-09-20 Thread David Rientjes
ment, NULL, 'L'}, > { "Xtotals", no_argument, NULL, 'X'}, > { "Bytes", no_argument, NULL, 'B'}, > + { "unreclaim", no_argument, NULL, 'U'}, > { NULL, 0, NULL, 0 } > }; > Same. After that: Acked-by: David Rientjes Also, you may find it better to remove the "RFC" tag from the patchset's header email since it's agreed that we want this.

Re: [PATCH 2/2] mm: oom: show unreclaimable slab info when kernel panic

2017-09-19 Thread David Rientjes
On Wed, 20 Sep 2017, Yang Shi wrote: > > > --- a/mm/slab_common.c > > > +++ b/mm/slab_common.c > > > @@ -35,6 +35,8 @@ > > > static DECLARE_WORK(slab_caches_to_rcu_destroy_work, > > > slab_caches_to_rcu_destroy_workfn); > > > +#define K(x) ((x)/1024) > > > + > > > /* >

Re: [PATCH 2/2] mm: oom: show unreclaimable slab info when kernel panic

2017-09-19 Thread David Rientjes
On Tue, 19 Sep 2017, Yang Shi wrote: > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -35,6 +35,8 @@ > static DECLARE_WORK(slab_caches_to_rcu_destroy_work, > slab_caches_to_rcu_destroy_workfn); > > +#define K(x) ((x)/1024) > + > /* > * Set of flags that will prevent s

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-19 Thread David Rientjes
On Fri, 15 Sep 2017, Roman Gushchin wrote: > > > > But then you just enforce a structural restriction on your configuration > > > > because > > > > root > > > > / \ > > > > A  D > > > > /\ > > > > B C > > > > > > > > is a different thing than > > > >

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-19 Thread David Rientjes
On Mon, 18 Sep 2017, Michal Hocko wrote: > > > > But then you just enforce a structural restriction on your configuration > > > > because > > > > root > > > > / \ > > > > A  D > > > > /\ > > > > B C > > > > > > > > is a different thing than > > > >

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-15 Thread David Rientjes
On Fri, 15 Sep 2017, Roman Gushchin wrote: > > But then you just enforce a structural restriction on your configuration > > because > > root > > / \ > > A  D > > /\ > > B C > > > > is a different thing than > > root > > / | \ > > B C D > >

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-14 Thread David Rientjes
On Thu, 14 Sep 2017, Michal Hocko wrote: > > It is certainly possible to add oom priorities on top before it is merged, > > but I don't see why it isn't part of the patchset. > > Because the semantic of the priority for non-leaf memcgs is not fully > clear and I would rather have the core of the

Re: [v8 2/4] mm, oom: cgroup-aware OOM killer

2017-09-13 Thread David Rientjes
On Mon, 11 Sep 2017, Roman Gushchin wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 15af3da5af02..da2b12ea4667 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2661,6 +2661,231 @@ static inline bool memcg_has_children(struct > mem_cgroup *memcg) > return ret; >

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-13 Thread David Rientjes
On Wed, 13 Sep 2017, Michal Hocko wrote: > > > This patchset makes the OOM killer cgroup-aware. > > > > > > v8: > > > - Do not kill tasks with OOM_SCORE_ADJ -1000 > > > - Make the whole thing opt-in with cgroup mount option control > > > - Drop oom_priority for further discussions > > > >

Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-09-12 Thread David Rientjes
On Tue, 12 Sep 2017, Roman Gushchin wrote: > > I can't imagine that Tejun would be happy with a new mount option, > > especially when it's not required. > > > > OOM behavior does not need to be defined at mount time and for the entire > > hierarchy. It's possible to very easily implement a tun

Re: [patch 1/2] mm, compaction: kcompactd should not ignore pageblock skip

2017-09-11 Thread David Rientjes
On Mon, 11 Sep 2017, Vlastimil Babka wrote: > > A follow-up change will set the pageblock skip for this memory since it is > > never useful for either scanner. > > """ > > > >> Also there's now a danger that in cases where there's no direct > >> compaction happening (just kcompactd), nothing wil

Re: [patch 2/2] mm, compaction: persistently skip hugetlbfs pageblocks

2017-09-11 Thread David Rientjes
On Mon, 11 Sep 2017, Vlastimil Babka wrote: > > Yes, any page where compound_order(page) == pageblock_order would probably > > benefit from the same treatment. I haven't encountered such an issue, > > however, so I thought it was best to restrict it only to hugetlb: hugetlb > > memory usually

Re: [v8 1/4] mm, oom: refactor the oom_kill_process() function

2017-09-11 Thread David Rientjes
m cgroup. We don't need to print > the debug information for the each task, as well as play > with task selection (considering task's children), > so we can't use the existing oom_kill_process(). > > Signed-off-by: Roman Gushchin > Cc: Michal Hocko > Cc: Vladimi

Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-09-11 Thread David Rientjes
On Mon, 11 Sep 2017, Roman Gushchin wrote: > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware > OOM killer. If not set, the OOM selection is performed in > a "traditional" per-process way. > > The behavior can be changed dynamically by remounting the cgroupfs. I can't imagine t

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-11 Thread David Rientjes
On Mon, 11 Sep 2017, Roman Gushchin wrote: > This patchset makes the OOM killer cgroup-aware. > > v8: > - Do not kill tasks with OOM_SCORE_ADJ -1000 > - Make the whole thing opt-in with cgroup mount option control > - Drop oom_priority for further discussions Nack, we specifically require

Re: [patch 2/2] mm, compaction: persistently skip hugetlbfs pageblocks

2017-09-10 Thread David Rientjes
On Fri, 1 Sep 2017, Vlastimil Babka wrote: > The pageblock_skip_persistent() function checks for HugeTLB pages of pageblock > order. When clearing pageblock skip bits for compaction, the bits are not > cleared for such pageblocks, because they cannot contain base pages suitable > for migration, no

Re: [patch 2/2] mm, compaction: persistently skip hugetlbfs pageblocks

2017-09-10 Thread David Rientjes
On Wed, 23 Aug 2017, Vlastimil Babka wrote: > > diff --git a/mm/compaction.c b/mm/compaction.c > > --- a/mm/compaction.c > > +++ b/mm/compaction.c > > @@ -217,6 +217,20 @@ static void reset_cached_positions(struct zone *zone) > > pageblock_start_pfn(zone_end_pfn(zone) -

Re: [patch 1/2] mm, compaction: kcompactd should not ignore pageblock skip

2017-09-10 Thread David Rientjes
On Wed, 23 Aug 2017, Vlastimil Babka wrote: > On 08/16/2017 01:39 AM, David Rientjes wrote: > > Kcompactd is needlessly ignoring pageblock skip information. It is doing > > MIGRATE_SYNC_LIGHT compaction, which is no more powerful than > > MIGRATE_SYNC compaction. > >

Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-09 Thread David Rientjes
On Fri, 8 Sep 2017, Christopher Lameter wrote: > Ok. Certainly there were scalability issues (lots of them) and the sysctl > may have helped there if set globally. But the ability to kill the > allocating tasks was primarily used in cpusets for constrained allocation. > I remember discussing it

Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-07 Thread David Rientjes
On Thu, 7 Sep 2017, Christopher Lameter wrote: > > I am not sure this is how things evolved actually. This is way before > > my time so my git log interpretation might be imprecise. We do have > > oom_badness heuristic since out_of_memory has been introduced and > > oom_kill_allocating_task has be

Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-07 Thread David Rientjes
On Thu, 7 Sep 2017, Christopher Lameter wrote: > > SGI required it when it was introduced simply to avoid the very expensive > > tasklist scan. Adding Christoph Lameter to the cc since he was involved > > back then. > > Really? From what I know and worked on way back when: The reason was to be >

Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-06 Thread David Rientjes
with the overall patchset though :) > To make a first step towards deprecation, let's warn potential > users about deprecation plans. > > Signed-off-by: Roman Gushchin > Cc: Andrew Morton > Cc: Michal Hocko > Cc: Johannes Weiner > Cc: David Rientjes > Cc: Vladimi

Re: [PATCH] mm: kvfree the swap cluster info if the swap file is unsatisfactory

2017-09-01 Thread David Rientjes
ll the vfree calls to use kvfree. > Hopefully this can make it into 4.13. Fixes: 54f180d3c181 ("mm, swap: use kvzalloc to allocate some swap data structures") Cc: stable@vger.kernel.org [4.12] > Found by running generic/357 from xfstests. > > Signed-off-by: Darrick J. Won

Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-31 Thread David Rientjes
On Thu, 31 Aug 2017, Roman Gushchin wrote: > So, it looks to me that we're close to an acceptable version, > and the only remaining question is the default behavior > (when oom_group is not set). > Nit: without knowledge of the implementation, I still don't think I would know what an "out of me

Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-30 Thread David Rientjes
On Wed, 30 Aug 2017, Roman Gushchin wrote: > I've spent some time to implement such a version. > > It really became shorter and more existing code was reused, > however I've met a couple of serious issues: > > 1) Simple summing of per-task oom_score doesn't make sense. > First, we calculate

Re: [v6 3/4] mm, oom: introduce oom_priority for memory cgroups

2017-08-28 Thread David Rientjes
On Thu, 24 Aug 2017, Roman Gushchin wrote: > > > Do you have an example, which can't be effectively handled by an approach > > > I'm suggesting? > > > > No, I do not have any which would be _explicitly_ requested but I do > > envision new requirements will emerge. The most probable one would be >

Re: [PATCH] mm, madvise: Ensure poisoned pages are removed from per-cpu lists

2017-08-28 Thread David Rientjes
h a rare operation. > > Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages > allocated from the PCP") > Reported-and-tested-by: Wang, Wendy > Cc: stable@kernel.org > Signed-off-by: Mel Gorman Acked-by: David Rientjes

Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-23 Thread David Rientjes
On Wed, 23 Aug 2017, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior is a poor fit for systems with many running > containers: > > 1) There is no fair

Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-23 Thread David Rientjes
On Wed, 23 Aug 2017, Roman Gushchin wrote: > > It's better to have newbies consult the documentation once than making > > everybody deal with long and cumbersome names for the rest of time. > > > > Like 'ls' being better than 'read_and_print_directory_contents'. > > I don't think it's a good arg

[patch] fs, proc: unconditional cond_resched when reading smaps

2017-08-21 Thread David Rientjes
ays reschedule in smaps_pte_range() if necessary since the pagewalk iteration can be expensive. Signed-off-by: David Rientjes --- fs/proc/task_mmu.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/p
