Re: [PATCH v4] mm: remove odd HAVE_PTE_SPECIAL

2018-04-12 Thread David Rientjes
On Thu, 12 Apr 2018, Laurent Dufour wrote:

> Remove the additional define HAVE_PTE_SPECIAL and rely directly on
> CONFIG_ARCH_HAS_PTE_SPECIAL.
> 
> There is no functional change introduced by this patch.
> 
> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>

Acked-by: David Rientjes <rient...@google.com>


Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL

2018-04-10 Thread David Rientjes
On Tue, 10 Apr 2018, Laurent Dufour wrote:

> > On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
> >>  arch/powerpc/include/asm/pte-common.h  | 3 ---
> >>  arch/riscv/Kconfig | 1 +
> >>  arch/s390/Kconfig  | 1 +
> > 
> > You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
> > arch/riscv/include/asm/pgtable-bits.h
> 
> Damned!
> Thanks for catching it.
> 

Squashing the two patches together at least allowed it to be caught 
easily.  After it's fixed, feel free to add

Acked-by: David Rientjes <rient...@google.com>

Thanks for doing this!


Re: [PATCH 2/3] mm: replace __HAVE_ARCH_PTE_SPECIAL

2018-04-09 Thread David Rientjes
On Mon, 9 Apr 2018, Christoph Hellwig wrote:

> > -#ifdef __HAVE_ARCH_PTE_SPECIAL
> > +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >  # define HAVE_PTE_SPECIAL 1
> >  #else
> >  # define HAVE_PTE_SPECIAL 0
> 
> I'd say kill this odd indirection and just use the
> CONFIG_ARCH_HAS_PTE_SPECIAL symbol directly.
> 
> 

Agree, and I think it would be easier to audit/review if patches 1 and 3 
were folded together to see the relationship between the newly added 
selects and the #defines they replace.  Otherwise, looks good!
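
For illustration, the suggested cleanup looks roughly like the following at
a call site; whether the final series uses #ifdef blocks or IS_ENABLED() is
up to that series, so treat this as a sketch:

	/* before: a second macro derived from the Kconfig symbol */
	#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
	# define HAVE_PTE_SPECIAL 1
	#else
	# define HAVE_PTE_SPECIAL 0
	#endif

		if (HAVE_PTE_SPECIAL) {
			/* pte_special() is meaningful on this arch */
		}

	/*
	 * after: test the Kconfig symbol directly; IS_ENABLED() evaluates
	 * to a 0/1 constant, so both branches still get compiled and
	 * type-checked even when the option is not selected.
	 */
		if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
			/* pte_special() is meaningful on this arch */
		}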


Re: [patch 1/2] mm, page_alloc: extend kernelcore and movablecore for percent

2018-02-15 Thread David Rientjes
On Thu, 15 Feb 2018, Matthew Wilcox wrote:

> What I was proposing was an intermediate page allocator where slab would
> request 2MB for its own uses all at once, then allocate pages from that to
> individual slabs, so allocating a kmalloc-32 object and a dentry object
> would result in 510 pages of memory still being available for any slab
> that needed it.
> 

A type of memory arena built between the page allocator and slab 
allocator.

The issue I see with this is that there will eventually be low-on-memory 
situations where memory needs to be reclaimed from these arena pages.  We 
can free individual pages back to the buddy allocator, but we have no 
control over whether an entire pageblock can be freed back.  So now we 
have MIGRATE_MOVABLE or MIGRATE_UNMOVABLE pageblocks with some user pages 
and some slab pages, and we've reached the same fragmentation issue in a 
different way.  After that, it becomes more difficult for the slab 
allocator to request a page of pageblock_order.
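
To make the bookkeeping concrete, the arena idea is roughly the following.
This is a userspace-style sketch with made-up names and sizes (struct
arena, ARENA_PAGES, arena_alloc_page(), and so on are hypothetical), not
code from any posted patch:

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdlib.h>

	#define ARENA_PAGE_SIZE	4096
	#define ARENA_PAGES	512	/* one 2MB chunk worth of 4KB pages */

	/* One chunk requested from the page allocator, carved up for slab. */
	struct arena {
		void *base;		/* stand-in for a pageblock_order allocation */
		bool in_use[ARENA_PAGES];
		size_t used;
	};

	static struct arena *arena_create(void)
	{
		struct arena *a = calloc(1, sizeof(*a));

		if (!a)
			return NULL;
		a->base = aligned_alloc(ARENA_PAGE_SIZE,
					(size_t)ARENA_PAGES * ARENA_PAGE_SIZE);
		if (!a->base) {
			free(a);
			return NULL;
		}
		return a;
	}

	static void *arena_alloc_page(struct arena *a)
	{
		for (size_t i = 0; i < ARENA_PAGES; i++) {
			if (!a->in_use[i]) {
				a->in_use[i] = true;
				a->used++;
				return (char *)a->base + i * ARENA_PAGE_SIZE;
			}
		}
		return NULL;	/* arena exhausted: grab another 2MB chunk */
	}

	static void arena_free_page(struct arena *a, void *page)
	{
		size_t i = ((char *)page - (char *)a->base) / ARENA_PAGE_SIZE;

		a->in_use[i] = false;
		a->used--;
		/*
		 * The whole 2MB chunk can only be handed back once a->used
		 * reaches zero; a single long-lived slab page pins the entire
		 * chunk, which is the reclaim concern described above.
		 */
	}

The kmalloc-32 and dentry example above maps to two arena_alloc_page() 
calls against the same arena, leaving 510 pages for any other slab.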

Other than the stranding issue of MIGRATE_UNMOVABLE pages on pcps, the 
page allocator currently does well in falling back to other migratetypes 
but there isn't any type of slab reclaim or defragmentation done in the 
background to try to free up as much memory from that 
now-MIGRATE_UNMOVABLE pageblock as possible.  We have patches that do 
that, but as I mentioned before it can affect the performance of the page 
allocator because it drains pcps on fallback and it does kcompactd 
compaction in the background even if you don't need order-9 memory later 
(or you've defragmented needlessly when more slab is just going to be 
allocated anyway).


Re: [patch 1/2] mm, page_alloc: extend kernelcore and movablecore for percent

2018-02-15 Thread David Rientjes
On Thu, 15 Feb 2018, Michal Hocko wrote:

> > When the amount of kernel 
> > memory is well bounded for certain systems, it is better to aggressively 
> > reclaim from existing MIGRATE_UNMOVABLE pageblocks rather than eagerly 
> > fall back to others.
> > 
> > We have additional patches that help with this fragmentation if you're 
> > interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE 
> > pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and 
> > draining of pcp lists back to the zone free area to prevent stranding.
> 
> Yes, I think we need a proper fix. (Ab)using zone_movable for this
> usecase is just sad.
> 

It's a hard balance to achieve between a fast page allocator with per-cpu 
pagesets, reducing fragmentation of unmovable memory, and the performance 
impact of any fix to reduce that fragmentation for users currently 
unaffected.  Our patches to kick kcompactd for MIGRATE_UNMOVABLE 
pageblocks on fallback would be a waste unless you have a ton of anonymous 
memory you want backed by thp.

If hugepages are the main motivation for reducing the fragmentation, 
hugetlbfs could be suggested because it would give us more runtime control 
and we could leave surplus pages sitting in the free pool unless reclaimed 
under memory pressure.  That works fine in dedicated environments where we 
know how much hugetlb to reserve; if we give it back under memory pressure, 
it becomes hard to reallocate the high number of hugepages we want (>95% 
of system memory).  It's much sloppier in shared environments where the 
amount of hugepages is unknown.

And of course this doesn't address memory that is __GFP_MOVABLE at 
allocation time but is later pinned in place, where the pin prevents it 
from being migrated during memory compaction; that can still be a problem 
with ZONE_MOVABLE.  It would be nice to have a solution where this memory 
can be annotated to come from a non-MIGRATE_MOVABLE pageblock, if possible.


Re: [patch 1/2] mm, page_alloc: extend kernelcore and movablecore for percent

2018-02-14 Thread David Rientjes
On Wed, 14 Feb 2018, Michal Hocko wrote:

> I do not have any objections regarding the extension. What I am more
> interested in is _why_ people are still using this command line
> parameter at all these days. Why would anybody want to introduce lowmem
> issues from 32b days. I can see the CMA/Hotplug usecases for
> ZONE_MOVABLE but those have their own ways to define zone movable. I was
> tempted to simply remove the kernelcore already. Could you be more
> specific what is your usecase which triggered a need of an easier
> scaling of the size?

Fragmentation of non-__GFP_MOVABLE pages due to low-on-memory situations 
can pollute most pageblocks on the system: as much as 1GB of slab, for 
example, can end up fragmented over 128GB of memory.  When the amount of 
kernel memory is well bounded for certain systems, it is better to 
aggressively reclaim from existing MIGRATE_UNMOVABLE pageblocks rather 
than eagerly fall back to others.

We have additional patches that help with this fragmentation if you're 
interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE 
pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and 
draining of pcp lists back to the zone free area to prevent stranding.


[patch -mm] mm, page_alloc: extend kernelcore and movablecore for percent fix

2018-02-13 Thread David Rientjes
Specify that movablecore= can use a percent value.

Remove comment about hugetlb pages not being movable per Mike.

Cc: Mike Kravetz <mike.krav...@oracle.com>
Signed-off-by: David Rientjes <rient...@google.com>
---
 .../admin-guide/kernel-parameters.txt | 22 +--
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1837,10 +1837,9 @@
 
ZONE_MOVABLE is used for the allocation of pages that
may be reclaimed or moved by the page migration
-   subsystem.  This means that HugeTLB pages may not be
-   allocated from this zone.  Note that allocations like
-   PTEs-from-HighMem still use the HighMem zone if it
-   exists, and the Normal zone if it does not.
+   subsystem.  Note that allocations like PTEs-from-HighMem
+   still use the HighMem zone if it exists, and the Normal
+   zone if it does not.
 
It is possible to specify the exact amount of memory in
the form of "nn[KMGTPE]", a percentage of total system
@@ -2353,13 +2352,14 @@
mousedev.yres=  [MOUSE] Vertical screen resolution, used for devices
reporting absolute coordinates, such as tablets
 
-   movablecore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
-   is similar to kernelcore except it specifies the
-   amount of memory used for migratable allocations.
-   If both kernelcore and movablecore is specified,
-   then kernelcore will be at *least* the specified
-   value but may be more. If movablecore on its own
-   is specified, the administrator must be careful
+   movablecore=[KNL,X86,IA-64,PPC]
+   Format: nn[KMGTPE] | nn%
+   This parameter is the complement to kernelcore=, it
+   specifies the amount of memory used for migratable
+   allocations.  If both kernelcore and movablecore is
+   specified, then kernelcore will be at *least* the
+   specified value but may be more.  If movablecore on its
+   own is specified, the administrator must be careful
that the amount of memory usable for all allocations
is not too small.
 


Re: [patch 1/2] mm, page_alloc: extend kernelcore and movablecore for percent

2018-02-13 Thread David Rientjes
On Tue, 13 Feb 2018, Mike Kravetz wrote:

> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1825,30 +1825,30 @@
> > keepinitrd  [HW,ARM]
> >  
> > kernelcore= [KNL,X86,IA-64,PPC]
> > -   Format: nn[KMGTPE] | "mirror"
> > -   This parameter
> > -   specifies the amount of memory usable by the kernel
> > -   for non-movable allocations.  The requested amount is
> > -   spread evenly throughout all nodes in the system. The
> > -   remaining memory in each node is used for Movable
> > -   pages. In the event, a node is too small to have both
> > -   kernelcore and Movable pages, kernelcore pages will
> > -   take priority and other nodes will have a larger number
> > -   of Movable pages.  The Movable zone is used for the
> > -   allocation of pages that may be reclaimed or moved
> > -   by the page migration subsystem.  This means that
> > -   HugeTLB pages may not be allocated from this zone.
> > -   Note that allocations like PTEs-from-HighMem still
> > -   use the HighMem zone if it exists, and the Normal
> > -   zone if it does not.
> > -
> > -   Instead of specifying the amount of memory (nn[KMGTPE]),
> > -   you can specify "mirror" option. In case "mirror"
> > +   Format: nn[KMGTPE] | nn% | "mirror"
> > +   This parameter specifies the amount of memory usable by
> > +   the kernel for non-movable allocations.  The requested
> > +   amount is spread evenly throughout all nodes in the
> > +   system as ZONE_NORMAL.  The remaining memory is used for
> > +   movable memory in its own zone, ZONE_MOVABLE.  In the
> > +   event, a node is too small to have both ZONE_NORMAL and
> > +   ZONE_MOVABLE, kernelcore memory will take priority and
> > +   other nodes will have a larger ZONE_MOVABLE.
> > +
> > +   ZONE_MOVABLE is used for the allocation of pages that
> > +   may be reclaimed or moved by the page migration
> > +   subsystem.  This means that HugeTLB pages may not be
> > +   allocated from this zone.  Note that allocations like
> > +   PTEs-from-HighMem still use the HighMem zone if it
> > +   exists, and the Normal zone if it does not.
> 
> I know you are just updating the documentation for the new ability to
> specify a percentage.  However, while looking at this I noticed that
> the existing description is out of date.  HugeTLB pages CAN be treated
> as movable and allocated from ZONE_MOVABLE.
> 
> If you have to respin, could you drop that line while making this change?
> 

Hi Mike,

It's merged in -mm, so perhaps no respin is necessary.  I think a general 
cleanup to this area regarding your work with hugetlb pages would be good.

> > +
> > +   It is possible to specify the exact amount of memory in
> > +   the form of "nn[KMGTPE]", a percentage of total system
> > +   memory in the form of "nn%", or "mirror".  If "mirror"
> > option is specified, mirrored (reliable) memory is used
> > for non-movable allocations and remaining memory is used
> > -   for Movable pages. nn[KMGTPE] and "mirror" are exclusive,
> > -   so you can NOT specify nn[KMGTPE] and "mirror" at the same
> > -   time.
> > +   for Movable pages.  "nn[KMGTPE]", "nn%", and "mirror"
> > +   are exclusive, so you cannot specify multiple forms.
> >  
> > kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
> > Format: <Controller#>[,poll interval]
> 
> Don't you need to make the same type percentage changes for 'movablecore='?
> 

The majority of the movablecore= documentation simply refers to the 
kernelcore= option as its complement, so I'm not sure that we need to go 
in-depth into what the percentage specifiers mean for both options.


Re: [patch 1/2] mm, page_alloc: extend kernelcore and movablecore for percent

2018-02-13 Thread David Rientjes
On Tue, 13 Feb 2018, Andrew Morton wrote:

> > Both kernelcore= and movablecore= can be used to define the amount of
> > ZONE_NORMAL and ZONE_MOVABLE on a system, respectively.  This requires
> > the system memory capacity to be known when specifying the command line,
> > however.
> > 
> > This introduces the ability to define both kernelcore= and movablecore=
> > as a percentage of total system memory.  This is convenient for systems
> > software that wants to define the amount of ZONE_MOVABLE, for example, as
> > a proportion of a system's memory rather than a hardcoded byte value.
> > 
> > To define the percentage, the final character of the parameter should be
> > a '%'.
> 
> Is this fine-grained enough?  We've had percentage-based tunables in
> the past, and 10 years later when systems are vastly larger, 1% is too
> much.
> 

They still have the (current) ability to define the exact number of bytes 
down to page-sized granularity, whereas 1% would yield roughly 40GB on a 
4TB system.  I'm not sure that people will want any finer-grained control 
when defining the proportion of the system for kernelcore.  They do have 
the ability with the existing interface, though, if they want to be that 
precise.

(This is a cop-out for not implementing some fractional percentage parser, 
 although that would be possible as a more complete solution.)


[patch 1/2] mm, page_alloc: extend kernelcore and movablecore for percent

2018-02-12 Thread David Rientjes
Both kernelcore= and movablecore= can be used to define the amount of
ZONE_NORMAL and ZONE_MOVABLE on a system, respectively.  This requires
the system memory capacity to be known when specifying the command line,
however.

This introduces the ability to define both kernelcore= and movablecore=
as a percentage of total system memory.  This is convenient for systems
software that wants to define the amount of ZONE_MOVABLE, for example, as
a proportion of a system's memory rather than a hardcoded byte value.

To define the percentage, the final character of the parameter should be
a '%'.

Signed-off-by: David Rientjes <rient...@google.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 44 -
 mm/page_alloc.c | 43 +++-
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1825,30 +1825,30 @@
keepinitrd  [HW,ARM]
 
kernelcore= [KNL,X86,IA-64,PPC]
-   Format: nn[KMGTPE] | "mirror"
-   This parameter
-   specifies the amount of memory usable by the kernel
-   for non-movable allocations.  The requested amount is
-   spread evenly throughout all nodes in the system. The
-   remaining memory in each node is used for Movable
-   pages. In the event, a node is too small to have both
-   kernelcore and Movable pages, kernelcore pages will
-   take priority and other nodes will have a larger number
-   of Movable pages.  The Movable zone is used for the
-   allocation of pages that may be reclaimed or moved
-   by the page migration subsystem.  This means that
-   HugeTLB pages may not be allocated from this zone.
-   Note that allocations like PTEs-from-HighMem still
-   use the HighMem zone if it exists, and the Normal
-   zone if it does not.
-
-   Instead of specifying the amount of memory (nn[KMGTPE]),
-   you can specify "mirror" option. In case "mirror"
+   Format: nn[KMGTPE] | nn% | "mirror"
+   This parameter specifies the amount of memory usable by
+   the kernel for non-movable allocations.  The requested
+   amount is spread evenly throughout all nodes in the
+   system as ZONE_NORMAL.  The remaining memory is used for
+   movable memory in its own zone, ZONE_MOVABLE.  In the
+   event, a node is too small to have both ZONE_NORMAL and
+   ZONE_MOVABLE, kernelcore memory will take priority and
+   other nodes will have a larger ZONE_MOVABLE.
+
+   ZONE_MOVABLE is used for the allocation of pages that
+   may be reclaimed or moved by the page migration
+   subsystem.  This means that HugeTLB pages may not be
+   allocated from this zone.  Note that allocations like
+   PTEs-from-HighMem still use the HighMem zone if it
+   exists, and the Normal zone if it does not.
+
+   It is possible to specify the exact amount of memory in
+   the form of "nn[KMGTPE]", a percentage of total system
+   memory in the form of "nn%", or "mirror".  If "mirror"
option is specified, mirrored (reliable) memory is used
for non-movable allocations and remaining memory is used
-   for Movable pages. nn[KMGTPE] and "mirror" are exclusive,
-   so you can NOT specify nn[KMGTPE] and "mirror" at the same
-   time.
+   for Movable pages.  "nn[KMGTPE]", "nn%", and "mirror"
+   are exclusive, so you cannot specify multiple forms.
 
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -272,7 +272,9 @@ static unsigned long __meminitdata dma_reserve;
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelco
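
The mm/page_alloc.c hunk is truncated above.  For context, a trailing '%'
can be accepted at parse time along the following lines, modeled on
cmdline_parse_core() and the required_*_percent variables that appear in
patch 2/2 below; treat the details as an approximation rather than the
literal hunk:

	static int __init cmdline_parse_core(char *p, unsigned long *core,
					     unsigned long *percent)
	{
		unsigned long long coremem;
		char *endptr;

		if (!p)
			return -EINVAL;

		/* Value may be a percentage of total memory, otherwise bytes */
		coremem = simple_strtoull(p, &endptr, 0);
		if (*endptr == '%') {
			/* Paranoid check for percent values greater than 100 */
			WARN_ON(coremem > 100);
			*percent = coremem;
		} else {
			coremem = memparse(p, &p);
			/* Paranoid check that we don't exceed the limit */
			WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX);
			*core = coremem >> PAGE_SHIFT;
			*percent = 0UL;
		}
		return 0;
	}

With something like this, kernelcore=25% on a 4TB machine reserves roughly
1TB for ZONE_NORMAL without the administrator needing to know the exact
capacity up front.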

[patch 2/2] mm, page_alloc: move mirrored_kernelcore to __meminitdata

2018-02-12 Thread David Rientjes
mirrored_kernelcore can be in __meminitdata, so move it there.

At the same time, fix up the section specifiers to be placed after the name 
of the variable, per checkpatch.
---
 mm/page_alloc.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -264,19 +264,19 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 int watermark_scale_factor = 10;
 
-static unsigned long __meminitdata nr_kernel_pages;
-static unsigned long __meminitdata nr_all_pages;
-static unsigned long __meminitdata dma_reserve;
+static unsigned long nr_kernel_pages __meminitdata;
+static unsigned long nr_all_pages __meminitdata;
+static unsigned long dma_reserve __meminitdata;
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
-static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
-static unsigned long __initdata required_kernelcore;
+static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __meminitdata;
+static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __meminitdata;
+static unsigned long required_kernelcore __initdata;
 static unsigned long required_kernelcore_percent __initdata;
-static unsigned long __initdata required_movablecore;
+static unsigned long required_movablecore __initdata;
 static unsigned long required_movablecore_percent __initdata;
-static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
-static bool mirrored_kernelcore;
+static unsigned long zone_movable_pfn[MAX_NUMNODES] __meminitdata;
+static bool mirrored_kernelcore __meminitdata;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;


Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable

2018-02-01 Thread David Rientjes
On Wed, 31 Jan 2018, Michal Hocko wrote:

> > > > >         root
> > > > >        /  |  \
> > > > >       A   B   C
> > > > >      / \     / \
> > > > >     D   E   F   G
> > > > > 
> > > > > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> > > > > 
> > > > 
> > > > At each level of the hierarchy, memory.oom_policy compares immediate 
> > > > children, it's the only way that an admin can lock in a specific oom 
> > > > policy like "tree" and then delegate the subtree to the user.  If 
> > > > you've 
> > > > configured it as above, comparing A and C should be the same based on 
> > > > the 
> > > > cumulative usage of their child mem cgroups.
> > > 
> It seems I am still not clear with my question. What kind of difference
> does policy=cgroup vs. none make on A? Also what kind of difference does it
> make when a leaf node has cgroup policy?
> 

If A has an oom policy of "cgroup", it will compare the local usage of 
D vs E; "tree" would be the same since neither descendant has child 
cgroups.  If A has an oom policy of "none", it would compare processes 
attached to D and E and respect /proc/pid/oom_score_adj.  This allows 
opting in to and out of cgroup aware selection not only for the whole 
system but also per subtree.

> > Hmm, I'm not sure why we would limit memory.oom_group to any policy.  Even 
> > if we are selecting a process, even without selecting cgroups as victims, 
> > killing a process may still render an entire cgroup useless and it makes 
> > sense to kill all processes in that cgroup.  If an unlucky process is 
> > selected with today's heuristic of oom_badness() or with a "none" policy 
> > with my patchset, I don't see why we can't enable the user to kill all 
> > other processes in the cgroup.  It may not make sense for some trees, 
> > but I think it could be useful for others.
> 
> My intuition screams here. I will think about this some more but I would
> be really curious about any sensible usecase when you want to sacrifice the
> whole gang just because of one process compared to other processes or
> cgroups is too large. Do you see how you are mixing entities here?
> 

It's a property of the workload that has nothing to do with selection.  
Regardless of how a victim is selected, we need a victim.  The workload may 
be able to tolerate the loss of that process, which need not even be the 
largest memory hogging process based on /proc/pid/oom_score_adj (periodic 
cleanups, logging, and stat collection are what I'm most familiar with).  
Or the process may be vital to the workload, and it's better to kill the 
entire job; it's highly dependent on what the job is.  There's a general 
usecase for memory.oom_group behavior without any selection changes: we've 
had a killall tunable for years and it is used by many customers for the 
same reason.  There's no reason for it to be coupled; it can exist 
independent of any cgroup aware selection.

> I do not understand. Get back to our example. Are you saying that G
> with none will enforce the none policy to C and root? If yes then this
> doesn't make any sense because you are not really able to delegate the
> oom policy down the tree at all. It would effectively make tree policy
> pointless.
> 

The oom policy of G is pointless since it has no child cgroups.  It can be 
"none", "cgroup", or "tree"; it doesn't matter because it's not the root of 
a subtree.  (The oom policy of the root mem cgroup is likewise irrelevant 
if there are no other cgroups.)  If G is oom, it kills the largest process, 
or everything if memory.oom_group is set, which in your example it is.

> I am skipping the rest of the following text because it is picking
> on details and the whole design is not clear to me. So could you start
> over documenting semantic and requirements. Ideally by describing:
> - how does the policy on the root of the OOM hierarchy controls the
>   selection policy

If "none", there's no difference than Linus's tree right now.  If 
"cgroup", it enables cgroup aware selection: it compares all cgroups on 
the system wrt local usage unless that cgroup has "tree" set in which case 
its usage is hierarchical.

> - how does the per-memcg policy act during the tree walk - for both
>   intermediate nodes and leafs

The oom policy is determined by the mem cgroup under oom, that is, the 
root of the subtree under oom; its policy dictates how to select a victim 
mem cgroup.

> - how does the oom killer act based on the selected memcg

That's the point about memory.oom_group: once a cgroup has been selected 
(if cgroup aware behavior is enabled for the oom subtree [could be root]), 
a memory hogging process attached to that subtree is killed, or everything 
is killed if memory.oom_group is enabled.

> - how do you compare tasks with memcgs
> 

You don't.  I think the misunderstanding is about what happens if the root 
of a subtree is "cgroup", for example, and a descendant has "none" enabled.  
The root is under oom, so it is comparing cgroups :)  "None" is only 
effective 

Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable

2018-01-30 Thread David Rientjes
On Tue, 30 Jan 2018, Michal Hocko wrote:

> > > So what is the actual semantic and scope of this policy. Does it apply
> > > only down the hierarchy. Also how do you compare cgroups with different
> > > policies? Let's say you have
> > >         root
> > >        /  |  \
> > >       A   B   C
> > >      / \     / \
> > >     D   E   F   G
> > > 
> > > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> > > 
> > 
> > At each level of the hierarchy, memory.oom_policy compares immediate 
> > children, it's the only way that an admin can lock in a specific oom 
> > policy like "tree" and then delegate the subtree to the user.  If you've 
> > configured it as above, comparing A and C should be the same based on the 
> > cumulative usage of their child mem cgroups.
> 
> So cgroup == tree if we are memcg aware OOM killing, right? Why do we
> need both then? Just to make memcg aware OOM killing possible?
> 

We need "tree" to account the usage of the subtree rather than simply the 
cgroup alone, but "cgroup" and "tree" are accounted with the same units.  
In your example, D and E are treated as individual memory consumers and C 
is treated as the sum of all subtree memory consumers.

If we have /students/michal and /students/david, and both of these are 
"cgroup" policy, as the current patchset in -mm implements, and you use 
1GB, but I create /students/david/{a,b,c,d} each with 512MB of usage, you 
always get oom killed.

If we both have "tree" policy, I always get oom killed because my usage is 
2GB.  /students/michal and /students/david are compared based on their 
total usage instead of each cgroup being an individual memory consumer.

This is impossible with what is in -mm.
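
Schematically, the only difference between the two policies is whose usage
is attributed to the cgroup being scored, something like the following;
memcg_local_usage() and the descendant iterator are placeholders rather
than the helpers actually used in -mm:

	static unsigned long memcg_oom_score(struct mem_cgroup *memcg)
	{
		/* "cgroup": the memcg is scored as an individual consumer */
		unsigned long score = memcg_local_usage(memcg);

		if (memcg->oom_policy == MEMCG_OOM_POLICY_TREE) {
			struct mem_cgroup *iter;

			/* "tree": the memcg and its subtree are indivisible */
			for_each_descendant_memcg(iter, memcg)
				score += memcg_local_usage(iter);
		}
		return score;
	}

In the example above, "cgroup" scores /students/david/{a,b,c,d} at 512MB 
each, while "tree" scores /students/david at 2GB.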

> > The policy for B hasn't been specified, but since it does not have any 
> > children "cgroup" and "tree" should be the same.
> 
> So now you have a killable cgroup selected by process criterion? That
> just doesn't make any sense. So I guess it would at least require to
> enforce (cgroup || tree) to allow oom_group.
> 

Hmm, I'm not sure why we would limit memory.oom_group to any policy.  Even 
if we are selecting a process, even without selecting cgroups as victims, 
killing a process may still render an entire cgroup useless and it makes 
sense to kill all processes in that cgroup.  If an unlucky process is 
selected with today's heuristic of oom_badness() or with a "none" policy 
with my patchset, I don't see why we can't enable the user to kill all 
other processes in the cgroup.  It may not make sense for some trees, 
but I think it could be useful for others.

> > Right, a policy of "none" reverts its subtree back to per-process 
> > comparison if you are either not using the cgroup aware oom killer or your 
> > subtree is not using the cgroup aware oom killer.
> 
> So how are you going to compare none cgroups with those that consider
> full memcg or hierarchy (cgroup, tree)? Are you going to consider
> oom_score_adj?
> 

No, I think it would make sense to make the restriction that to set 
"none", the ancestor mem cgroups would also need the same policy, which is 
to select the largest process while still respecting 
/proc/pid/oom_score_adj.

> > In that case, mem_cgroup_select_oom_victim() always 
> > returns false and the value of memory.oom_group is ignored.  I agree that 
> > it's weird in -mm and there's nothing preventing us from separating 
> > memory.oom_group from the cgroup aware oom killer and allowing it to be 
> > set regardless of a selection change.
> 
> it is not weird. I suspect you misunderstood the code and its intention.
> 

We agree that memory.oom_group and a selection logic are two different 
things, and that's why I find it weird that memory.oom_group cannot be set 
without locking the entire hierarchy into a selection logic.  If you have 
a subtree oom, it makes sense for you to be able to kill all processes as 
a property of the workload.  That's independent of how the target mem 
cgroup was selected.  Regardless of the selection logic, we're going 
to target a specific mem cgroup for kill.  Choosing to kill one or all 
processes is still useful.

> > No, perhaps I wasn't clear in the documentation: the policy at each level 
> > of the hierarchy is specified by memory.oom_policy and compares its 
> > immediate children with that policy.  So the per-cgroup usages of A, B, and 
> > C are compared regardless of A, B, and C's own oom policies.
> 
> You are still operating in terms of levels. And that is rather confusing
> because we are operating on a _tree_ and that walk has to be independent
> on the way we walk that tree - i.e. whether we do DFS or BFS ordering.
> 

The selection criterion for the proposed policies, which can be extended, 
is to compare individual cgroups (for the "cgroup" policy) to determine 
the victim and, within that subtree, to allow the selection to be delegated 
further.  If the goal is the largest cgroup, all mem cgroups down the tree 
will have "cgroup" set.  If 

Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable

2018-01-29 Thread David Rientjes
On Fri, 26 Jan 2018, Michal Hocko wrote:

> > The cgroup aware oom killer is needlessly declared for the entire system
> > by a mount option.  It's unnecessary to force the system into a single
> > oom policy: either cgroup aware, or the traditional process aware.
> > 
> > This patch introduces a memory.oom_policy tunable for all mem cgroups.
> > It is currently a no-op: it can only be set to "none", which is its
> > default policy.  It will be expanded in the next patch to define cgroup
> > aware oom killer behavior.
> > 
> > This is an extensible interface that can be used to define cgroup aware
> > assessment of mem cgroup subtrees or the traditional process aware
> > assessment.
> > 
> 
> So what is the actual semantic and scope of this policy. Does it apply
> only down the hierarchy. Also how do you compare cgroups with different
> policies? Let's say you have
>         root
>        /  |  \
>       A   B   C
>      / \     / \
>     D   E   F   G
> 
> Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> 

At each level of the hierarchy, memory.oom_policy compares immediate 
children; it's the only way that an admin can lock in a specific oom 
policy like "tree" and then delegate the subtree to the user.  If you've 
configured it as above, comparing A and C should be the same based on the 
cumulative usage of their child mem cgroups.

The policy for B hasn't been specified, but since it does not have any 
children, "cgroup" and "tree" should be the same.

> Now we have the global OOM killer to choose a victim. From a quick
> glance over those patches, it seems that we will be comparing only
> tasks because root->oom_policy != MEMCG_OOM_POLICY_CGROUP. A, B and C
> policies are ignored.

Right, a policy of "none" reverts its subtree back to per-process 
comparison if you are either not using the cgroup aware oom killer or your 
subtree is not using the cgroup aware oom killer.

> Moreover If I select any of B's tasks then I will
> happily kill it breaking the expectation that the whole memcg will go
> away. Weird, don't you think? Or did I misunderstand?
> 

It's just as weird as the behavior of memory.oom_group today without using 
the mount option :)  In that case, mem_cgroup_select_oom_victim() always 
returns false and the value of memory.oom_group is ignored.  I agree that 
it's weird in -mm and there's nothing preventing us from separating 
memory.oom_group from the cgroup aware oom killer and allowing it to be 
set regardless of a selection change.  If memory.oom_group is set, and the 
kill originates from that mem cgroup, kill all processes attached to it 
and its subtree.

This is a criticism of the current implementation in -mm, however, my 
extension only respects its weirdness.

> So let's assume that root: cgroup. Then we are finally comparing
> cgroups. D, E, B, C. Of those D, E and F do not have any
> policy. Do they inherit their policy from the parent? If they don't then
> we should be comparing their tasks separately, no? The code disagrees
> because once we are in the cgroup mode, we do not care about separate
> tasks.
> 

No, perhaps I wasn't clear in the documentation: the policy at each level 
of the hierarchy is specified by memory.oom_policy and compares its 
immediate children with that policy.  So the per-cgroup usages of A, B, and 
C are compared regardless of A, B, and C's own oom policies.

> Let's say we choose C because it has the largest cumulative consumption.
> It is not oom_group so it will select a task from F, G. Again you are
> breaking oom_group policy of G if you kill a single task. So you would
> have to be recursive here. That sounds fixable though. Just be
> recursive.
> 

I fully agree, but that's (another) implementation detail of what is in 
-mm that isn't specified.  I think where you're going is the complete 
separation of mem cgroup selection from memory.oom_group.  I agree, and we 
can fix that.  memory.oom_group also shouldn't depend on any mount option; 
it can be set or unset depending on the properties of the workload.

> Then you say
> 
> > Another benefit of such an approach is that an admin can lock in a
> > certain policy for the system or for a mem cgroup subtree and can
> > delegate the policy decision to the user to determine if the kill should
> > originate from a subcontainer, as indivisible memory consumers
> > themselves, or selection should be done per process.
> 
> And the code indeed doesn't check oom_policy on each level of the
> hierarchy, unless I am missing something. So the subgroup is simply
> locked in to the oom_policy parent has chosen. That is not the case for
> the tree policy.
> 
> So look how we are comparing cumulative groups without policy with
> groups with policy with subtrees. Either I have grossly misunderstood
> something or this is massively inconsistent and it doesn't make much
> sense to me. Root memcg without cgroup policy will simply turn off the
> whole thing for the global OOM case. So you really 

Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-01-26 Thread David Rientjes
On Fri, 26 Jan 2018, Andrew Morton wrote:

> > -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
> 
> Approximately.  Let me put it another way: can we modify your patchset
> so that the mount option remains, and continues to have a sufficiently
> same effect?  For backward compatibility.
> 

The mount option would exist solely to set the oom policy of the root mem 
cgroup; it would lose its effect of mandating that policy for any subtree, 
since the policy would become configurable by the user if delegated.

Let me put it another way: if the cgroup aware oom killer is merged for 
4.16 without this patchset, userspace can reasonably infer the oom policy 
from checking how cgroups were mounted.  If my followup patchset were 
merged for 4.17, that's no longer valid and it becomes dependent on kernel 
version: it could have the "groupoom" mount option but be configured 
through the root mem cgroup's memory.oom_policy to not be cgroup aware at 
all.

That inconsistency, to me, is unfair to burden userspace with.

> > This, and fixes to fairly compare the root mem cgroup with leaf mem 
> > cgroups, are essential before the feature is merged otherwise it yields 
> > wildly unpredictable (and unexpected, since its interaction with 
> > oom_score_adj isn't documented) results as I already demonstrated where 
> > cgroups with 1GB of usage are killed instead of 6GB workers outside of 
> > that subtree.
> 
> OK, so Roman's new feature is incomplete: it satisfies some use cases
> but not others.  And we kinda have a plan to address the other use
> cases in the future.
> 

Those use cases are also undocumented, such that the user doesn't know the 
behavior they are opting into.  Nowhere in the patchset does it mention 
anything about oom_score_adj other than for oom disabled processes.  It 
doesn't mention that a per-process tunable now depends strictly on whether 
it is attached to root or not.  It specifies a fair comparison between the 
root mem cgroup and leaf mem cgroups, which is obviously contradicted by 
the implementation itself.  So I'm not sure the user would know which use 
cases it is valid for, which is why I've been trying to make it generally 
purposeful and documented.

> There's nothing wrong with that!  As long as we don't break existing
> setups while evolving the feature.  How do we do that?
> 

We'd break the setups that actually configure their cgroups and processes 
to abide by the current implementation, since we'd need to discount 
oom_score_adj from the root mem cgroup usage to fix it.

There hasn't been any feedback on v2 of my patchset that would suggest 
changes are needed.  I think we all recognize the shortcoming that it is 
addressing.  The only feedback on v1, the need for memory.oom_group as a 
separate tunable, has been addressed in v2.  What are we waiting for?


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-26 Thread David Rientjes
On Fri, 26 Jan 2018, Michal Hocko wrote:

> > Could you elaborate on why specifying the oom policy for the entire 
> > hierarchy as part of the root mem cgroup and also for individual subtrees 
> > is incomplete?  It allows admins to specify and delegate policy decisions 
> > to subtrees owners as appropriate.  It addresses your concern in the 
> > /admins and /students example.  It addresses my concern about evading the 
> > selection criteria simply by creating child cgroups.  It appears to be a 
> > win-win.  What is incomplete or are you concerned about?
> 
> I will get back to this later. I am really busy these days. This is not
> a trivial thing at all.
> 

Please follow-up in the v2 patchset when you have time.

> Most usecases I've ever seen usually use oom_score_adj only to disable
> the oom killer for a particular service. In those case the current
> heuristic works reasonably well.
> 

I'm not familiar with the workloads you have worked with that use 
oom_score_adj.  We use it to prefer that a subset of processes be chosen 
first and another subset last.  I don't expect this to be a highly 
specialized usecase; it's the purpose of the tunable.

The fact remains that oom_score_adj tuning is only effective with the 
current implementation when the process is attached to the root mem 
cgroup, in an undocumented way; the preference or bias immediately changes 
as soon as it is attached to a cgroup, even if it's the only non-root mem 
cgroup on the system.

> > That's because per-process usage and oom_score_adj are only relevant  
> > for the root mem cgroup and irrelevant when attached to a leaf.   
> 
> This is the simplest implementation. You could go and ignore
> oom_score_adj on root tasks. Would it be much better? Should you ignore
> oom disabled tasks? Should you consider kernel memory footprint of those
> tasks? Maybe we will realize that we simply have to account root memcg
> like any other memcg.  We used to do that but it has been reverted due
> to performance footprint. There are more questions to answer I believe
> but the most important one is whether actually any _real_ user cares.
> 

The goal is to compare the root mem cgroup and leaf mem cgroups equally.  
That is specifically listed as a goal for the cgroup aware oom killer, and 
it's very obvious it's not implemented correctly, particularly because of 
this bias but also because the sum of oom_badness() != anon + unevictable 
+ unreclaimable slab, even discounting oom_score_adj.  The amount of slab 
is only considered for leaf mem cgroups as well.

What I've proposed in the past was to use the global state of anon, 
unevictable, and unreclaimable slab to fairly account the root mem cgroup 
without bias from oom_score_adj for comparing cgroup usage.  oom_score_adj 
is valid when choosing the process from the root mem cgroup to kill, not 
when comparing against other cgroups since leaf cgroups discount it.
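
To make the asymmetry concrete, the comparison being criticized has roughly 
this shape; the helpers below are placeholders, not the actual -mm 
functions:

	static long memcg_oom_evaluate(struct mem_cgroup *memcg)
	{
		long score = 0;

		if (memcg == root_mem_cgroup) {
			/* root: summed per-process badness, oom_score_adj included */
			struct task_struct *task;

			for_each_task_in_memcg(memcg, task)
				score += task_oom_badness(task);
		} else {
			/* leaf: memcg stats; oom_score_adj never enters the sum */
			score = memcg_anon_usage(memcg) +
				memcg_unevictable_usage(memcg) +
				memcg_unreclaimable_slab(memcg);
		}
		return score;
	}

A few root tasks with negative oom_score_adj can thus make the root look 
smaller than a leaf cgroup with far less actual usage, which is the bias 
described above.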

> I can see your arguments and they are true. You can construct setups
> where the current memcg oom heuristic works sub-optimally. The same has
> been the case for the OOM killer in general. The OOM killer has always
> been just a heuristic and there always be somebody complaining. This
> doesn't mean we should just remove it because it works reasonably well
> for most users.
> 

It's not most users; it's only configurations that are fully containerized, 
where there are no user processes attached to the root mem cgroup and 
nobody uses oom_score_adj as it is defined to be used.  And it's 
undocumented, so they don't even know that fact without looking at the 
kernel implementation.

> > Because of that, users are 
> > affected by the design decision and will organize their hierarchies as 
> > appropriate to avoid it.  Users who only want to use cgroups for a subset 
> > of processes but still treat those processes as indivisible logical units 
> > when attached to cgroups find that it is simply not possible.
> 
> Nobody enforces the memcg oom selection as presented here for those
> users. They have to explicitly _opt-in_. If the new heuristic doesn't
> work for them we will hear about that most likely. I am really skeptical
> that oom_score_adj can be reused for memcg aware oom selection.
> 

oom_score_adj is valid for choosing a process attached to a mem cgroup to 
kill, absent memory.oom_group being set.  It is obviously not valid for 
comparing cgroups.  That's why it shouldn't be used for the root mem 
cgroup either, which the current implementation does while being 
documented, falsely, as a fair comparison.

> I do not think anything you have proposed so far is even close to
> mergeable state. I think you are simply oversimplifying this. We have
> spent many months discussing different aspects of the memcg aware OOM
> killer. The result is a compromise that should work reasonably well
> for the targeted usecases and it doesn't bring unsustainable APIs that
> will get carved into stone.

If you don't have time to review the patchset to 

Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-01-26 Thread David Rientjes
On Thu, 25 Jan 2018, Andrew Morton wrote:

> > Now that each mem cgroup on the system has a memory.oom_policy tunable to
> > specify oom kill selection behavior, remove the needless "groupoom" mount
> > option that requires (1) the entire system to be forced, perhaps
> > unnecessarily, perhaps unexpectedly, into a single oom policy that
> > differs from the traditional per process selection, and (2) a remount to
> > change.
> > 
> > Instead of enabling the cgroup aware oom killer with the "groupoom" mount
> > option, set the mem cgroup subtree's memory.oom_policy to "cgroup".
> 
> Can we retain the groupoom mount option and use its setting to set the
> initial value of every memory.oom_policy?  That way the mount option
> remains somewhat useful and we're back-compatible?
> 

-ECONFUSED.  We want to have a mount option that has the sole purpose of 
doing echo cgroup > /mnt/cgroup/memory.oom_policy?

Please note that this patchset is not only about removing a mount option; 
it is to allow oom policies to be configured per subtree such that users 
to whom you delegate those subtrees cannot evade the oom policy that is 
set at a higher level.  The goal is to prevent the user from needing to 
organize their hierarchy in a specific way to work around this constraint, 
and to avoid resorting to things like limiting the number of child cgroups 
that user is allowed to create only to work around the oom policy.  With a 
cgroup v2 single hierarchy, it severely limits the amount of control the 
user has over their processes because they are locked into a very specific 
hierarchy configuration solely to not allow the user to evade oom kill.

This, and fixes to fairly compare the root mem cgroup with leaf mem 
cgroups, are essential before the feature is merged; otherwise it yields 
wildly unpredictable (and unexpected, since its interaction with 
oom_score_adj isn't documented) results, as I already demonstrated where 
cgroups with 1GB of usage are killed instead of 6GB workers outside of 
that subtree.


[patch -mm v2 0/3] mm, memcg: introduce oom policies

2018-01-25 Thread David Rientjes
There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:

 (1) allows users to evade the oom killer by creating subcontainers or
 using other controllers since scoring is done per cgroup and not
 hierarchically,

 (2) does not allow the user to influence the decisionmaking, such that
 important subtrees cannot be preferred or biased, and

 (3) unfairly compares the root mem cgroup using completely different
 criteria than leaf mem cgroups and allows wildly inaccurate results
 if oom_score_adj is used.

This patchset aims to fix (1) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.

It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.

It preserves memory.oom_group behavior.
---
 Applied on top of -mm.

 Documentation/cgroup-v2.txt | 64 -
 include/linux/cgroup-defs.h |  5 
 include/linux/memcontrol.h  | 21 +++
 kernel/cgroup/cgroup.c  | 13 +
 mm/memcontrol.c | 64 -
 5 files changed, 114 insertions(+), 53 deletions(-)


[patch -mm v2 3/3] mm, memcg: add hierarchical usage oom policy

2018-01-25 Thread David Rientjes
One of the three significant concerns brought up about the cgroup aware
oom killer is that its decisionmaking is completely evaded by creating
subcontainers and attaching processes such that the ancestor's usage does
not exceed another cgroup on the system.

In this regard, users who do not distribute their processes over a set of
subcontainers for mem cgroup control, statistics, or other controllers
are unfairly penalized.

This adds an oom policy, "tree", that accounts for hierarchical usage
when comparing cgroups, provided the cgroup aware oom killer is enabled by
an ancestor.  This allows administrators, for example, to require users in
their own top-level mem cgroup subtree to be accounted for with
hierarchical usage.  In other words, they can no longer evade the oom
killer by using other controllers or subcontainers.

Signed-off-by: David Rientjes <rient...@google.com>
---
 Documentation/cgroup-v2.txt | 12 ++--
 include/linux/memcontrol.h  |  5 +
 mm/memcontrol.c | 12 +---
 3 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1078,6 +1078,11 @@ PAGE_SIZE multiple when read back.
memory consumers; that is, they will compare mem cgroup usage rather
than process memory footprint.  See the "OOM Killer" section.
 
+   If "tree", the OOM killer will compare mem cgroups and its subtree
+   as indivisible memory consumers when selecting a hierarchy.  This
+   policy cannot be set on the root mem cgroup.  See the "OOM Killer"
+   section.
+
  memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1301,6 +1306,9 @@ There are currently two available oom policies:
subtree as an OOM victim and kill at least one process, depending on
memory.oom_group, from it.
 
+ - "tree": choose the cgroup with the largest memory footprint considering
+   itself and its subtree and kill at least one process.
+
 When selecting a cgroup as a victim, the OOM killer will kill the process
 with the largest memory footprint.  A user can control this behavior by
 enabling the per-cgroup memory.oom_group option.  If set, it causes the
@@ -1314,8 +1322,8 @@ Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
 If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and enabling oom_group
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+and enabling oom_group on an ancestor layer.
 
 
 IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -69,6 +69,11 @@ enum memcg_oom_policy {
 * mem cgroup as an indivisible consumer
 */
MEMCG_OOM_POLICY_CGROUP,
+   /*
+* Tree cgroup usage for all descendant memcg groups, treating each mem
+* cgroup and its subtree as an indivisible consumer
+*/
+   MEMCG_OOM_POLICY_TREE,
 };
 
 struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2728,7 +2728,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
 * The oom_score is calculated for leaf memory cgroups (including
 * the root memcg).
-* Non-leaf oom_group cgroups accumulating score of descendant
+* Cgroups with oom policy of "tree" accumulate the score of descendant
 * leaf memory cgroups.
 */
rcu_read_lock();
@@ -2737,10 +2737,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 
/*
 * We don't consider non-leaf non-oom_group memory cgroups
-* as OOM victims.
+* without the oom policy of "tree" as OOM victims.
 */
if (memcg_has_children(iter) && iter != root_mem_cgroup &&
-   !mem_cgroup_oom_group(iter))
+   !mem_cgroup_oom_group(iter) &&
+   iter->oom_policy != MEMCG_OOM_POLICY_TREE)
continue;
 
/*
@@ -5538,6 +5539,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
case MEMCG_OOM_POLICY_CGROUP:
seq_puts(m, "cgroup\n");
break;
+   case MEMCG_OOM_POLICY_TREE:
+   seq_puts(m, "tree\n");
+   break;
case MEMCG_OOM_POLICY_NONE:
default:
seq_puts(m, &q

[patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable

2018-01-25 Thread David Rientjes
The cgroup aware oom killer is needlessly declared for the entire system
by a mount option.  It's unnecessary to force the system into a single
oom policy: either cgroup aware, or the traditional process aware.

This patch introduces a memory.oom_policy tunable for all mem cgroups.
It is currently a no-op: it can only be set to "none", which is its
default policy.  It will be expanded in the next patch to define cgroup
aware oom killer behavior.

This is an extensible interface that can be used to define cgroup aware
assessment of mem cgroup subtrees or the traditional process aware
assessment.

Another benefit of such an approach is that an admin can lock in a
certain policy for the system or for a mem cgroup subtree and can
delegate the policy decision to the user to determine if the kill should
originate from a subcontainer, as indivisible memory consumers
themselves, or selection should be done per process.

Signed-off-by: David Rientjes <rient...@google.com>
---
 Documentation/cgroup-v2.txt |  9 +
 include/linux/memcontrol.h  | 11 +++
 mm/memcontrol.c | 35 +++
 3 files changed, 55 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1065,6 +1065,15 @@ PAGE_SIZE multiple when read back.
If cgroup-aware OOM killer is not enabled, ENOTSUPP error
is returned on attempt to access the file.
 
+  memory.oom_policy
+
+   A read-write single string file which exists on all cgroups.  The
+   default value is "none".
+
+   If "none", the OOM killer will use the default policy to choose a
+   victim; that is, it will choose the single process with the largest
+   memory footprint.
+
  memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -58,6 +58,14 @@ enum memcg_event_item {
MEMCG_NR_EVENTS,
 };
 
+enum memcg_oom_policy {
+   /*
+* No special oom policy, process selection is determined by
+* oom_badness()
+*/
+   MEMCG_OOM_POLICY_NONE,
+};
+
 struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -203,6 +211,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /* OOM policy for this subtree */
+   enum memcg_oom_policy oom_policy;
+
/*
 * Treat the sub-tree as an indivisible memory consumer,
 * kill all belonging tasks if the memory cgroup selected
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4417,6 +4417,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent) {
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
+   memcg->oom_policy = parent->oom_policy;
}
if (parent && parent->use_hierarchy) {
memcg->use_hierarchy = true;
@@ -5534,6 +5535,34 @@ static int memory_stat_show(struct seq_file *m, void *v)
return 0;
 }
 
+static int memory_oom_policy_show(struct seq_file *m, void *v)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+   enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
+
+   switch (policy) {
+   case MEMCG_OOM_POLICY_NONE:
+   default:
+   seq_puts(m, "none\n");
+   };
+   return 0;
+}
+
+static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
+  char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   ssize_t ret = nbytes;
+
+   buf = strstrip(buf);
+   if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
+   memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+   else
+   ret = -EINVAL;
+
+   return ret;
+}
+
 static struct cftype memory_files[] = {
{
.name = "current",
@@ -5575,6 +5604,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
+   {
+   .name = "oom_policy",
+   .flags = CFTYPE_NS_DELEGATABLE,
+   .seq_show = memory_oom_policy_show,
+   .write = memory_oom_policy_write,
+   },
{ } /* terminate */
 };
 


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-25 Thread David Rientjes
On Thu, 25 Jan 2018, Michal Hocko wrote:

> > As a result, this would remove patch 3/4 from the series.  Do you have any 
> > other feedback regarding the remainder of this patch series before I 
> > rebase it?
> 
> Yes, and I have provided it already. What you are proposing is
> incomplete at best and needs much better consideration and much more
> time to settle.
> 

Could you elaborate on why specifying the oom policy for the entire 
hierarchy as part of the root mem cgroup and also for individual subtrees 
is incomplete?  It allows admins to specify and delegate policy decisions 
to subtrees owners as appropriate.  It addresses your concern in the 
/admins and /students example.  It addresses my concern about evading the 
selection criteria simply by creating child cgroups.  It appears to be a 
win-win.  What is incomplete or are you concerned about?

> > I will address the unfair root mem cgroup vs leaf mem cgroup comparison in 
> > a separate patchset to fix an issue where any user of oom_score_adj on a 
> > system that is not fully containerized gets very unusual, unexpected, and 
> > undocumented results.
> 
> I will not oppose but as it has been mentioned several times, this is by
> no means a blocker issue. It can be added on top.
> 

The current implementation is only useful for fully containerized systems 
where no processes are attached to the root mem cgroup.  Anything in the 
root mem cgroup is judged by different criteria and if they use 
/proc/pid/oom_score_adj the entire heuristic breaks down.  That's because 
per-process usage and oom_score_adj are only relevant for the root mem 
cgroup and irrelevant when attached to a leaf.  Because of that, users are 
affected by the design decision and will organize their hierarchies as 
appropriate to avoid it.  Users who only want to use cgroups for a subset 
of processes but still treat those processes as indivisible logical units 
when attached to cgroups find that it is simply not possible.
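
To make the asymmetry concrete, the comparison today is roughly the
following (a sketch only, not the actual -mm code; task_in_root_memcg()
is a made-up helper):

	static unsigned long root_score(unsigned long totalpages)
	{
		struct task_struct *p;
		unsigned long points = 0;

		rcu_read_lock();
		for_each_process(p) {
			if (!task_in_root_memcg(p))	/* made-up helper */
				continue;
			/* per-process badness includes oom_score_adj */
			points += oom_badness(p, NULL, NULL, totalpages);
		}
		rcu_read_unlock();
		return points;
	}

	static unsigned long leaf_score(struct mem_cgroup *memcg)
	{
		/* charged usage only; oom_score_adj never enters the picture */
		return page_counter_read(&memcg->memory) +
		       page_counter_read(&memcg->swap);
	}

The two scores are then compared against each other as if they measured
the same thing, which is why tuning oom_score_adj skews the result for
root-attached processes only.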

I'm focused solely on fixing the three main issues that this 
implementation causes.  One of them, userspace influence to protect 
important cgroups, can be added on top.  The other two, evading the 
selection criteria and unfair comparison of root vs leaf, are shortcomings 
in the design that I believe should be addressed before it's merged to 
avoid changing the API later.  I'm in no rush to ask for the cgroup aware 
oom killer to be merged if it's incomplete and must be changed for 
usecases that are not highly specialized (fully containerized and no use 
of oom_score_adj for any process).  I am actively engaged in fixing it, 
however, so that it becomes a candidate for merge.  Your feedback is 
useful with regard to those fixes, but daily emails on how we must merge 
the current implementation now are not providing value, at least to me.


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-24 Thread David Rientjes
On Wed, 24 Jan 2018, Michal Hocko wrote:

> > The current implementation of memory.oom_group is based on top of a 
> > selection implementation that is broken in three ways I have listed for 
> > months:
> 
> This doesn't lead to anywhere. You are not presenting any new arguments
> and you are ignoring feedback you have received so far. We have tried
> really hard. Considering different _independent_ people presented more or
> less consistent view on these points I think you should deeply
> reconsider how you take that feedback.
> 

I've responded to each email providing useful feedback on this patchset.  
I agreed with Tejun about not embedding the oom mechanism into 
memory.oom_policy.  I was trying to avoid having two files in the mem 
cgroup v2 filesystem for oom policy and mechanism.  I agreed that 
delegating the mechanism to the workload would be useful in some cases.  
I've solicited feedback on any other opinions on how that can be done 
better, but it appears another tunable is the most convenient way of 
allowing this behavior to be specified.

As a result, this would remove patch 3/4 from the series.  Do you have any 
other feedback regarding the remainder of this patch series before I 
rebase it?

I will address the unfair root mem cgroup vs leaf mem cgroup comparison in 
a separate patchset to fix an issue where any user of oom_score_adj on a 
system that is not fully containerized gets very unusual, unexpected, and 
undocumented results.

Thanks.


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-23 Thread David Rientjes
On Tue, 23 Jan 2018, Michal Hocko wrote:

> > It can't, because the current patchset locks the system into a single 
> > selection criteria that is unnecessary and the mount option would become a 
> > no-op after the policy per subtree becomes configurable by the user as 
> > part of the hierarchy itself.
> 
> This is simply not true! OOM victim selection has changed in the
> past and will be always a subject to changes in future. Current
> implementation doesn't provide any externally controlable selection
> policy and therefore the default can be assumed. Whatever that default
> means now or in future. The only contract added here is the kill full
> memcg if selected and that can be implemented on _any_ selection policy.
> 

The current implementation of memory.oom_group is based on top of a 
selection implementation that is broken in three ways I have listed for 
months:

 - allows users to intentionally/unintentionally evade the oom killer,
   requires not locking the selection implementation for the entire
   system, requires subtree control to prevent, makes a mount option
   obsolete, and breaks existing users who would use the implementation
   based on 4.16 if this were merged,

 - unfairly compares the root mem cgroup vs leaf mem cgroup such that
   users must structure their hierarchy only for 4.16 in such a way
   that _all_ processes are under hierarchical control and have no
   power to create sub cgroups because of the point above and
   completely breaks any user of oom_score_adj in a completely
   undocumented and unspecified way, such that fixing that breakage
   would also break any existing users who would use the implementation
   based on 4.16 if this were merged, and

 - does not allow userspace to protect important cgroups, which can be
   built on top.

I'm focused on fixing the breakage in the first two points since it 
affects the API and we don't want to switch that out from the user.  I 
have brought these points up repeatedly and everybody else has actively 
disengaged from development, so I'm proposing incremental changes that 
make the cgroup aware oom killer have a sustainable API and isn't useful 
only for a highly specialized usecase where everything is containerized, 
nobody can create subcgroups, and nobody uses oom_score_adj to break the 
root mem cgroup accounting.


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-19 Thread David Rientjes
On Wed, 17 Jan 2018, David Rientjes wrote:

> > Yes, this is a valid point.  The "tree" and "all" policies are identical 
> > and only the mechanism differs wrt whether one process is killed or all 
> > eligible processes are killed, respectively.  My motivation 
> for this was to avoid having two different tunables, especially because 
> later we'll need a way for userspace to influence the decisionmaking to 
> protect (bias against) important subtrees.  What would really be nice is 
> cgroup.subtree_control-type behavior where we could effect a policy and a 
> mechanism at the same time.  It's not clear how that would be handled to 
> allow only one policy and one mechanism, however, in a clean way.  The 
> simplest for the user would be a new file, to specify the mechanism and 
> leave memory.oom_policy alone.  Would another file really be warranted?  
> Not sure.
> 

Hearing no response, I'll implement this as a separate tunable in a v2 
series assuming there are no better ideas proposed before next week.  One 
of the nice things about a separate tunable is that an admin can control 
the overall policy and they can delegate the mechanism (killall vs one 
process) to a user subtree.  I agree with your earlier point that killall 
vs one process is a property of the workload and is better defined 
separately.
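
Concretely, the separation would look something like this (illustrative
names only; the policy values match what this series proposes, while the
mechanism enum does not exist anywhere yet):

	enum memcg_oom_policy {
		MEMCG_OOM_POLICY_NONE,		/* largest single process */
		MEMCG_OOM_POLICY_CGROUP,	/* largest local cgroup */
		MEMCG_OOM_POLICY_TREE,		/* largest subtree */
	};

	/* written to a separate file, delegatable to the workload owner */
	enum memcg_oom_mechanism {
		MEMCG_OOM_KILL_PROCESS,		/* kill one victim process */
		MEMCG_OOM_KILL_ALL,		/* kill all eligible processes */
	};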

I'll also look to fix the breakage wrt root mem cgroup comparison with 
leaf mem cgroup comparison that is currently in -mm.


Re: [patch -mm 0/4] mm, memcg: introduce oom policies

2018-01-17 Thread David Rientjes
On Wed, 17 Jan 2018, Roman Gushchin wrote:

> You're introducing a new oom_policy knob, which has two separate sets
> of possible values for the root and non-root cgroups. I don't think
> it aligns with the existing cgroup v2 design.
> 

The root mem cgroup can use "none" or "cgroup" to either disable or 
enable, respectively, the cgroup aware oom killer.  These are available to 
non root mem cgroups as well, with the addition of hierarchical usage 
comparison to prevent the ability to completely evade the oom killer by 
the user by creating sub cgroups.  Just as we have files that are made 
available on the root cgroup and not others, I think it's fine to allow 
only certain policies on the root mem cgroup.  As I wrote to Tejun, I'm 
fine with the policy being separated from the mechanism.  That can be 
resolved in that email thread, but the overall point of this patchset is 
to allow hierarchical comparison on some subtrees and not on others.  We 
can avoid the mount option in the same manner.

> For the root cgroup it works exactly as mount option, and both "none"
> and "cgroup" values have no meaning outside of the root cgroup. We can
> discuss if a knob on root cgroup is better than a mount option, or not
> (I don't think so), but it has nothing to do with oom policy as you
> define it for non-root cgroups.
> 

It certainly does have value outside of the root cgroup: for system oom 
conditions it is possible to choose the largest process, largest single 
cgroup, or largest subtree to kill from.  For memcg oom conditions it's 
possible for a consumer in a subtree to define whether it wants the 
largest memory hogging process to be oom killed or the largest of its 
child sub cgroups.  This would be needed for a job scheduler in its own 
subtree, for example, that creates its own subtrees to limit jobs 
scheduled by individual users who have the power over their subtree but 
not their limit.  This is a real-world example.  Michal also provided his 
own example concerning top-level /admins and /students.  They want to use 
"cgroup" themselves and then /students/roman would be forced to "tree".

Keep in mind that this patchset alone makes the interface extensible and 
addresses one of my big three concerns, but the comparison of the root mem 
cgroup compared to other individual cgroups and cgroups maintaining a 
subtree needs to be fixed separately so that we don't completely evade all 
of these oom policies by using /proc/pid/oom_score_adj in the root mem 
cgroup.  That's a separate issue that needs to be addressed.  My last 
concern, about userspace influence, can probably be addressed after this 
is merged by yet another tunable since it's vital that important cgroups 
can be protected.  It does make sense for an important cgroup to be able 
to use 50% of memory without being oom killed, and that's impossible if 
cgroup v2 has been mounted with your mount option.


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-17 Thread David Rientjes
On Wed, 17 Jan 2018, Michal Hocko wrote:

> Absolutely agreed! And moreover, there are not all that many ways what
> to do as an action. You just kill a logical entity - be it a process or
> a logical group of processes. But you have way too many policies how
> to select that entity. Do you want to chose the youngest process/group
> because all the older ones have been computing real stuff and you would
> lose days of your cpu time? Or should those who pay more should be
> protected (aka give them static priorities), or you name it...
> 

That's an argument for making the interface extensible, yes.

> I am sorry, I still didn't grasp the full semantic of the proposed
> soluton but the mere fact it is starting by conflating selection and the
> action is a no go and a wrong API. This is why I've said that what you
> (David) outlined yesterday is probably going to suffer from a much
> longer discussion and most likely to be not acceptable. Your patchset
> proves me correct...

I'm very happy to change the API if there are better suggestions.  That 
may end up just being an memory.oom_policy file, as this implements, and 
separating out a new memory.oom_action that isn't a boolean value to 
either do a full group kill or only a single process.  Or it could be what 
I suggested in my mail to Tejun, such as "hierarchy killall" written to
memory.oom_policy, which would specify a single policy and then an 
optional mechanism.  With my proposed patchset, there would then be three 
policies: "none", "cgroup", and "tree" and one possible optional 
mechanism: "killall".


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-17 Thread David Rientjes
On Wed, 17 Jan 2018, Tejun Heo wrote:

> Hello, David.
> 

Hi Tejun!

> > The behavior of killing an entire indivisible memory consumer, enabled
> > by memory.oom_group, is an oom policy itself.  It specifies that all
> 
> I thought we discussed this before but maybe I'm misremembering.
> There are two parts to the OOM policy.  One is victim selection, the
> other is the action to take thereafter.
> 
> The two are different and conflating the two don't work too well.  For
> example, please consider what should be given to the delegatee when
> delegating a subtree, which often is a good excercise when designing
> these APIs.
> 
> When a given workload is selected for OOM kill (IOW, selected to free
> some memory), whether the workload can handle individual process kills
> or not is the property of the workload itself.  Some applications can
> safely handle some of its processes picked off and killed.  Most
> others can't and want to be handled as a single unit, which makes it a
> property of the workload.
> 

Yes, this is a valid point.  The "tree" and "all" policies are identical 
and only the mechanism differs wrt whether one process is killed or all 
eligible processes are killed, respectively.  My motivation 
for this was to avoid having two different tunables, especially because 
later we'll need a way for userspace to influence the decisionmaking to 
protect (bias against) important subtrees.  What would really be nice is 
cgroup.subtree_control-type behavior where we could effect a policy and a 
mechanism at the same time.  It's not clear how that would be handled to 
allow only one policy and one mechanism, however, in a clean way.  The 
simplest for the user would be a new file, to specify the mechanism and 
leave memory.oom_policy alone.  Would another file really be warranted?  
Not sure.

> That makes sense in the hierarchy too because whether one process or
> the whole workload is killed doesn't infringe upon the parent's
> authority over resources which in turn implies that there's nothing to
> worry about how the parent's groupoom setting should constrain the
> descendants.
> 
> OOM victim selection policy is a different beast.  As you've mentioned
> multiple times, especially if you're worrying about people abusing OOM
> policies by creating sub-cgroups and so on, the policy, first of all,
> shouldn't be delegatable and secondly should have meaningful
> hierarchical restrictions so that a policy that an ancestor chose
> can't be nullified by a descendant.
> 

The goal here was to require a policy of either "tree" or "all" that the 
user can't change.  They are allowed to have their own oom policies 
internal to their subtree, however, for oom conditions in that subtree 
alone.  However, if the common ancestor hits its limit, it is forced to 
either be "tree" or "all" and require hierarchical usage to be considered 
instead of localized usage.  Either "tree" or "all" is appropriate, and 
this may be why you brought up the point about separating them out, i.e. 
the policy can be demanded by the common ancestor but the actual mechanism 
that the oom killer uses, kill either a single process or the full cgroup, 
is left to the user depending on their workload.  That sounds reasonable 
and I can easily separate the two by introducing a new file, similar to 
memory.oom_group but in a more extensible way so that it is not simply a 
selection of either full cgroup kill or single process.

> I'm not necessarily against adding hierarchical victim selection
> policy tunables; however, I am skeptical whether static tunables on
> cgroup hierarchy (including selectable policies) can be made clean and
> versatile enough, especially because the resource hierarchy doesn't
> necessarily, or rather in most cases, match the OOM victim selection
> decision tree, but I'd be happy to be proven wrong.
> 

Right, and I think that giving users control over their subtrees is a 
powerful tool and one that can lead to very effective use of the cgroup v2 
hierarchy.  Being able to circumvent the oom selection by creating child 
cgroups is certainly something that can trivially be prevented.  The 
argument that users can already circumvent today's heuristic by splitting 
a large workload into several smaller processes doesn't mean we can't have 
"tree"-like comparisons between cgroups to address that same issue, since 
all processes charge to the tree they are attached to.
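
As a sketch of what a "tree"-like comparison means (an illustration only,
not the code from this series, but using iteration helpers that already
exist), a cgroup's score is its own usage plus that of its entire subtree,
so pushing processes down into subcontainers cannot change how the subtree
compares against its siblings:

	static unsigned long subtree_usage(struct mem_cgroup *root)
	{
		struct mem_cgroup *iter;
		unsigned long usage = 0;

		/* mem_cgroup_iter() visits root and every descendant */
		for (iter = mem_cgroup_iter(root, NULL, NULL); iter;
		     iter = mem_cgroup_iter(root, iter, NULL))
			usage += page_counter_read(&iter->memory) +
				 page_counter_read(&iter->swap);
		return usage;
	}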

I became convinced of this when I saw the real-world usecases that would 
use such a feature on cgroup v2: we want to have hierarchical usage for 
comparison when full subtrees are dedicated to individual consumers, for 
example, and local mem cgroup usage for comparison when using hierarchies 
for top-level /admins and /students cgroups for which Michal provided an 
example.  These can coexist on systems and it's clear that there needs to 
be a system-wide policy decision for the cgroup aware oom killer (the idea 
behind the current 

[patch -mm 2/4] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-01-16 Thread David Rientjes
Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount to
change.

Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".

Signed-off-by: David Rientjes <rient...@google.com>
---
 Documentation/cgroup-v2.txt | 43 +--
 include/linux/cgroup-defs.h |  5 -
 include/linux/memcontrol.h  |  5 +
 kernel/cgroup/cgroup.c  | 13 +
 mm/memcontrol.c | 17 -
 5 files changed, 35 insertions(+), 48 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1069,6 +1069,10 @@ PAGE_SIZE multiple when read back.
victim; that is, it will choose the single process with the largest
memory footprint.
 
+   If "cgroup", the OOM killer will compare mem cgroups as indivisible
+   memory consumers; that is, they will compare mem cgroup usage rather
+   than process memory footprint.  See the "OOM Killer" section.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1275,37 +1279,32 @@ belonging to the affected files to ensure correct memory ownership.
 OOM Killer
 ~~
 
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
 
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
+This policy is controlled by memory.oom_policy.  When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill.  Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
 
-  # mount -o remount,groupoom $MOUNT_POINT
+There are currently two available oom policies:
 
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
+ - "none": default, choose the largest single memory hogging process to
+   oom kill, as traditionally the OOM killer has always done.
 
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+   subtree as an OOM victim and kill at least one process, depending on
+   memory.oom_group, from it.
 
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint.  A user can control this behavior by
+enabling the per-cgroup memory.oom_group option.  If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
 
 The root cgroup is treated as a leaf memory cgroup, so it's compared
 with other leaf memory cgroups and cgroups with oom_group option set.
 
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
-
 Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 */
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
-   /*
-* Enable cgroup-aware OOM killer.
-*/
-   CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontr

[patch -mm 1/4] mm, memcg: introduce per-memcg oom policy tunable

2018-01-16 Thread David Rientjes
The cgroup aware oom killer is needlessly declared for the entire system
by a mount option.  It's unnecessary to force the system into a single
oom policy: either cgroup aware, or the traditional process aware.

This patch introduces a memory.oom_policy tunable for all mem cgroups.
It is currently a no-op: it can only be set to "none", which is its
default policy.  It will be expanded in the next patch to define cgroup
aware oom killer behavior.

This is an extensible interface that can be used to define cgroup aware
assessment of mem cgroup subtrees or the traditional process aware
assessment.

Another benefit of such an approach is that an admin can lock in a
certain policy for the system or for a mem cgroup subtree and can
delegate the policy decision to the user to determine if the kill should
originate from a subcontainer, as indivisible memory consumers
themselves, or selection should be done per process.

Signed-off-by: David Rientjes <rient...@google.com>
---
 Documentation/cgroup-v2.txt |  9 +
 include/linux/memcontrol.h  | 11 +++
 mm/memcontrol.c | 35 +++
 3 files changed, 55 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1060,6 +1060,15 @@ PAGE_SIZE multiple when read back.
If cgroup-aware OOM killer is not enabled, ENOTSUPP error
is returned on attempt to access the file.
 
+  memory.oom_policy
+
+   A read-write single string file which exists on all cgroups.  The
+   default value is "none".
+
+   If "none", the OOM killer will use the default policy to choose a
+   victim; that is, it will choose the single process with the largest
+   memory footprint.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -58,6 +58,14 @@ enum memcg_event_item {
MEMCG_NR_EVENTS,
 };
 
+enum memcg_oom_policy {
+   /*
+* No special oom policy, process selection is determined by
+* oom_badness()
+*/
+   MEMCG_OOM_POLICY_NONE,
+};
+
 struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -203,6 +211,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /* OOM policy for this subtree */
+   enum memcg_oom_policy oom_policy;
+
/*
 * Treat the sub-tree as an indivisible memory consumer,
 * kill all belonging tasks if the memory cgroup selected
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4417,6 +4417,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
if (parent) {
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
+   memcg->oom_policy = parent->oom_policy;
}
if (parent && parent->use_hierarchy) {
memcg->use_hierarchy = true;
@@ -5534,6 +5535,34 @@ static int memory_stat_show(struct seq_file *m, void *v)
return 0;
 }
 
+static int memory_oom_policy_show(struct seq_file *m, void *v)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+   enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
+
+   switch (policy) {
+   case MEMCG_OOM_POLICY_NONE:
+   default:
+   seq_puts(m, "none\n");
+   };
+   return 0;
+}
+
+static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
+  char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   ssize_t ret = nbytes;
+
+   buf = strstrip(buf);
+   if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
+   memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+   else
+   ret = -EINVAL;
+
+   return ret;
+}
+
 static struct cftype memory_files[] = {
{
.name = "current",
@@ -5575,6 +5604,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
+   {
+   .name = "oom_policy",
+   .flags = CFTYPE_NS_DELEGATABLE,
+   .seq_show = memory_oom_policy_show,
+   .write = memory_oom_policy_write,
+   },
{ } /* terminate */
 };
 


[patch -mm 0/4] mm, memcg: introduce oom policies

2018-01-16 Thread David Rientjes
There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:

 (1) allows users to evade the oom killer by creating subcontainers or
 using other controllers since scoring is done per cgroup and not
 hierarchically,

 (2) does not allow the user to influence the decisionmaking, such that
 important subtrees cannot be preferred or biased, and

 (3) unfairly compares the root mem cgroup using completely different
 criteria than leaf mem cgroups and allows wildly inaccurate results
 if oom_score_adj is used.

This patchset aims to fix (1) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.

It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.

It eliminates a pointless tunable, memory.oom_group, that unnecessarily
pollutes the mem cgroup v2 filesystem and is invalid when cgroup v2 is
mounted without the "groupoom" option.
---
 Applied on top of -mm.

 Documentation/cgroup-v2.txt |  87 -
 include/linux/cgroup-defs.h |   5 --
 include/linux/memcontrol.h  |  37 ++
 kernel/cgroup/cgroup.c  |  13 +
 mm/memcontrol.c | 116 +---
 mm/oom_kill.c   |   4 +-
 6 files changed, 139 insertions(+), 123 deletions(-)


[patch -mm 4/4] mm, memcg: add hierarchical usage oom policy

2018-01-16 Thread David Rientjes
One of the three significant concerns brought up about the cgroup aware
oom killer is that its decisionmaking is completely evaded by creating
subcontainers and attaching processes such that the ancestor's usage does
not exceed another cgroup on the system.

In this regard, users who do not distribute their processes over a set of
subcontainers for mem cgroup control, statistics, or other controllers
are unfairly penalized.

This adds an oom policy, "tree", that accounts for hierarchical usage
when comparing cgroups and the cgroup aware oom killer is enabled by an
ancestor.  This allows administrators, for example, to require users in
their own top-level mem cgroup subtree to be accounted for with
hierarchical usage.  In other words, they can no longer evade the oom killer
by using other controllers or subcontainers.

Signed-off-by: David Rientjes <rient...@google.com>
---
 Documentation/cgroup-v2.txt | 12 ++--
 include/linux/memcontrol.h  |  9 +++--
 mm/memcontrol.c | 23 +++
 3 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1048,6 +1048,11 @@ PAGE_SIZE multiple when read back.
memory consumers; that is, they will compare mem cgroup usage rather
than process memory footprint.  See the "OOM Killer" section.
 
+   If "tree", the OOM killer will compare mem cgroups and its subtree
+   as indivisible memory consumers when selecting a hierarchy.  This
+   policy cannot be set on the root mem cgroup.  See the "OOM Killer"
+   section.
+
If "all", the OOM killer will compare mem cgroups and its subtree
as indivisible memory consumers and kill all processes attached to
the mem cgroup and its subtree.  This policy cannot be set on the
@@ -1275,6 +1280,9 @@ There are currently three available oom policies:
  - "cgroup": choose the cgroup with the largest memory footprint from the
subtree as an OOM victim and kill at least one process.
 
+ - "tree": choose the cgroup with the largest memory footprint considering
+   itself and its subtree and kill at least one process.
+
  - "all": choose the cgroup with the largest memory footprint considering
itself and its subtree and kill all processes attached (cannot be set on
the root mem cgroup).
@@ -1292,8 +1300,8 @@ Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
 If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and setting a policy of "all"
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+or "all" on ancestor layer.
 
 
 IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -70,8 +70,13 @@ enum memcg_oom_policy {
 */
MEMCG_OOM_POLICY_CGROUP,
/*
-* Same as MEMCG_OOM_POLICY_CGROUP, but all eligible processes attached
-* to the cgroup and subtree should be oom killed
+* Tree cgroup usage for all descendant memcg groups, treating each mem
+* cgroup and its subtree as an indivisible consumer
+*/
+   MEMCG_OOM_POLICY_TREE,
+   /*
+* Same as MEMCG_OOM_POLICY_TREE, but all eligible processes are also
+* oom killed
 */
MEMCG_OOM_POLICY_ALL,
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2715,11 +2715,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
oc->chosen_points = 0;
 
/*
-* If OOM is memcg-wide, and the oom policy is "all", all processes
-* attached to the memcg and subtree should be killed.
-* So, we mark the memcg as a victim.
+* If OOM is memcg-wide, and the oom policy is "tree" or "all", this
+* is the selected memcg.
 */
-   if (oc->memcg && mem_cgroup_oom_policy_all(oc->memcg)) {
+   if (oc->memcg && (oc->memcg->oom_policy == MEMCG_OOM_POLICY_TREE ||
+ oc->memcg->oom_policy == MEMCG_OOM_POLICY_ALL)) {
oc->chosen_memcg = oc->memcg;
css_get(&oc->chosen_memcg->css);
return;
@@ -2728,8 +2728,8 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
/*
 * The oom_score is calculated for leaf memory cgroups (including
 * the root memcg).
-* Cgroups with oom policy of "all" accumulate th

[patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-16 Thread David Rientjes
The behavior of killing an entire indivisible memory consumer, enabled
by memory.oom_group, is an oom policy itself.  It specifies that all
usage should be accounted to an ancestor and, if selected by the cgroup
aware oom killer, all processes attached to it and its descendant mem
cgroups should be oom killed.

This is replaced by writing "all" to memory.oom_policy and allows for the
same functionality as the existing memory.oom_group without (1) polluting
the mem cgroup v2 filesystem unnecessarily and (2) being a no-op when the
"groupoom" mount option is not used (the cgroup aware policy is now enabled
by writing "cgroup" to the root mem cgroup's memory.oom_policy).

The "all" oom policy cannot be enabled on the root mem cgroup.

Signed-off-by: David Rientjes <rient...@google.com>
---
 Documentation/cgroup-v2.txt | 51 ++---
 include/linux/memcontrol.h  | 18 +++
 mm/memcontrol.c | 55 -
 mm/oom_kill.c   |  4 ++--
 4 files changed, 41 insertions(+), 87 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1035,31 +1035,6 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
-  memory.oom_group
-
-   A read-write single value file which exists on non-root
-   cgroups.  The default is "0".
-
-   If set, OOM killer will consider the memory cgroup as an
-   indivisible memory consumers and compare it with other memory
-   consumers by it's memory footprint.
-   If such memory cgroup is selected as an OOM victim, all
-   processes belonging to it or it's descendants will be killed.
-
-   This applies to system-wide OOM conditions and reaching
-   the hard memory limit of the cgroup and their ancestor.
-   If OOM condition happens in a descendant cgroup with it's own
-   memory limit, the memory cgroup can't be considered
-   as an OOM victim, and OOM killer will not kill all belonging
-   tasks.
-
-   Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
-   and will never kill the unkillable task, even if memory.oom_group
-   is set.
-
-   If cgroup-aware OOM killer is not enabled, ENOTSUPP error
-   is returned on attempt to access the file.
-
   memory.oom_policy
 
A read-write single string file which exists on all cgroups.  The
@@ -1073,6 +1048,11 @@ PAGE_SIZE multiple when read back.
memory consumers; that is, they will compare mem cgroup usage rather
than process memory footprint.  See the "OOM Killer" section.
 
+   If "all", the OOM killer will compare mem cgroups and its subtree
+   as indivisible memory consumers and kill all processes attached to
+   the mem cgroup and its subtree.  This policy cannot be set on the
+   root mem cgroup.  See the "OOM Killer" section.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1287,29 +1267,32 @@ out of memory, its memory.oom_policy will dictate how the OOM killer will
 select a process, or cgroup, to kill.  Likewise, when the system is OOM,
 the policy is dictated by the root mem cgroup.
 
-There are currently two available oom policies:
+There are currently three available oom policies:
 
  - "none": default, choose the largest single memory hogging process to
oom kill, as traditionally the OOM killer has always done.
 
  - "cgroup": choose the cgroup with the largest memory footprint from the
-   subtree as an OOM victim and kill at least one process, depending on
-   memory.oom_group, from it.
+   subtree as an OOM victim and kill at least one process.
+
+ - "all": choose the cgroup with the largest memory footprint considering
+   itself and its subtree and kill all processes attached (cannot be set on
+   the root mem cgroup).
 
 When selecting a cgroup as a victim, the OOM killer will kill the process
-with the largest memory footprint.  A user can control this behavior by
-enabling the per-cgroup memory.oom_group option.  If set, it causes the
-OOM killer to kill all processes attached to the cgroup, except processes
-with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+with the largest memory footprint, unless the policy is specified as "all".
+In that case, the OOM killer will kill all processes attached to the cgroup
+and its subtree, except processes with /proc/pid/oom_score_adj set to
+-1000 (oom disabled).
 
 The root cgroup is treated as a leaf memory cgroup, so it's compared
-with other leaf memory cgroups and cgroups with oom_group option set.
+with other leaf memory cgroups.
 
 Please, note tha

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-16 Thread David Rientjes
On Mon, 15 Jan 2018, Michal Hocko wrote:

> > No, this isn't how kernel features get introduced.  We don't design a new 
> > kernel feature with its own API for a highly specialized usecase and then 
> > claim we'll fix the problems later.  Users will work around the 
> > constraints of the new feature, if possible, and then we can end up 
> > breaking them later.  Or, we can pollute the mem cgroup v2 filesystem with 
> > even more tunables to cover up for mistakes in earlier designs.
> 
> This is a blatant misinterpretation of the proposed changes. I haven't
> heard _any_ single argument against the proposed user interface except
> for complaints for missing tunables. This is not how the kernel
> development works and should work. The usecase was clearly described and
> far from limited to a single workload or company.
>  

The complaint about the user interface is that it is not extensible, as my 
next line states.  This doesn't need to be opted into with a mount option 
locking the entire system into a single oom policy.  That, itself, is the 
result of a poor design.  What is needed is a way for users to define an 
oom policy that is generally useful, not something that is locked in for 
the whole system.  We don't need several different cgroup mount options 
only for mem cgroup oom policies.  We also don't need random 
memory.groupoom files being added to the mem cgroup v2 filesystem only for 
one or two particular policies and being no-ops otherwise.  It can easily 
be specified as part of the policy itself.  My suggestion adds two new 
files to the mem cgroup v2 filesystem and no mount option, and allows any 
policy to be added later that only uses these two files.  I see you've 
ignored all of that in this email, so perhaps reading it would be 
worthwhile so that you can provide constructive feedback.

> > The key point to all three of my objections: extensibility.
> 
> And it has been argued that further _features_ can be added on top. I am
> absolutely fed up discussing those things again and again without any
> progress. You simply keep _ignoring_ counter arguments and that is a
> major PITA to be honest with you. You are basically blocking a useful
> feature because it doesn't solve your particular workload. This is
> simply not acceptable in the kernel development.
> 

As the thread says, this has nothing to do with my own particular 
workload, it has to do with three obvious shortcomings in the design that 
the user has no control over.  We can't add features on top if the 
heuristic itself changes as a result of the proposal, it needs to be 
introduced in an extensible way so that additional changes can be made 
later, if necessary, while still working around the very obvious problems 
with this current implementation.  My suggestion is that we introduce a 
way to define the oom policy once so that we don't have to change it later 
and are left with needless mount options or mem cgroup v2 files that 
become no-ops with the suggested design.  I hope that you will read the 
proposal for that extensible interface and comment on it about any 
concerns that you have, because that feedback would generally be useful.

> > Both you and Michal have acknowledged blantently obvious shortcomings in 
> > the design.
> 
> What you call blatant shortcomings we do not see affecting any
> _existing_ workloads. If they turn out to be real issues then we can fix
> them without deprecating any user APIs added by this patchset.
> 

There are existing workloads that use mem cgroup subcontainers purely for 
tracking charging and vmscan stats, which results in this logic being 
evaded.  It's a real issue, and a perfectly acceptable usecase for mem 
cgroup.  It's a result of the entire oom policy either being opted into or 
opted out of for the entire system and impossible for the user to 
configure or avoid.  That can be done better by enabling the oom policy 
only for a subtree, as I've suggested, but you've ignored.  It would also 
render both the mount option and the additional file in the mem cgroup v2 
filesystem added by this patchset to be no-ops.


Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-16 Thread David Rientjes
On Mon, 15 Jan 2018, Johannes Weiner wrote:

> > It's quite trivial to allow the root mem cgroup to be compared exactly the 
> > same as another cgroup.  Please see 
> > https://marc.info/?l=linux-kernel&m=151579459920305.
> 
> This only says "that will be fixed" and doesn't address why I care.
> 

Because the design of the cgroup aware oom killer requires the entire 
system to be fully containerized to be useful and select the 
correct/anticipated victim.  If anything is left behind in the root mem 
cgroup, or on systems that use mem cgroups in ways that you do not, it 
returns wildly unpredictable and downright wrong results; it factors 
oom_score_adj into the selection criteria only for processes attached to 
the root mem cgroup and ignores it otherwise.  If we treated root and leaf 
cgroups as comparable, this problem wouldn't arise.

> > > This assumes you even need one. Right now, the OOM killer picks the
> > > biggest MM, so you can evade selection by forking your MM. This patch
> > > allows picking the biggest cgroup, so you can evade by forking groups.
> > 
> > It's quite trivial to prevent any cgroup from evading the oom killer by 
> > either forking their mm or attaching all their processes to subcontainers.  
> > Please see https://marc.info/?l=linux-kernel&m=151579459920305.
> 
> This doesn't address anything I wrote.
> 

It prevents both problems if you are attached to a mem cgroup subtree.  
You can't fork the mm and you can't fork groups to evade the selection 
logic.  If the root mem cgroup and leaf cgroups were truly comparable, it 
also prevents both problems regardless of which cgroup the processes 
attached to.

> > > It's not a new vector, and clearly nobody cares. This has never been
> > > brought up against the current design that I know of.
> > 
> > As cgroup v2 becomes more popular, people will organize their cgroup 
> > hierarchies for all controllers they need to use.  We do this today, for 
> > example, by attaching some individual consumers to child mem cgroups 
> > purely for the rich statistics and vmscan stats that mem cgroup provides 
> > without any limitation on those cgroups.
> 
> There is no connecting tissue between what I wrote and what you wrote.
> 

You're completely ignoring usecases other than your own highly specialized 
one.  You may attach every user process on the system to 
leaf cgroups and you may not have any users who have control over a 
subtree.  Other people do.  And this patchset, as implemented, returns 
very inaccurate results or allows users to intentionally or 
unintentionally evade the oom killer just because they want to use 
subcontainers.

Creating and attaching a subset of processes to either top-level 
containers or subcontainers for either limitation by mem cgroup or for 
statistics is a valid usecase, and one that is used in practice.  Your 
suggestion of avoiding that problem to work around the shortcomings of 
this patchset by limiting how many subcontainers can be created by the 
user is utterly ridiculous.

> We have a patch set that was developed, iterated and improved over 10+
> revisions, based on evaluating and weighing trade-offs. It's reached a
> state where the memcg maintainers are happy with it and don't have any
> concern about future extendabilty to cover more specialized setups.
> 

It's also obviously untested in those 10+ revisions since it uses 
oom_badness() for the root mem cgroup and not leaf mem cgroups which is 
why it breaks any system where user processes are attached to the root mem 
cgroup.  See my example where a 1GB worker attached to the root mem cgroup 
is preferred over a cgroup with 6GB workers.  It may have been tested with 
your own highly specialized usecase, but not with anything else, and the 
author obviously admits its shortcomings.

> You've had nine months to contribute and shape this patch series
> productively, and instead resorted to a cavalcade of polemics,
> evasion, and repetition of truisms and refuted points. A ten paragraph
> proposal of vague ideas at this point is simply not a valid argument.
> 

The patch series has gone through massive changes over the past nine 
months and I've brought up three very specific concerns with its current 
form that makes it non-extensible.  I know the patchset has very 
significant changes or rewrites in its history, but I'm only concerned 
with its present form because it's not something that can easily be 
extended later.  We don't need useless files polluting the cgroup v2 
filesystem for things that don't matter with other oom policies and we 
don't need the mount option, actually.

> You can send patches to replace or improve on Roman's code and make
> the case for them.
> 

I volunteered in the email thread where I proposed an alternative to 
replace it, I'm actively seeking any feedback on that proposal to address 
anybody else's concerns early on.  Your participation in that would be 
very useful.

Thanks.

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-14 Thread David Rientjes
On Sat, 13 Jan 2018, Johannes Weiner wrote:

> You don't have any control and no accounting of the stuff situated
> inside the root cgroup, so it doesn't make sense to leave anything in
> there while also using sophisticated containerization mechanisms like
> this group oom setting.
> 
> In fact, the laptop I'm writing this email on runs an unmodified
> mainstream Linux distribution. The only thing in the root cgroup are
> kernel threads.
> 
> The decisions are good enough for the rare cases you forget something
> in there and it explodes.
> 

It's quite trivial to allow the root mem cgroup to be compared exactly the 
same as another cgroup.  Please see 
https://marc.info/?l=linux-kernel&m=151579459920305.

> This assumes you even need one. Right now, the OOM killer picks the
> biggest MM, so you can evade selection by forking your MM. This patch
> allows picking the biggest cgroup, so you can evade by forking groups.
> 

It's quite trivial to prevent any cgroup from evading the oom killer by 
either forking their mm or attaching all their processes to subcontainers.  
Please see https://marc.info/?l=linux-kernel&m=151579459920305.

> It's not a new vector, and clearly nobody cares. This has never been
> brought up against the current design that I know of.
> 

As cgroup v2 becomes more popular, people will organize their cgroup 
hierarchies for all controllers they need to use.  We do this today, for 
example, by attaching some individual consumers to child mem cgroups 
purely for the rich statistics and vmscan stats that mem cgroup provides 
without any limitation on those cgroups.

> Note, however, that there actually *is* a way to guard against it: in
> cgroup2 there is a hierarchical limit you can configure for the number
> of cgroups that are allowed to be created in the subtree. See
> 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> 

Not allowing the user to create subcontainers to track statistics to paper 
over an obvious and acknowledged shortcoming in the design of the cgroup 
aware oom killer seems like a pretty nasty shortcoming itself.

> It could be useful, but we have no concensus on the desired
> semantics. And it's not clear why we couldn't add it later as long as
> the default settings of a new knob maintain the default behavior
> (which would have to be preserved anyway, since we rely on it).
>

The active proposal is 
https://marc.info/?l=linux-kernel&m=151579459920305, which describes an 
extendable interface and one that covers all the shortcomings of this 
patchset without polluting the mem cgroup filesystem.  The default oom 
policy in that proposal would be "none", i.e. we do what we do today, 
based on process usage.  You can configure that, without the mount option 
this patchset introduces for local or hierarchical cgroup targeting.
 
> > > > I proposed a solution in 
> > > > https://marc.info/?l=linux-kernel=150956897302725, which was never 
> > > > responded to, for all of these issues.  The idea is to do hierarchical 
> > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > > > comparing total usage at each level to select target cgroups.  Admins 
> > > > and 
> > > > users can use memory.oom_score_adj to influence that decisionmaking at 
> > > > each level.
> 
> We did respond repeatedly: this doesn't work for a lot of setups.
> 

We need to move this discussion to the active proposal at 
https://marc.info/?l=linux-kernel&m=151579459920305, because it does 
address your setup, so it's not good use of anyones time to further 
discuss simply memory.oom_score_adj.

Thanks.


Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-12 Thread David Rientjes
 where usage is not
   considered and only the priority of the subtree is compared.

With cgroup v2 semantics of no internal process constraint, this is 
extremely straightforward.  All of your oom evaluation function can be 
reused with a simple comparison based on the policy to score individual 
cgroups.  In the simplest policy, "priority", this is like a 10 line 
function, but extremely useful to userspace.
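
Roughly (an illustrative sketch only; the oom_priority field stands in for
whatever per-memcg value the proposed memory.oom_value file would store):

	static struct mem_cgroup *oom_select_by_priority(struct mem_cgroup *parent)
	{
		struct cgroup_subsys_state *css;
		struct mem_cgroup *chosen = NULL;

		rcu_read_lock();
		css_for_each_child(css, &parent->css) {
			struct mem_cgroup *iter = mem_cgroup_from_css(css);

			/* highest priority wins; usage is not considered */
			if (!chosen || iter->oom_priority > chosen->oom_priority)
				chosen = iter;
		}
		rcu_read_unlock();
		return chosen;
	}

Descend level by level with that comparison until a leaf or an indivisible
subtree is reached, then reuse the existing kill path.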

This allows users to have full power over the decisionmaking in every 
subtree wrt oom kill selection and doesn't regress or allow for any 
pitfalls of the current patchset.  The goal is not to have one single oom 
policy for the entire system, but define the policy that makes useful 
sense.  This is how an extensible feature is designed and does not require 
any further pollution of the mem cgroup filesystem.

If you have additional features such as groupoom, you can make 
memory.oom_policy comma delimited, just as vmpressure modes are comma 
delimited.  You would want to use "adj,groupoom".  We don't need another 
file that is pointless in other policy decisions.  We don't need a mount 
option to lock the entire system into a single methodology.
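
Parsing a comma delimited policy string is trivial; a sketch (the flag
names are illustrative, not existing kernel symbols):

	#define MEMCG_OOM_POLICY_ADJ	(1 << 0)	/* illustrative */
	#define MEMCG_OOM_GROUPOOM	(1 << 1)	/* illustrative */

	static int memcg_parse_oom_policy(char *buf, unsigned long *flags)
	{
		char *token;

		*flags = 0;
		/* accept comma delimited tokens, e.g. "adj,groupoom" */
		while ((token = strsep(&buf, ",")) != NULL) {
			if (!strcmp(token, "adj"))
				*flags |= MEMCG_OOM_POLICY_ADJ;
			else if (!strcmp(token, "groupoom"))
				*flags |= MEMCG_OOM_GROUPOOM;
			else
				return -EINVAL;
		}
		return 0;
	}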

Cgroup v2 is a very clean interface and I think it's the responsibility of 
every controller to maintain that.  We should not fall into a cgroup v1 
mentality which became very difficult to make extensible.  Let's make a 
feature that is generally useful, complete, and empowers the user rather 
than push them into a corner with a system wide policy with obvious 
downsides.

For these reasons, and the three concerns that I enumerated earlier which 
have been acknowledged of obvious shortcoming with this approach:

Nacked-by: David Rientjes <rient...@google.com>

I'll be happy to implement the core functionality that allows oom policies 
to be written by the user and introduce memory.oom_value, and then rework 
the logic defined in your patchset as "adj" by giving the user an optional 
way of perferring or biasing that usage in a way that is clearly 
documented and extended.  Root mem cgroup usage is obviously wrong in this 
current patchset since it uses oom_score_adj whereas leaf cgroups do not, 
so that will be fixed.  But I'll ask that the current patchset is removed 
from -mm since it has obvious downsides, pollutes the mem cgroup v2 
filesystem, is not extensible, is not documented wrt oom_score_adj, 
allows evasion of the heuristic, and doesn't allow the user to have any 
power in the important decision of which of their important processes is 
oom killed such that this feature is not useful outside very specialized 
usecases.


Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-11 Thread David Rientjes
On Thu, 11 Jan 2018, Michal Hocko wrote:

> > > I find this problem quite minor, because I haven't seen any practical 
> > > problems
> > > caused by accounting of the root cgroup memory.
> > > If it's a serious problem for you, it can be solved without switching to 
> > > the
> > > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > > substract them from global values. So, it can be a relatively small 
> > > enhancement
> > > on top of the current mm tree. This has nothing to do with global victim 
> > > selection
> > > approach.
> > 
> > It sounds like a significant shortcoming to me - the oom-killing
> > decisions which David describes are clearly incorrect?
> 
> Well, I would rather look at that from the use case POV. The primary
> user of the new OOM killer functionality are containers. I might be
> wrong but I _assume_ that root cgroup will only contain the basic system
> infrastructure there and all the workload will run containers (aka
> cgroups). The only oom tuning inside the root cgroup would be to disable
> oom for some of those processes. The current implementation should work
> reasonably well for that configuration.
>  

It's a decision made at the system level when cgroup2 is mounted, and it 
affects all processes that get attached to a leaf cgroup, regardless of 
whether the memory controller is enabled for that subtree.  In other 
words, mounting cgroup2 with the cgroup aware oom killer mount option 
immediately:

 - makes /proc/pid/oom_score_adj effective for all processes attached
   to the root cgroup only, and

 - makes /proc/pid/oom_score_adj a no-op for all processes attached to a
   non-root cgroup.

Note that there may be no active usage of any memory controller at the 
time of oom, yet this tunable inconsistency still exists for any process.
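
A minimal illustration of the inconsistency (the mount option name is taken 
from the patchset under discussion, and the pids and paths are 
placeholders):

    # mount -t cgroup2 -o groupoom none /sys/fs/cgroup
    echo 1000 > /proc/$ROOT_PID/oom_score_adj  # honored: task stays in the root cgroup
    mkdir /sys/fs/cgroup/job
    echo $JOB_PID > /sys/fs/cgroup/job/cgroup.procs
    echo 1000 > /proc/$JOB_PID/oom_score_adj   # a no-op: leaf cgroups are scored on usage alone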

The initial response is correct: it clearly produces incorrect oom killing 
decisions.  This implementation detail requires that the entire system is 
either not containerized at all (no processes attached to any leaf cgroup) 
or fully containerized (all processes attached to leaf cgroups).  It's a 
highly specialized usecase with very limited scope and is fraught with 
pitfalls if any oom_score_adj is tuned, because whether the tunable is 
effective strictly depends on the cgroup to which those processes are 
attached at any given time.

> > > We've discussed this a lot.
> > > Hierarchical approach has their own issues, which we've discussed during
> > > previous iterations of the patchset. If you know how to address them
> > > (I've no idea), please, go on and suggest your version.
> > 
> > Well, if a hierarchical approach isn't a workable fix for the problem
> > which David has identified then what *is* the fix?
> 
> Hierarchical approach basically hardcodes the oom decision into the
> hierarchy structure and that is simply a no go because that would turn
> into a massive configuration PITA (more on that below). I consider the above
> example rather artificial to be honest. Anyway, if we _really_ have to
> address it in the future we can do that by providing a mechanism to
> prioritize cgroups. It seems that this is required for some usecases
> anyway.
>  

I'll address the hierarchical accounting suggestion below.

The example is not artificial, it's not theoretical, it's a real-world 
example.  Users can attach processes to subcontainers purely for memory 
stats from a mem cgroup point of view without limiting usage, or to 
subcontainers for use with controllers other than the memory 
controller.  We do this all the time: it's helpful to assign groups of 
processes to subcontainers simply to track statistics while the actual 
limitation is enforced by an ancestor.

So without hierarchical accounting, we can extend the above restriction, 
that a system is either fully containerized or not containerized at all, 
by saying that a fully containerized system must not use subcontainers, 
since doing so distributes usage and evades the cgroup aware oom killer 
heuristic.  In other words, using this feature requires:

 - the entire system is not containerized at all, or

 - the entire system is fully containerized and no cgroup (any controller,
   not just memory) uses subcontainers to intentionally/unintentionally
   distribute usage to evade this heuristic.

Of course the second restriction severely limits the flexibility that 
cgroup v2 introduces as a whole, purely as a caveat of an implementation 
detail of the memory cgroup aware oom killer.  Why not simply avoid using 
the cgroup aware oom killer?  It's not so easy, since it's a property of 
the machine itself: users probably have no control over it themselves and, 
in the worst case, can trivially evade ever being oom killed if it is 
used.

> > > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > > sure we want to replicate it on cgroup level as is. If you have an idea 
> > > of how
> > > it should look like, please, propose a patch; otherwise 

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-10 Thread David Rientjes
On Wed, 10 Jan 2018, Roman Gushchin wrote:

> > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > 
> > The patchset uses two different heuristics to compare root and leaf mem 
> > cgroups and scores them based on number of pages.  For the root mem 
> > cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
> > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> > unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> > different independent of /proc/pid/oom_score_adj, but the most obvious 
> > unfairness comes from users who tune oom_score_adj.
> > 
> > An example: start a process that faults 1GB of anonymous memory and leave 
> > it attached to the root mem cgroup.  Start six more processes that each 
> > fault 1GB of anonymous memory and attached them to a leaf mem cgroup.  Set 
> > all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> > will always kill the 1GB process attached to the root mem cgroup.  It's 
> > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> > evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> > it.
> > 
> > In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> > pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> > 12,652,907, eight times larger even though its usage is six times smaller.
> > 
> > This is caused by the patchset disregarding oom_score_adj entirely for 
> > leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> > the complete opposite result of what the cgroup aware oom killer 
> > advertises.
> > 
> > It also works the other way, if a large memory hog is attached to the root 
> > mem cgroup but has a negative oom_score_adj it is never killed and random 
> > processes are nuked solely because they happened to be attached to a leaf 
> > mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> > so I can't presume that it is either known nor tested.
> > 
> > Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> > the same criteria by doing hierarchical accounting of usage and 
> > subtracting from total system usage to find root usage.
> 
> I find this problem quite minor, because I haven't seen any practical problems
> caused by accounting of the root cgroup memory.
> If it's a serious problem for you, it can be solved without switching to the
> hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> substract them from global values. So, it can be a relatively small 
> enhancement
> on top of the current mm tree. This has nothing to do with global victim 
> selection
> approach.
> 

It's not surprising that this problem is considered minor, since it was 
quite obviously never tested when developing the series.  
/proc/pid/oom_score_adj being effective when the process is attached to 
the root mem cgroup and not when attached to a leaf mem cgroup is a major 
problem, and one that was never documented or explained.  I'm somewhat 
stunned this is considered minor.

Allow me to put a stake in the ground: admins and users need to be able to 
influence oom kill decisions.  Killing a process matters.  The ability to 
protect important processes matters, such as processes that are really 
supposed to use 50% of a system's memory.  As implemented, there is no 
control over this.

If a process's oom_score_adj = 1000 (always kill it), this always happens 
when attached to the root mem cgroup.  If attached to a leaf mem cgroup, 
it never happens unless that individual mem cgroup is the next largest 
single consumer of memory.  It makes it impossible for a process to be the 
preferred oom victim if it is ever attached to any non-root mem cgroup.  
Where it is attached can be outside the control of the application itself.

Perhaps worse, an important process with oom_score_adj = -999 (try to kill 
everything before it) is always oom killed first if attached to a leaf mem 
cgroup with majority usage.

> > 2. Evading the oom killer by attaching processes to child cgroups
> > 
> > Any cgroup on the system can attach all their processes to individual 
> > child cgroups.  This is functionally the same as doing
> > 
> > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; 
> > done
> > 
> > without the no internal process constraint introduced with cgroup v2.  All 
> > child cgroups are evaluated based on their own usage: all anon, 
> > unevictable, and unreclaimable slab as described previously.  It requires 
> > an individual cgroup to be the single largest consumer to be targeted by 
> > the oom killer.
> > 
> > An example: allow users to manage two different mem cgroup hierarchies 
> > limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
> > memory in their respective hierarchies.  On 

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-09 Thread David Rientjes
On Thu, 30 Nov 2017, Andrew Morton wrote:

> > This patchset makes the OOM killer cgroup-aware.
> 
> Thanks, I'll grab these.
> 
> There has been controversy over this patchset, to say the least.  I
> can't say that I followed it closely!  Could those who still have
> reservations please summarise their concerns and hopefully suggest a
> way forward?
> 

Yes, I'll summarize what my concerns have been in the past and what they 
are wrt the patchset as it stands in -mm.  None of them originate from my 
current usecase or anticipated future usecase of the oom killer for 
system-wide or memcg-constrained oom conditions.  They are based purely on 
the patchset's use of an incomplete and unfair heuristic for deciding 
which cgroup to target.

I'll also suggest simple changes to the patchset, which I have in the 
past, that can be made to address all of these concerns.

1. The unfair comparison of the root mem cgroup vs leaf mem cgroups

The patchset uses two different heuristics to compare root and leaf mem 
cgroups and scores them based on number of pages.  For the root mem 
cgroup, it totals the /proc/pid/oom_score of all processes attached: 
that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
different independent of /proc/pid/oom_score_adj, but the most obvious 
unfairness comes from users who tune oom_score_adj.

An example: start a process that faults 1GB of anonymous memory and leave 
it attached to the root mem cgroup.  Start six more processes that each 
fault 1GB of anonymous memory and attach them to a leaf mem cgroup.  Set 
all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
will always kill the 1GB process attached to the root mem cgroup.  It's 
because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
it.

In this example, the leaf mem cgroup's score is 1,573,044, the number of 
pages for the 6GB of faulted memory.  The root mem cgroup's score is 
12,652,907, eight times larger even though its usage is six times smaller.

This is caused by the patchset disregarding oom_score_adj entirely for 
leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
the complete opposite result of what the cgroup aware oom killer 
advertises.
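
This is trivial to reproduce; a rough sketch, assuming a memhog-style 
helper that simply faults the requested amount of anonymous memory (the 
helper and paths are placeholders):

    # one 1GB consumer left attached to the root mem cgroup:
    ./memhog 1G &
    echo 1000 > /proc/$!/oom_score_adj
    # six 1GB consumers attached to a leaf mem cgroup:
    mkdir /sys/fs/cgroup/leaf
    for i in 1 2 3 4 5 6; do
        ./memhog 1G &
        echo $! > /sys/fs/cgroup/leaf/cgroup.procs
        echo 1000 > /proc/$!/oom_score_adj
    done
    # on system oom, the single root-attached task is always selected: its
    # oom_score_adj is honored, while the leaf's 6GB is scored on usage alone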

It also works the other way: if a large memory hog is attached to the root 
mem cgroup but has a negative oom_score_adj, it is never killed and random 
processes are nuked solely because they happened to be attached to a leaf 
mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
so I can't presume that it is either known or tested.

Solution: compare the root mem cgroup and leaf mem cgroups equally with 
the same criteria by doing hierarchical accounting of usage and 
subtracting from total system usage to find root usage.

2. Evading the oom killer by attaching processes to child cgroups

Any cgroup on the system can attach all their processes to individual 
child cgroups.  This is functionally the same as doing

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; 
done

without the no internal process constraint introduced with cgroup v2.  All 
child cgroups are evaluated based on their own usage: all anon, 
unevictable, and unreclaimable slab as described previously.  It requires 
an individual cgroup to be the single largest consumer to be targeted by 
the oom killer.

An example: allow users to manage two different mem cgroup hierarchies 
limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
memory in their respective hierarchies.  On a system oom condition, we'd 
expect at least one process from user B's hierarchy would always be oom 
killed with the cgroup aware oom killer.  In fact, the changelog 
explicitly states it solves an issue where "1) There is no fairness 
between containers. A small container with few large processes will be 
chosen over a large one with huge number of small processes."

The opposite becomes true, however, if user B creates child cgroups and 
distributes its processes such that each child cgroup's usage never 
exceeds 10GB of memory.  This can either be done intentionally to 
purposefully have a low cgroup memory footprint to evade the oom killer or 
unintentionally with cgroup v2 to allow those individual processes to be 
constrained by other cgroups in a single hierarchy model.  User A, using 
10% of his memory limit, is always oom killed instead of user B, using 90% 
of his memory limit.
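
A sketch of how user B's subtree could end up looking (the paths and the 
exact split are only illustrative):

    echo 100G > /sys/fs/cgroup/userB/memory.max
    # spread the 90GB workload so that no single child exceeds ~10GB:
    for i in 1 2 3 4 5 6 7 8 9; do
        mkdir /sys/fs/cgroup/userB/shard$i
        # ... attach ~10GB worth of user B's processes to shard$i/cgroup.procs ...
    done
    # every shard now scores at or below user A's single 10GB cgroup, so user A
    # is selected on system oom despite using a tenth of its limit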

Others have commented that it's still possible to do this with a 
per-process model if users split their processes into many subprocesses 
with small memory footprints.

Solution: comparing cgroups must be done hierarchically.  Neither user A 
nor user B can evade the oom killer because 

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-11-01 Thread David Rientjes
On Wed, 1 Nov 2017, Michal Hocko wrote:

> > memory.oom_score_adj would never need to be permanently tuned, just as 
> > /proc/pid/oom_score_adj need never be permanently tuned.  My response was 
> > an answer to Roman's concern that "v8 has it's own limitations," but I 
> > haven't seen a concrete example where the oom killer is forced to kill 
> > from the non-preferred cgroup while the user has power of biasing against 
> > certain cgroups with memory.oom_score_adj.  Do you have such a concrete 
> > example that we can work with?
> 
> Yes, the one with structural requirements due to other controllers or
> due to general organizational purposes where hierarchical (sibling
> oriented) comparison just doesn't work.

You mean where an admin or user does

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; 
done

to place constraints on processes with other controllers, but unknowingly 
and completely circumvents oom kill selection?  That's one of my concerns 
that hasn't been addressed, and I believe it can only be addressed with 
hierarchical accounting.

> Take the students, teachers,
> admins example. You definitely do not want to kill from students
> subgroups by default just because this is the largest entity type.
> Tuning memory.oom_score_adj doesn't work for that usecase as soon as
> new subgroups come and go.
> 

With hierarchical accounting, that solves all three concerns that I have 
enumerated: the top-level organizer knows the limits imposed.  It is 
therefore *trivial* to prefer /students by biasing against it with 
memory.oom_score_adj over other top-level mem cgroups and still kill 
from /students if it goes over a certain threshold of memory.  With 
hierarchical accounting and memory.oom_score_adj, it could even be used 
by userspace to determine which student to kill from.  If /admins is 
using more memory than expected, it gets biased against with the same 
memory.oom_score_adj.  The point is that the top-level organizer that is 
structuring the mem cgroup tree knows the limits imposed and the 
preference of /admins over /students, or vice versa.  It also doesn't 
allow /students to circumvent victim selection by creating child mem 
cgroups.
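
As a sketch, with memory.oom_score_adj being the per-cgroup knob proposed 
here (it is not an existing kernel file, and the values are illustrative):

    echo -500 > /sys/fs/cgroup/admins/memory.oom_score_adj    # bias against killing
    echo  500 > /sys/fs/cgroup/students/memory.oom_score_adj  # prefer killing
    # with hierarchical accounting, /students is scored on the cumulative usage
    # of its entire subtree, so creating /students/alice, /students/bob, ...
    # cannot lower its score or dodge the bias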

If this is missing your point, please draw the hierarchy you are 
suggesting and show which mem cgroup is preferred by the admin and does 
not allow the user to circumvent that priority.

> > We simply cannot determine if improvements can be implemented in the 
> > future without user-visible changes if those improvements are unknown or 
> > undecided at this time.
> 
> Come on. There have been at least two examples on how this could be
> achieved. One priority based which would use cumulative memory
> consumption if set on intermediate nodes which would allow you to
> compare siblings. And another one was to add a new knob which would make
> an intermediate node an aggregate for accounting purposes.
> 

We don't need a memory.oom_group, memory.oom_hierarchical_accounting, 
memory.oom_priority, and memory.oom_score_adj.  I believe 
memory.oom_score_adj suffices.  I don't think it is in our best interest 
or the users best interest to maintain many different combinations of how 
an oom victim is selected.  I believe all the power needed is by providing 
a memory.oom_score_adj since cgroup memory usage is being considered, just 
as /proc/pid/oom_score_adj exists when process memory usage is being 
considered.  It seems very intuitive.

In the interest of not polluting the namespace, not allowing users to 
circumvent the decisions of the oom killer, and allowing userspace to 
influence oom victim selection, we need this to be implemented now rather 
than later, since it may affect the heuristic under consideration.  We 
should have a complete patchset rather than be backed into a corner by 
decisions made earlier with the rationale of "it can be figured out later, 
let's merge this now."

> And I am pretty sure we have already agreed that something like this is
> useful for some usecases and nobody objected this would get merged in
> future. All we are saying now is that this is not in scope of _this_
> patchseries because the vast majority of usecases simply do not care
> about influencing the oom selection. They only do care about having per
> cgroup behavior and/or kill all semantic. I really do not understand
> what is hard about that.
> 

I honestly do not see what hurry we're in or what is so hard to understand 
about implementing the ability of userspace to influence the 
decisionmaking in a way that works well together with the heuristic being 
introduced.

We can stop wasting time arguing about whether its appropriate to merge an 
incomplete patchset or not and actually fix the three fundamental flaws 
that have been outlined with this approach, or I can fork the patchset and 
introduce it myself as proposed if that is preferred.

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-31 Thread David Rientjes
On Tue, 31 Oct 2017, Michal Hocko wrote:

> > I'm not ignoring them, I have stated that we need the ability to protect 
> > important cgroups on the system without oom disabling all attached 
> > processes.  If that is implemented as a memory.oom_score_adj with the same 
> > semantics as /proc/pid/oom_score_adj, i.e. a proportion of available 
> > memory (the limit), it can also address the issues pointed out with the 
> > hierarchical approach in v8.
> 
> No it cannot and it would be a terrible interface to have as well. You
> do not want to permanently tune oom_score_adj to compensate for
> structural restrictions on the hierarchy.
> 

memory.oom_score_adj would never need to be permanently tuned, just as 
/proc/pid/oom_score_adj need never be permanently tuned.  My response was 
an answer to Roman's concern that "v8 has it's own limitations," but I 
haven't seen a concrete example where the oom killer is forced to kill 
from the non-preferred cgroup while the user has power of biasing against 
certain cgroups with memory.oom_score_adj.  Do you have such a concrete 
example that we can work with?

> I believe, and Roman has pointed that out as well already, that further
> improvements can be implemented without changing user visible behavior
> as and add-on. If you disagree then you better come with a solid proof
> that all of us wrong and reasonable semantic cannot be achieved that
> way.

We simply cannot determine if improvements can be implemented in the 
future without user-visible changes if those improvements are unknown or 
undecided at this time.  It may require hierarchical accounting when 
making a choice between siblings, as suggested with oom_score_adj.  The 
only thing that we need to agree on is that userspace needs to have some 
kind of influence over victim selection: the oom killer killing an 
important user process is an extremely sensitive thing.  If the patchset 
lacks the ability to have that influence, and such an ability would impact 
the heuristic overall, it's better to introduce that together as a 
complete patchset rather than merging an incomplete feature when it's 
known the user needs some control, asking the user to work around it by 
setting all processes to oom disabled in a preferred mem cgroup, and then 
changing the heuristic again.


Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-30 Thread David Rientjes
On Fri, 27 Oct 2017, Roman Gushchin wrote:

> The thing is that the hierarchical approach (as in v8), which are you pushing,
> has it's own limitations, which we've discussed in details earlier. There are
> reasons why v12 is different, and we can't really simple go back. I mean if
> there are better ideas how to resolve concerns raised in discussions around 
> v8,
> let me know, but ignoring them is not an option.
> 

I'm not ignoring them; I have stated that we need the ability to protect 
important cgroups on the system without oom disabling all attached 
processes.  If that is implemented as a memory.oom_score_adj with the same 
semantics as /proc/pid/oom_score_adj, i.e. a proportion of available 
memory (the limit), it can also address the issues pointed out with the 
hierarchical approach in v8.  If this is not the case, could you elaborate 
on what your exact concern is and why we do not care that users can 
completely circumvent victim selection by creating child cgroups for other 
controllers?

Since the ability to protect important cgroups on the system may require a 
heuristic change, I think it should be solved now rather than constantly 
insisting that we can make this patchset complete later while in the 
meantime forcing the user to set all attached processes to be oom 
disabled.


Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-26 Thread David Rientjes
On Thu, 26 Oct 2017, Johannes Weiner wrote:

> > The nack is for three reasons:
> > 
> >  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> >  cgroup from oom kill in system oom conditions,
> > 
> >  (2) the ability of users to completely evade the oom killer by attaching
> >  all processes to child cgroups either purposefully or unpurposefully,
> >  and
> > 
> >  (3) the inability of userspace to effectively control oom victim  
> >  selection.
> 
> My apologies if my summary was too reductionist.
> 
> That being said, the arguments you repeat here have come up in
> previous threads and been responded to. This doesn't change my
> conclusion that your NAK is bogus.
> 

They actually haven't been responded to; Roman was working through v11 and 
made a change to how the root mem cgroup usage was calculated that was 
better than previous iterations but still not an apples-to-apples 
comparison with other cgroups.  The problem is that the calculation for 
leaf cgroups includes additional memory classes, so it biases against 
processes that are moved to non-root mem cgroups.  Simply creating mem 
cgroups and attaching processes should not independently cause them to 
become more preferred: it should be a fair comparison between the root mem 
cgroup and the set of leaf mem cgroups as implemented.  That is very 
trivial to do with hierarchical oom cgroup scoring.

Since the ability of userspace to control oom victim selection is not 
addressed whatsoever by this patchset, and the suggested method cannot be 
implemented on top of this patchset as you have argued because it requires 
a change to the heuristic itself, the patchset needs to become complete 
before being mergeable.


Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-25 Thread David Rientjes
On Mon, 23 Oct 2017, Michal Hocko wrote:

> On Sun 22-10-17 17:24:51, David Rientjes wrote:
> > On Thu, 19 Oct 2017, Johannes Weiner wrote:
> > 
> > > David would have really liked for this patchset to include knobs to
> > > influence how the algorithm picks cgroup victims. The rest of us
> > > agreed that this is beyond the scope of these patches, that the
> > > patches don't need it to be useful, and that there is nothing
> > > preventing anyone from adding configurability later on. David
> > > subsequently nacked the series as he considers it incomplete. Neither
> > > Michal nor I see technical merit in David's nack.
> > > 
> > 
> > The nack is for three reasons:
> > 
> >  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> >  cgroup from oom kill in system oom conditions,
> 
> Most users who are going to use this feature right now will have
> most of the userspace in their containers rather than in the root
> memcg. The root memcg will always be special and as such there will
> never be a universal best way to handle it. We should to satisfy most of
> usecases. I would consider this something that is an open for a further
> discussion but nothing that should stand in the way.
>  
> >  (2) the ability of users to completely evade the oom killer by attaching
> >  all processes to child cgroups either purposefully or unpurposefully,
> >  and
> 
> This doesn't differ from the current state where a task can purposefully
> or unpurposefully hide itself from the global memory killer by spawning
> new processes.
>  

It cannot hide from the global oom killer if this patchset is used because 
it cannot hide its memory usage beneath cgroup levels.  This comment is in 
support of accounting memory usage up the hierarchy.

> >  (3) the inability of userspace to effectively control oom victim  
> >  selection.
> 
> this is not requested by the current usecase and it has been pointed out
> that this will be possible to implement on top of the foundation of this
> patchset.
> 

There's no reason to not present a complete patchset.  Userspace needs the 
ability to bias or prefer processes (or cgroups, in this case).  That's 
been the case with oom_adj in the past and oom_score_adj with the 
rewritten heuristic.  It's trivial to implement, and the only pending 
suggestion for providing this influence involves a slightly different 
scoring mechanism than this patchset; it goes back to accounting memory up 
the 
hierarchy as Roman initially implemented and then biasing between cgroups 
based on an oom_score_adj.  So the proposed influence mechanism cannot be 
implemented on top of this patchset as is, and that gives more reason why 
we cannot merge incomplete patches that can't be extended in the future.


Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-22 Thread David Rientjes
On Thu, 19 Oct 2017, Johannes Weiner wrote:

> David would have really liked for this patchset to include knobs to
> influence how the algorithm picks cgroup victims. The rest of us
> agreed that this is beyond the scope of these patches, that the
> patches don't need it to be useful, and that there is nothing
> preventing anyone from adding configurability later on. David
> subsequently nacked the series as he considers it incomplete. Neither
> Michal nor I see technical merit in David's nack.
> 

The nack is for three reasons:

 (1) unfair comparison of root mem cgroup usage to bias against that mem 
 cgroup from oom kill in system oom conditions,

 (2) the ability of users to completely evade the oom killer by attaching
 all processes to child cgroups either purposefully or unpurposefully,
 and

 (3) the inability of userspace to effectively control oom victim  
 selection.

For (1), the difference in v12 is adding the rss of all processes attached 
to the root mem cgroup as the evaluation.  This is not the same criteria 
that child cgroups are evaluated on, and they are compared using that 
bias.  It is very trivial to provide a fair comparison as was suggested in 
v11.

For (2), users who do

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; 
done

can completely evade the oom killer, and this may be part of their standard 
operating procedure for restricting resources with controllers other than 
the memory controller.  Again, it's an unfair comparison to all other 
cgroups on the system.

For (3), users need the ability to protect important cgroups, such as 
protecting a cgroup that is limited to 50% of system memory.  They need 
the ability to kill processes from other cgroups to protect these 
important processes.  This is nothing new: the oom killer has always 
provided the ability to bias against important processes.

There was follow-up email on all of these points where very trivial 
changes were suggested to address all three of these issues, and which 
Roman has implemented in one form or another in previous iterations with 
the bonus that no accounting to the root mem cgroup needs to be done.

> Michal acked the implementation, but on the condition that the new
> behavior be opt-in, to not surprise existing users. I *think* we agree
> that respecting the cgroup topography during global OOM is what we
> should have been doing when cgroups were initially introduced; where
> we disagree is that I think users shouldn't have to opt in to
> improvements. We have done much more invasive changes to the victim
> selection without actual regressions in the past. Further, this change
> only applies to mounts of the new cgroup2. Tejun also wasn't convinced
> of the risk for regression, and too would prefer cgroup-awareness to
> be the default in cgroup2. I would ask for patch 5/6 to be dropped.
> 

I agree with Michal that the new victim selection should be opt-in with 
CGRP_GROUP_OOM.


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-13 Thread David Rientjes
On Fri, 13 Oct 2017, Roman Gushchin wrote:

> > Think about it in a different way: we currently compare per-process usage 
> > and userspace has /proc/pid/oom_score_adj to adjust that usage depending 
> > on priorities of that process and still oom kill if there's a memory leak.  
> > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer 
> > after all.  We don't need a strict memory.oom_priority that outranks all 
> > other sibling cgroups regardless of usage.  We need a memory.oom_score_adj 
> > to adjust the per-cgroup usage.  The decisionmaking in your earlier 
> > example would be under the control of C/memory.oom_score_adj and 
> > D/memory.oom_score_adj.  Problem solved.
> > 
> > It also solves the problem of userspace being able to influence oom victim 
> > selection so now they can protect important cgroups just like we can 
> > protect important processes today.
> > 
> > And since this would be hierarchical usage, you can trivially infer root 
> > mem cgroup usage by subtraction of top-level mem cgroup usage.
> > 
> > This is a powerful solution to the problem and gives userspace the control 
> > they need so that it can work in all usecases, not a subset of usecases.
> 
> You're right that per-cgroup oom_score_adj may resolve the issue with
> too strict semantics of oom_priorities. But I believe nobody likes
> the existing per-process oom_score_adj interface, and there are reasons 
> behind.

The previous heuristic before I rewrote the oom killer used 
/proc/pid/oom_adj which acted as a bitshift on mm->total_vm, which was a 
much more difficult interface to use as I'm sure you can imagine.  People 
ended up only using it to polarize selection: either -17 to oom disable a 
process, -16 to bias against it, and 15 to prefer it.  Nobody used 
anything in between and I worked with openssh, udev, kde, and chromium to 
get a consensus on the oom_score_adj semantics.  People do use it to 
protect against memory leaks and to prevent oom killing important 
processes when something else can be sacrificed, unless there's a leak.

> Especially in case of memcg-OOM, getting the idea how exactly oom_score_adj
> will work is not trivial.

I suggest defining it in the terms used for previous iterations of the 
patchset: do hierarchical scoring so that each level of the hierarchy has 
usage information for each subtree.  You can get root mem cgroup usage 
with complete fairness by subtraction with this method.  When comparing 
usage at each level of the hierarchy, you can propagate the eligibility of 
processes in that subtree much like you do today.  I agree with your 
change to make the oom killer a no-op if selection races with the actual 
killing rather than falling back to the old heuristic.  I'm happy to help 
add a Tested-by once we settle the other issues with that change.

At each level, I would state that memory.oom_score_adj has the exact same 
semantics as /proc/pid/oom_score_adj.  In this case, it would simply be 
defined as a proportion of the parent's limit.  If the hierarchy is 
iterated starting at the root mem cgroup for system ooms and at the root 
of the oom memcg for memcg ooms, this should lead to the exact same oom 
killing behavior, which is desired.
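
One way the proportional semantics could work out, assuming the same 
-1000..1000 range and /1000 scaling as the per-process knob (the formula 
and numbers are only illustrative, not part of the patchset):

    adjusted_score = subtree_usage + (memory.oom_score_adj / 1000) * parent_limit

    e.g. parent limit = 64GB, subtree usage = 10GB, memory.oom_score_adj = 500:
         adjusted_score = 10GB + 0.5 * 64GB = 42GB when compared against siblings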

This solution would address the three concerns that I had: it allows the 
root mem cgroup to be compared fairly with leaf mem cgroups (with the 
bonus of not requiring root mem cgroup accounting thanks to your heuristic 
using global vmstats), it allows userspace to influence the decisionmaking 
so that users can protect cgroups that use 50% of memory because they are 
important, and it completely avoids users being able to change victim 
selection simply by creating child mem cgroups.

This would be a very powerful patchset.


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-12 Thread David Rientjes
On Wed, 11 Oct 2017, Roman Gushchin wrote:

> > But let's move the discussion forward to fix it.  To avoid necessarily 
> > accounting memory to the root mem cgroup, have we considered if it is even 
> > necessary to address the root mem cgroup?  For the users who opt-in to 
> > this heuristic, would it be possible to discount the root mem cgroup from 
> > the heuristic entirely so that oom kills originate from leaf mem cgroups?  
> > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if 
> > the victim rss is greater than an eligible victim rss attached to the root 
> > mem cgroup?
> 
> David, I'm not pretending for implementing the best possible accounting
> for the root memory cgroup, and I'm sure there is a place for further
> enhancement. But if it's not leading to some obviously stupid victim
> selection (like ignoring leaking task, which consumes most of the memory),
> I don't see why it should be treated as a blocker for the whole patchset.
> I also doubt that any of us has these examples, and the best way to get
> them is to get some real usage feedback.
> 
> Ignoring oom_score_adj, subtracting leaf usage sum from system usage etc,
> these all are perfect ideas which can be implemented on top of this patchset.
> 

For the root mem cgroup to be compared to leaf mem cgroups, it needs a 
fair comparison, not something that we leave to some future patches on top 
of this patchset.  We can't compare some cgroups with other cgroups based 
on different criteria depending on which cgroup is involved.  It's 
actually quite a trivial problem to address; it was a small modification 
to your hierarchical usage patchset, if that's the way you elect to fix 
it.

I know that some of our customers use cgroups only for one or two jobs on 
the system, and that isn't necessarily just for memory limitation.  The 
fact remains that, without considering the root mem cgroup fairly, these 
customers are unfairly biased against because they have aggregated their 
processes in a cgroup.  This is not a highly specialized usecase; I am 
positive that many users use cgroups only for a subset of processes.  This 
heuristic penalizes that behavior by preferring them as oom victims.

The problem needs to be fixed instead of asking for the patchset to be 
merged and hoping that we'll address these issues later.  If you account for 
hierarchical usage, you can easily subtract this from global vmstats to 
get an implicit root usage.
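
A rough userspace approximation of that subtraction, only to illustrate the 
idea (the kernel heuristic would use its own counters; this assumes cgroup2 
mounted at /sys/fs/cgroup with the memory controller enabled for the 
top-level cgroups):

    top_level_sum=0
    for f in /sys/fs/cgroup/*/memory.current; do
        top_level_sum=$((top_level_sum + $(cat "$f")))
    done
    in_use_kb=$(awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} END {print t - f}' /proc/meminfo)
    echo "approximate root usage (bytes): $(( in_use_kb * 1024 - top_level_sum ))"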

> > You would be required to discount oom_score_adj because the heuristic 
> > doesn't account for oom_score_adj when comparing the anon + unevictable + 
> > unreclaimable slab of leaf mem cgroups.  This wouldn't result in the 
> > correct victim selection in real-world scenarios where processes attached 
> > to the root mem cgroup are vital to the system and not part of any user 
> > job, i.e. they are important system daemons and the "activity manager" 
> > responsible for orchestrating the cgroup hierarchy.
> > 
> > It's also still unfair because it now compares
> > [sum of rss of processes attached to a cgroup] to
> > [anon + unevictable + unreclaimable slab usage of a cgroup].  RSS isn't 
> > going to be a solution, regardless if its one process or all processes, if 
> > it's being compared to more types of memory in leaf cgroups.
> > 
> > If we really don't want root mem cgroup accounting so this is a fair 
> > comparison, I think the heuristic needs to special case the root mem 
> > cgroup either by discounting root oom kills if there are eligible oom 
> > kills from leaf cgroups (the user would be opting-in to this behavior) or 
> > comparing the badness of a victim from a leaf cgroup to the badness of a 
> > victim from the root cgroup when deciding which to kill and allow the user 
> > to protect root mem cgroup processes with oom_score_adj.
> > 
> > That aside, all of this has only addressed one of the three concerns with 
> > the patchset.
> > 
> > I believe the solution to avoid allowing users to circumvent oom kill is 
> > to account usage up the hierarchy as you have done in the past.  Cgroup 
> > hierarchies can be managed by the user so they can create their own 
> > subcontainers, this is nothing new, and I would hope that you wouldn't 
> > limit your feature to only a very specific set of usecases.  That may be 
> > your solution for the root mem cgroup itself: if the hierarchical usage of 
> > all top-level mem cgroups is known, it's possible to find the root mem 
> > cgroup usage by subtraction, you are using stats that are global vmstats 
> > in your heuristic.
> > 
> > Accounting usage up the hierarchy avoids the first two concerns with the 
> > patchset.  It allows you to implicitly understand the usage of the root 
> > mem cgroup itself, and does not allow users to circumvent oom kill by 
> > creating subcontainers, either purposefully or not.  The third concern, 
> > userspace influence, can allow users to attack leaf mem cgroups deeper in 
> > the tree if 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread David Rientjes
On Wed, 11 Oct 2017, Michal Hocko wrote:

> > For these reasons: unfair comparison of root mem cgroup usage to bias 
> > against that mem cgroup from oom kill in system oom conditions, the 
> > ability of users to completely evade the oom killer by attaching all 
> > processes to child cgroups either purposefully or unpurposefully, and the 
> > inability of userspace to effectively control oom victim selection:
> > 
> > Nacked-by: David Rientjes <rient...@google.com>
> 
> I consider this NACK rather dubious. Evading the heuristic as you
> describe requires root privileges in default configuration because
> normal users are not allowed to create subtrees. If you
> really want to delegate subtree to an untrusted entity then you do not
> have to opt-in for this oom strategy. We can work on an additional means
> which would allow to cover those as well (e.g. priority based one which
> is requested for other usecases).
> 

You're missing the point that the user is trusted and it may be doing 
something to circumvent oom kill unknowingly.  With a single unified 
hierarchy, the user is forced to attach its processes to subcontainers if 
it wants to constrain resources with other controllers.  Doing so ends up 
completely avoiding oom kill because of this implementation detail.  It 
has nothing to do with trust and the admin who is opting-in will not know 
a user has circumvented oom kill purely because it constrains its processes 
with controllers other than the memory controller.

> A similar argument applies to the root memcg evaluation. While the
> proposed behavior is not optimal it would work for general usecase
> described here where the root memcg doesn't really run any large number
> of tasks. If somebody who explicitly opts-in for the new strategy and it
> doesn't work well for that usecase we can enhance the behavior. That
> alone is not a reason to nack the whole thing.
> 
> I find it really disturbing that you keep nacking this approach just
> because it doesn't suite your specific usecase while it doesn't break
> it. Moreover it has been stated several times already that future
> improvements are possible and cover what you have described already.

This has nothing to do with my specific usecase.


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread David Rientjes
On Tue, 10 Oct 2017, Roman Gushchin wrote:

> > We don't need a better approximation, we need a fair comparison.  The 
> > heuristic that this patchset is implementing is based on the usage of 
> > individual mem cgroups.  For the root mem cgroup to be considered 
> > eligible, we need to understand its usage.  That usage is _not_ what is 
> > implemented by this patchset, which is the largest rss of a single 
> > attached process.  This, in fact, is not an "approximation" at all.  In 
> > the example of 1 processes attached with 80MB rss each, the usage of 
> > the root mem cgroup is _not_ 80MB.
> 
> It's hard to imagine a "healthy" setup with 1 process in the root
> memory cgroup, and even if we kill 1 process we will still have 
> remaining process. I agree with you at some point, but it's not
> a real world example.
> 

It's an example that illustrates the problem with the unfair comparison 
between the root mem cgroup and leaf mem cgroups.  It's unfair to compare 
[largest rss of a single process attached to a cgroup] to
[anon + unevictable + unreclaimable slab usage of a cgroup].  It's not an 
approximation, as previously stated: the usage of the root mem cgroup is 
not 100MB if there are 10 such processes attached to the root mem cgroup, 
it's off by orders of magnitude.

For the root mem cgroup to be treated equally as a leaf mem cgroup as this 
patchset proposes, it must have a fair comparison.  That can be done by 
accounting memory to the root mem cgroup in the same way it is to leaf mem 
cgroups.

But let's move the discussion forward to fix it.  To avoid necessarily 
accounting memory to the root mem cgroup, have we considered if it is even 
necessary to address the root mem cgroup?  For the users who opt-in to 
this heuristic, would it be possible to discount the root mem cgroup from 
the heuristic entirely so that oom kills originate from leaf mem cgroups?  
Or, perhaps better, oom kill from non-memory.oom_group cgroups only if 
the victim rss is greater than an eligible victim rss attached to the root 
mem cgroup?

> > For these reasons: unfair comparison of root mem cgroup usage to bias 
> > against that mem cgroup from oom kill in system oom conditions, the 
> > ability of users to completely evade the oom killer by attaching all 
> > processes to child cgroups either purposefully or unpurposefully, and the 
> > inability of userspace to effectively control oom victim selection:
> > 
> > Nacked-by: David Rientjes <rient...@google.com>
> 
> So, if we'll sum the oom_score of tasks belonging to the root memory cgroup,
> will it fix the problem?
> 
> It might have some drawbacks as well (especially around oom_score_adj),
> but it's doable, if we'll ignore tasks which are not owners of their's mm 
> struct.
> 

You would be required to discount oom_score_adj because the heuristic 
doesn't account for oom_score_adj when comparing the anon + unevictable + 
unreclaimable slab of leaf mem cgroups.  This wouldn't result in the 
correct victim selection in real-world scenarios where processes attached 
to the root mem cgroup are vital to the system and not part of any user 
job, i.e. they are important system daemons and the "activity manager" 
responsible for orchestrating the cgroup hierarchy.

It's also still unfair because it now compares
[sum of rss of processes attached to a cgroup] to
[anon + unevictable + unreclaimable slab usage of a cgroup].  RSS isn't 
going to be a solution, regardless of whether it's one process or all 
processes, if it's being compared to more types of memory in leaf cgroups.

If we really don't want root mem cgroup accounting so this is a fair 
comparison, I think the heuristic needs to special case the root mem 
cgroup either by discounting root oom kills if there are eligible oom 
kills from leaf cgroups (the user would be opting-in to this behavior) or 
comparing the badness of a victim from a leaf cgroup to the badness of a 
victim from the root cgroup when deciding which to kill and allow the user 
to protect root mem cgroup processes with oom_score_adj.

That aside, all of this has only addressed one of the three concerns with 
the patchset.

I believe the solution to avoid allowing users to circumvent oom kill is 
to account usage up the hierarchy as you have done in the past.  Cgroup 
hierarchies can be managed by the user so they can create their own 
subcontainers, this is nothing new, and I would hope that you wouldn't 
limit your feature to only a very specific set of usecases.  That may be 
your solution for the root mem cgroup itself: if the hierarchical usage of 
all top-level mem cgroups is known, it's possible to find the root mem 
cgroup usage by subtraction, you are using stats that are global vmstats 
in your heuristic.

Accounting usage up the hierarchy avoids the first two concerns with the 
patchset.

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-10 Thread David Rientjes
> To all previously said words I can only add that cgroup v2 allows to limit
> the amount of cgroups in the sub-tree:
> 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> 

So the solution to 

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

evading all oom kills for your mem cgroup is to limit the number of 
cgroups that can be created by the user?  With a unified cgroup hierarchy, 
that doesn't work well if I wanted to actually constrain these individual 
processes to different resource limits like cpu usage.  In fact, the user 
may not know it is effectively evading the oom killer entirely just 
because it has constrained the cpu of individual processes: it's a 
side-effect of this heuristic.
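
For example (paths and values are only illustrative), constraining a single 
process's cpu usage under cgroup v2 already forces it into a child cgroup 
because of the no internal process constraint:

    echo "+cpu" > /sys/fs/cgroup/job/cgroup.subtree_control
    mkdir /sys/fs/cgroup/job/proc1
    echo $PID1 > /sys/fs/cgroup/job/proc1/cgroup.procs
    echo "50000 100000" > /sys/fs/cgroup/job/proc1/cpu.max
    # as a side-effect, /job's memory usage is now split across children and no
    # single leaf may ever be the largest consumer on the system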


You chose not to respond to my reiteration of userspace having absolutely 
no control over victim selection with the new heuristic without setting 
all processes to be oom disabled via /proc/pid/oom_score_adj.  If I have a 
very important job that is running on a system that is really supposed to 
use 80% of memory, I need to be able to specify that it should not be oom 
killed based on user goals.  Setting all processes to be oom disabled in 
the important mem cgroup to avoid being oom killed unless absolutely 
necessary in a system oom condition is not a robust solution: (1) the mem 
cgroup livelocks if it reaches its own mem cgroup limit and (2) the system 
panic()'s if these preferred mem cgroups are the only consumers left on 
the system.  With overcommit, both of these possibilities exist in the 
wild and the problem is only a result of the implementation detail of this 
patchset.
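
The workaround looks like this (the path is illustrative), and it is 
exactly what reintroduces the livelock and panic cases above:

    for p in $(cat /sys/fs/cgroup/important/cgroup.procs); do
        echo -1000 > /proc/$p/oom_score_adj
    done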

For these reasons: unfair comparison of root mem cgroup usage to bias 
against that mem cgroup from oom kill in system oom conditions, the 
ability of users to completely evade the oom killer by attaching all 
processes to child cgroups either purposefully or unpurposefully, and the 
inability of userspace to effectively control oom victim selection:

Nacked-by: David Rientjes <rient...@google.com>

> > This is racy because mem_cgroup_select_oom_victim() found an eligible 
> > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible 
> > process but mem_cgroup_scan_task(oc->chosen_memcg) did not.  It means if a 
> > process cannot be killed because of oom_unkillable_task(), the only 
> > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the 
> > eligible processes changed, we end up falling back to the complete 
> > tasklist scan.  It would be better for oom_evaluate_memcg() to consider 
> > oom_unkillable_task() and also retry in the case where 
> > oom_kill_memcg_victim() returns NULL.
> 
> I agree with you here. The fallback to the existing mechanism is implemented
> to be safe for sure, especially in a case of a global OOM. When we'll get
> more confidence in cgroup-aware OOM killer reliability, we can change this
> behavior. Personally, I would prefer to get rid of looking at all tasks just
> to find a pre-existing OOM victim, but it might be quite tricky to implement.
> 

I'm not sure what this has to do with confidence in this patchset's 
reliability?  The race obviously exists: mem_cgroup_select_oom_victim() 
found an eligible process in oc->chosen_memcg but it was either ineligible 
later because of oom_unkillable_task(), it moved, or it exited.  It's a 
race.  For users who opt-in to this new heuristic, they should not be 
concerned with a process exiting and thus killing a completely unexpected 
process from an unexpected memcg when it should be possible to retry and 
select the correct victim.

It's much better to document and state to the user what they are opting-in 
to and clearly define how a victim is chosen with the new heuristic and 
then implement that so it works correctly.


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-09 Thread David Rientjes
On Thu, 5 Oct 2017, Roman Gushchin wrote:

> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 

I'd move the second point to the changelog for the next patch since this 
patch doesn't implement any support for memory.oom_group.

> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with leaf memory cgroups.
> Due to memcg statistics implementation a special algorithm
> is used for estimating it's oom_score: we define it as maximum
> oom_score of the belonging tasks.
> 

This seems to unfairly bias the root mem cgroup depending on process size.  
It isn't treated fairly as a leaf mem cgroup if they are being compared 
based on different criteria: the root mem cgroup as (mostly) the largest 
rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
unreclaimable slab pages charged to it by all processes.

I imagine a configuration where the root mem cgroup has 100 processes 
attached each with rss of 80MB, compared to a leaf cgroup with 100 
processes of 1MB rss each.  How does this logic prevent repeatedly oom 
killing the processes of 1MB rss?
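
Roughly, per the scoring described in the changelog and assuming 4KB pages 
with no swap or kernel overhead (the arithmetic is only illustrative):

    root score  ~ largest single task badness ~  80MB ~ 20,480 pages
    leaf score  ~ sum of charged memory       ~ 100MB ~ 25,600 pages

so the leaf cgroup, using ~100MB in total, outscores the root cgroup, which 
is using ~8000MB in total, and its 1MB tasks are killed first.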

In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
quite fair: it can simply hide large processes from being selected.  Users 
who configure cgroups in a unified hierarchy for other resource 
constraints are penalized for this choice even though the mem cgroup with 
100 processes of 1MB rss each may not be limited itself.

I think for this comparison to be fair, it requires accounting for the 
root mem cgroup itself or for a different accounting methodology for leaf 
memory cgroups.

> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> + *0: memcg is not eligible, e.g. all belonging tasks are protected
> + *   by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *   of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +const nodemask_t *nodemask,
> +unsigned long totalpages)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int eligible = 0;
> +
> + /*
> +  * Memcg is OOM eligible if there are OOM killable tasks inside.
> +  *
> +  * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +  * as unkillable.
> +  *
> +  * If there are inflight OOM victim tasks inside the memcg,
> +  * we return -1.
> +  */
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (!eligible &&
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + eligible = 1;
> +
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + eligible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);
> +
> + if (eligible <= 0)
> + return eligible;
> +
> + return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
> *oc)
> +{
> + struct mem_cgroup *iter;
> +
> + oc->chosen_memcg = NULL;
> + oc->chosen_points = 0;
> +
> + /*
> +  * The oom_score is calculated for leaf memory cgroups (including
> +  * the root memcg).
> +  */
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, root) {
> + long score;
> +
> + if (memcg_has_children(iter) && iter != root_mem_cgroup)
> + continue;

I'll reiterate what I did on the last version of the patchset: considering 
only leaf memory cgroups easily allows users to defeat this heuristic and 
bias against all of their memory 

Re: [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-09 Thread David Rientjes
On Thu, 5 Oct 2017, Roman Gushchin wrote:

> Implement mem_cgroup_scan_tasks() functionality for the root
> memory cgroup to use this function for looking for a OOM victim
> task in the root memory cgroup by the cgroup-ware OOM killer.
> 
> The root memory cgroup is treated as a leaf cgroup, so only tasks
> which are directly belonging to the root cgroup are iterated over.
> 
> This patch doesn't introduce any functional change as
> mem_cgroup_scan_tasks() is never called for the root memcg.
> This is preparatory work for the cgroup-aware OOM killer,
> which will use this function to iterate over tasks belonging
> to the root memcg.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>

Acked-by: David Rientjes <rient...@google.com>


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes
On Thu, 5 Oct 2017, Roman Gushchin wrote:

> > This patchset exists because overcommit is real, exactly the same as 
> > overcommit within memcg hierarchies is real.  99% of the time we don't run 
> > into global oom because people aren't using their limits so it just works 
> > out.  1% of the time we run into global oom and we need a decision to made 
> > based for forward progress.  Using Michal's earlier example of admins and 
> > students, a student can easily use all of his limit and also, with v10 of 
> > this patchset, 99% of the time avoid being oom killed just by forking N 
> > processes over N cgroups.  It's going to oom kill an admin every single 
> > time.
> 
> Overcommit is real, but configuring the system in a way that system-wide OOM
> happens often is a strange idea.

I wouldn't consider 1% of the time to be often, but the incident rate 
depends on many variables and who is sharing the same machine.  We can be 
smart about it and limit the potential for it in many ways, but the end 
result is that we still do overcommit and the system oom killer can be 
used to free memory from a low priority process.

> As we all know, the system can barely work
> adequate under global memory shortage: network packets are dropped, latency
> is bad, weird kernel issues are revealed periodically, etc.
> I do not see, why you can't overcommit on deeper layers of cgroup hierarchy,
> avoiding system-wide OOM to happen.
> 

Whether it's a system oom or whether its part of the cgroup hierarchy 
doesn't really matter, what matters is that overcommit occurs and we'd 
like to kill based on cgroup usage for each cgroup and its subtree, much 
like your earlier iterations, and also have the ability for userspace to 
influence that.

Without a cgroup-aware oom killer, I can prefer against killing an 
important job that uses 80% of memory and I want it to continue using 80% 
of memory.  We don't have that control over the cgroup-aware oom killer 
although we want to consider cgroup and subtree usage when choosing 
amongst cgroups with the same priority.  If you are not interested in 
defining the oom priority, all can remain at the default and there is no 
compatibility issue.

> > I know exactly why earlier versions of this patchset iterated that usage 
> > up the tree so you would pick from students, pick from this troublemaking 
> > student, and then oom kill from his hierarchy.  Roman has made that point 
> > himself.  My suggestion was to add userspace influence to it so that 
> > enterprise users and users with business goals can actually define that we 
> > really do want 80% of memory to be used by this process or this hierarchy, 
> > it's in our best interest.
> 
> I'll repeat myself: I believe that there is a range of possible policies:
> from a complete flat (what Johannes did suggested few weeks ago), to a very
> hierarchical (as in v8). Each with their pros and cons.
> (Michal did provide a clear example of bad behavior of the hierarchical 
> approach).
> 
> I assume, that v10 is a good middle point, and it's good because it doesn't
> prevent further development. Just for example, you can introduce a third state
> of oom_group knob, which will mean "evaluate as a whole, but do not kill all".
> And this is what will solve your particular case, right?
> 

I would need to add patches to add the "evaluate as a whole but do not 
kill all" knob and a knob for "oom priority" so that userspace has the 
same influence over a cgroup based comparison that it does with a process 
based comparison to meet business goals.  I'm not sure I'd be happy to 
pollute the mem cgroup v2 filesystem with such knobs when you can easily 
just not set the priority if you don't want to, and increase the priority 
if you have a leaf cgroup that should be preferred to be killed because of 
excess usage.  It has worked quite well in practice.


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes
On Thu, 5 Oct 2017, Johannes Weiner wrote:

> > It is, because it can quite clearly be a DoS, and was prevented with 
> > Roman's earlier design of iterating usage up the hierarchy and comparing 
> > siblings based on that criteria.  I know exactly why he chose that 
> > implementation detail early on, and it was to prevent cases such as this 
> > and to not let userspace hide from the oom killer.
> 
> This doesn't address how it's different from a single process
> following the same pattern right now.
> 

Are you referring to a single process being rewritten into N different 
subprocesses that do the same work as the single process, but split up in 
this manner so that no single process has a large rss and becomes the oom 
kill target?

This is solved by a cgroup-aware oom killer because these subprocesses 
should not be able to escape their own chargeable entity.  It's exactly the 
usecase that Roman is addressing, correct?  My suggestion is to continue 
to iterate the usage up the hierarchy so that users can't easily defeat 
this by creating N subcontainers instead.

> > Let's resolve that global oom is a real condition and getting into that 
> > situation is not a userspace problem.  It's the result of overcommitting 
> > the system, and is used in the enterprise to address business goals.  If 
> > the above is true, and its up to memcg to prevent global oom in the first 
> > place, then this entire patchset is absolutely pointless.  Limit userspace 
> > to 95% of memory and when usage is approaching that limit, let userspace 
> > attached to the root memcg iterate the hierarchy itself and kill from the 
> > largest consumer.
> > 
> > This patchset exists because overcommit is real, exactly the same as 
> > overcommit within memcg hierarchies is real.  99% of the time we don't run 
> > into global oom because people aren't using their limits so it just works 
> > out.  1% of the time we run into global oom and we need a decision to made 
> > based for forward progress.  Using Michal's earlier example of admins and 
> > students, a student can easily use all of his limit and also, with v10 of 
> > this patchset, 99% of the time avoid being oom killed just by forking N 
> > processes over N cgroups.  It's going to oom kill an admin every single 
> > time.
> 
> We overcommit too, but our workloads organize themselves based on
> managing their resources, not based on evading the OOM killer. I'd
> wager that's true for many if not most users.
> 

No workloads are based on evading the oom killer, we are specifically 
trying to avoid that with oom priorities.  They have the power to increase 
their own priority so that they are preferred for killing, but not to 
decrease the oom priority that was set by an activity manager.  This is exactly 
the same as how /proc/pid/oom_score_adj works.  With a cgroup-aware oom 
killer, which we'd love, nothing can possibly evade the oom killer if 
there are oom priorities.


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes
On Wed, 4 Oct 2017, Johannes Weiner wrote:

> > By only considering leaf memcgs, does this penalize users if their memcg 
> > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > processes to be members of that memcg, which would otherwise be the 
> > standard behavior?
> > 
> > What prevents me from spreading my memcg with N processes attached over N 
> > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > each child memcg specifically to avoid being oom killed?
> 
> It's no different from forking out multiple mm to avoid being the
> biggest process.
> 

It is, because it can quite clearly be a DoS, and was prevented with 
Roman's earlier design of iterating usage up the hierarchy and comparing 
siblings based on that criteria.  I know exactly why he chose that 
implementation detail early on, and it was to prevent cases such as this 
and to not let userspace hide from the oom killer.

> It's up to the parent to enforce limits on that group and prevent you
> from being able to cause global OOM in the first place, in particular
> if you delegate to untrusted and potentially malicious users.
> 

Let's resolve that global oom is a real condition and getting into that 
situation is not a userspace problem.  It's the result of overcommitting 
the system, and is used in the enterprise to address business goals.  If 
the above is true, and its up to memcg to prevent global oom in the first 
place, then this entire patchset is absolutely pointless.  Limit userspace 
to 95% of memory and when usage is approaching that limit, let userspace 
attached to the root memcg iterate the hierarchy itself and kill from the 
largest consumer.

This patchset exists because overcommit is real, exactly the same as 
overcommit within memcg hierarchies is real.  99% of the time we don't run 
into global oom because people aren't using their limits so it just works 
out.  1% of the time we run into global oom and we need a decision to made 
based for forward progress.  Using Michal's earlier example of admins and 
students, a student can easily use all of his limit and also, with v10 of 
this patchset, 99% of the time avoid being oom killed just by forking N 
processes over N cgroups.  It's going to oom kill an admin every single 
time.

I know exactly why earlier versions of this patchset iterated that usage 
up the tree so you would pick from students, pick from this troublemaking 
student, and then oom kill from his hierarchy.  Roman has made that point 
himself.  My suggestion was to add userspace influence to it so that 
enterprise users and users with business goals can actually define that we 
really do want 80% of memory to be used by this process or this hierarchy, 
it's in our best interest.

Earlier iterations of this patchset did this, and did it correctly.  
Userspace influence over the decisionmaking makes it a very powerful 
combination because you _can_ specify what your goals are or choose to 
leave the priorities as default so you can compare based solely on usage.  
It was a beautiful solution to the problem.


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread David Rientjes
On Wed, 4 Oct 2017, Roman Gushchin wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b4de17a78dc1..79f30c281185 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct 
> mem_cgroup *memcg)
>   return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +   const nodemask_t *nodemask,
> +   unsigned long totalpages)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + /*
> +  * We don't have necessary stats for the root memcg,
> +  * so we define it's oom_score as the maximum oom_score
> +  * of the belonging tasks.
> +  *
> +  * As tasks in the root memcg unlikely are parts of a
> +  * single workload, and we don't have to implement
> +  * group killing, this approximation is reasonable.
> +  *
> +  * But if we will have necessary stats for the root memcg,
> +  * we might switch to the approach which is used for all
> +  * other memcgs.
> +  */
> + if (memcg == root_mem_cgroup) {
> + struct css_task_iter it;
> + struct task_struct *task;
> + long score, max_score = 0;
> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + score = oom_badness(task, memcg, nodemask,
> + totalpages);
> + if (score > max_score)
> + max_score = score;
> + }
> + css_task_iter_end(&it);
> +
> + return max_score;
> + }
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> + pgdat = NODE_DATA(nid);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the following:
> + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> + *0: memcg is not eligible, e.g. all belonging tasks are protected
> + *   by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *   of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +const nodemask_t *nodemask,
> +unsigned long totalpages)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int eligible = 0;
> +
> + /*
> +  * Memcg is OOM eligible if there are OOM killable tasks inside.
> +  *
> +  * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +  * as unkillable.
> +  *
> +  * If there are inflight OOM victim tasks inside the memcg,
> +  * we return -1.
> +  */
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (!eligible &&
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + eligible = 1;
> +
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + eligible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);
> +
> + if (eligible <= 0)
> + return eligible;
> +
> + return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
> *oc)
> +{
> + struct mem_cgroup *iter;
> +
> + oc->chosen_memcg = NULL;
> + oc->chosen_points = 0;
> +
> + /*
> +  * The oom_score is calculated for leaf memory cgroups (including
> +  * the root memcg).
> +  */
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, root) {
> + long score;
> +
> + if (memcg_has_children(iter))
> + continue;
> +
> + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> + /*
> +  * Ignore empty and non-eligible memory cgroups.
> +  */
> + if (score == 0)
> + continue;
> +
> + /*
> +  * If there are inflight OOM victims, we don't need
> +  * to look further for new victims.
> +  */
> + 

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread David Rientjes
On Wed, 4 Oct 2017, Roman Gushchin wrote:

> > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct 
> > > *victim)
> > >   struct mm_struct *mm;
> > >   bool can_oom_reap = true;
> > >  
> > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > > + victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > > + put_task_struct(victim);
> > > + return;
> > > + }
> > > +
> > >   p = find_lock_task_mm(victim);
> > >   if (!p) {
> > >   put_task_struct(victim);
> > 
> > Is this necessary? The callers of this function use oom_badness() to
> > find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.
> 
> It is. __oom_kill_process() is used to kill all processes belonging
> to the selected memory cgroup, so we should perform these checks
> to avoid killing unkillable processes.
> 

That's only true after the next patch in the series which uses the 
oom_kill_memcg_member() callback to kill processes for oom_group, correct?  
Would it be possible to move this check to that patch so it's more 
obvious?


Re: [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-04 Thread David Rientjes
On Wed, 4 Oct 2017, Roman Gushchin wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d5f3a62887cf..b4de17a78dc1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct 
> mem_cgroup *dead_memcg)
>   * value, the function breaks the iteration loop and returns the value.
>   * Otherwise, it will iterate over all tasks and return 0.
>   *
> - * This function must not be called for the root memory cgroup.
> + * If memcg is the root memory cgroup, this function will iterate only
> + * over tasks belonging directly to the root memory cgroup.
>   */
>  int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
> int (*fn)(struct task_struct *, void *), void *arg)
> @@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>   struct mem_cgroup *iter;
>   int ret = 0;
>  
> - BUG_ON(memcg == root_mem_cgroup);
> -
>   for_each_mem_cgroup_tree(iter, memcg) {
>   struct css_task_iter it;
>   struct task_struct *task;
> @@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>   while (!ret && (task = css_task_iter_next(&it)))
>   ret = fn(task, arg);
>   css_task_iter_end(&it);
> - if (ret) {
> + if (ret || memcg == root_mem_cgroup) {
>   mem_cgroup_iter_break(memcg, iter);
>   break;
>   }

I think this is fine, but a little strange to start an iteration that 
never loops :)  No objection to the patch but it could also be extracted 
into a new mem_cgroup_scan_tasks() which actually scans the tasks in that 
mem cgroup and then a wrapper that does the iteration that calls into it, 
say, mem_cgroup_scan_tasks_tree().
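
Roughly, what I have in mind is something like the following sketch, which
just follows the quoted code; the naming and placement are suggestions only,
not a tested patch:

int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
                          int (*fn)(struct task_struct *, void *), void *arg)
{
        struct css_task_iter it;
        struct task_struct *task;
        int ret = 0;

        /* scan only the tasks attached directly to @memcg */
        css_task_iter_start(&memcg->css, 0, &it);
        while (!ret && (task = css_task_iter_next(&it)))
                ret = fn(task, arg);
        css_task_iter_end(&it);

        return ret;
}

int mem_cgroup_scan_tasks_tree(struct mem_cgroup *memcg,
                               int (*fn)(struct task_struct *, void *),
                               void *arg)
{
        struct mem_cgroup *iter;
        int ret = 0;

        /* walk the whole subtree, stopping at the first non-zero return */
        for_each_mem_cgroup_tree(iter, memcg) {
                ret = mem_cgroup_scan_tasks(iter, fn, arg);
                if (ret) {
                        mem_cgroup_iter_break(memcg, iter);
                        break;
                }
        }

        return ret;
}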


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-26 Thread David Rientjes
On Tue, 26 Sep 2017, Michal Hocko wrote:

> > No, I agree that we shouldn't compare sibling memory cgroups based on 
> > different criteria depending on whether group_oom is set or not.
> > 
> > I think it would be better to compare siblings based on the same criteria 
> > independent of group_oom if the user has mounted the hierarchy with the 
> > new mode (I think we all agree that the mount option is needed).  It's 
> > very easy to describe to the user and the selection is simple to 
> > understand. 
> 
> I disagree. Just take the most simplistic example when cgroups reflect
> some other higher level organization - e.g. school with teachers,
> students and admins as the top level cgroups to control the proper cpu
> share load. Now you want to have a fair OOM selection between different
> entities. Do you consider selecting students all the time as an expected
> behavior just because they are the largest group? This just doesn't
> make any sense to me.
> 

Are you referring to this?

             root
            /    \
     students    admins
      /    \      /  \
     A      B    C    D

If the cumulative usage of all students exceeds the cumulative usage of 
all admins, yes, the choice is to kill from the /students tree.  This has 
been Roman's design from the very beginning.  If the preference is to kill 
the single largest process, which may be attached to either subtree, you 
would not have opted-in to the new heuristic.

> > Then, once a cgroup has been chosen as the victim cgroup, 
> > kill the process with the highest badness, allowing the user to influence 
> > that with /proc/pid/oom_score_adj just as today, if group_oom is disabled; 
> > otherwise, kill all eligible processes if enabled.
> 
> And now, what should be the semantic of group_oom on an intermediate
> (non-leaf) memcg? Why should we compare it to other killable entities?
> Roman was mentioning a setup where a _single_ workload consists of a
> deeper hierarchy which has to be shut down at once. It absolutely makes
> sense to consider the cumulative memory of that hierarchy when we are
> going to kill it all.
> 

If group_oom is enabled on an intermediate memcg, I think the intuitive 
way to handle it would be that all descendants are also implicitly or 
explicitly group_oom.  It is compared to sibling cgroups based on 
cumulative usage at the time of oom and the largest is chosen and 
iterated.  The point is to separate out the selection heuristic (policy) 
from group_oom (mechanism) so that we don't bias or prefer subtrees based 
on group_oom, which makes this much more complex.
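
For the inheritance part, a minimal sketch of what "implicitly group_oom"
could look like; the oom_group field name here is an assumption on my part,
not something quoted from the patchset:

static bool mem_cgroup_effective_oom_group(struct mem_cgroup *memcg)
{
        struct mem_cgroup *iter;

        /* group_oom set anywhere on the path to the root makes this
         * subtree an indivisible kill unit */
        for (iter = memcg; iter; iter = parent_mem_cgroup(iter))
                if (iter->oom_group)
                        return true;

        return false;
}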

> But what you are proposing is something different from oom_score_adj.
> That only sets bias to the killable entities while priorities on
> intermediate non-killable memcgs controls how the whole oom hierarchy
> is traversed. So a non-killable intermediate memcg can hugely influence
> what gets killed in the end.

Why should an intermediate non-killable memcg be allowed at all?  Cgroup oom 
priorities should not be allowed to disable oom killing, it should only 
set a priority.  The only reason an intermediate cgroup should be 
non-killable is if there are no processes attached, but I don't think 
anyone is arguing we should just do nothing in that scenario.  The point 
is that the user has influence over the decisionmaking with a per-process 
heuristic with oom_score_adj and should also have influence over the 
decisionmaking with a per-cgroup heuristic.

> This is IMHO a tricky and I would even dare
> to claim a wrong semantic. I can see priorities being very useful on
> killable entities for sure. I am not entirely sure what would be the
> best approach yet and that is why I've suggested that to postpone to
> after we settle with a simple approach first. Bringing priorities back
> to the discussion again will not help to move that forward I am afraid.
> 

I agree to keep it as simple as possible, especially since some users want 
specific victim selection, it should be clear to document, and it 
shouldn't be influenced by some excessive amount of usage in another 
subtree the user has no control over (/admins over /students) to prevent 
the user from defining that it really wants to be the first oom victim or 
the admin from defining it really prefers something else killed first.

My suggestion is that Roman's implementation is clear, well defined, and 
has real-world usecases and it should be the direction that this moves in.  
I think victim selection and group_oom are distinct and should not 
influence the decisionmaking.  I think that oom_priority should influence 
the decisionmaking.

When mounted with the new option, as the oom hierarchy is iterated, 
compare all sibling cgroups regarding cumulative size unless an oom 
priority overrides that (either user specifying it wants to be oom killed 
or admin specifying it prefers something else).  When a victim memcg is 
chosen, use group_oom to determine what should be killed, otherwise choose 
by oom_score_adj.  I can't imagine 

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-25 Thread David Rientjes
On Mon, 25 Sep 2017, Johannes Weiner wrote:

> > True but we want to have the semantic reasonably understandable. And it
> > is quite hard to explain that the oom killer hasn't selected the largest
> > memcg just because it happened to be in a deeper hierarchy which has
> > been configured to cover a different resource.
> 
> Going back to Michal's example, say the user configured the following:
> 
>        root
>       /    \
>      A      D
>     / \
>    B   C
> 
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
> 
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
> 
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
> 
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
> 

No, I agree that we shouldn't compare sibling memory cgroups based on 
different criteria depending on whether group_oom is set or not.

I think it would be better to compare siblings based on the same criteria 
independent of group_oom if the user has mounted the hierarchy with the 
new mode (I think we all agree that the mount option is needed).  It's 
very easy to describe to the user and the selection is simple to 
understand.  Then, once a cgroup has been chosen as the victim cgroup, 
kill the process with the highest badness, allowing the user to influence 
that with /proc/pid/oom_score_adj just as today, if group_oom is disabled; 
otherwise, kill all eligible processes if enabled.

That, to me, is a very clear semantic and I believe it addresses Roman's 
usecase.  My desire to have oom priorities amongst siblings is so that 
userspace can influence which cgroup is chosen, just as it can influence 
which process is chosen.

I see group_oom as a mechanism to be used when victim selection has 
already been done instead of something that should be considered in the 
policy of victim selection.
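
In pseudocode, that separation looks something like the sketch below; the
helper names are placeholders, except oom_kill_memcg_member(), which follows
the callback name used by the patchset for the kill-all case:

static void oom_kill_selected_memcg(struct mem_cgroup *victim, bool oom_group)
{
        if (oom_group) {
                /* kill-all: every eligible task in the victim cgroup dies */
                mem_cgroup_scan_tasks(victim, oom_kill_memcg_member, NULL);
        } else {
                /*
                 * single kill: the task with the highest badness inside the
                 * victim cgroup, still honoring /proc/pid/oom_score_adj
                 */
                oom_kill_task_with_highest_badness(victim); /* placeholder */
        }
}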

> Let's consider an extreme case of this conundrum:
> 
>          root
>          /  \
>         A    B
>        /|\   |
>    A1-A1000  B1
> 
> Again we find:
> - A > B
> - A1 to A1000 and B1 are oomgroups
> But:
> - A1 to A1000 individually are tiny, B1 is huge
> 
> Going level by level, we'd pick A as the bigger hierarchy in the
> system, and then kill off one of the tiny groups A1 to A1000.
> 
> Conversely, going for biggest consumer regardless of hierarchy, we'd
> compare A1 to A1000 and B1, then pick B1 as the biggest single atomic
> memory consumer in the system and kill all its tasks.
> 

If we compare sibling memcgs independent of group_oom, we don't 
necessarily pick A unless it really is larger than B.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-23 Thread David Rientjes
On Fri, 22 Sep 2017, Tejun Heo wrote:

> > If you have this low priority maintenance job charging memory to the high 
> > priority hierarchy, you're already misconfigured unless you adjust 
> > /proc/pid/oom_score_adj because it will oom kill any larger process than 
> > itself in today's kernels anyway.
> > 
> > A better configuration would be attach this hypothetical low priority 
> > maintenance job to its own sibling cgroup with its own memory limit to 
> > avoid exactly that problem: it going berserk and charging too much memory 
> > to the high priority container that results in one of its processes 
> > getting oom killed.
> 
> And how do you guarantee that across delegation boundaries?  The
> points you raise on why the priority should be applied level-by-level
> are exactly the same points why this doesn't really work.  OOM killing
> priority isn't something which can be distributed across cgroup
> hierarchy level-by-level.  The resulting decision tree doesn't make
> any sense.
> 

It works very well in practice with real world usecases, and Roman has 
independently developed the same design that we have used for the past 
four years.  Saying it doesn't make any sense doesn't hold a lot of weight 
when we both independently designed and implemented the same solution to 
address our usecases.

> I'm not against adding something which works but strict level-by-level
> comparison isn't the solution.
> 

Each of the eight versions of Roman's cgroup aware oom killer has done 
comparisons between siblings at each level.  Userspace influence on that 
comparison would thus also need to be done at each level.  It's a very 
powerful combination in practice.

Thanks.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-22 Thread David Rientjes
On Thu, 21 Sep 2017, Johannes Weiner wrote:

> > The issue is that if you opt-in to the new feature, then you are forced to 
> > change /proc/pid/oom_score_adj of all processes attached to a cgroup that 
> > you do not want oom killed based on size to be oom disabled.
> 
> You're assuming that most people would want to influence the oom
> behavior in the first place. I think the opposite is the case: most
> people don't care as long as the OOM killer takes the intent the user
> has expressed wrt runtime containerization/grouping into account.
> 

If you do not want to influence the oom behavior, do not change 
memory.oom_priority from its default.  It's that simple.

> > The kernel provides no other remedy without oom priorities since the
> > new feature would otherwise disregard oom_score_adj.
> 
> As of v8, it respects this setting and doesn't kill min score tasks.
> 

That's the issue.  To protect a memory cgroup from being oom killed in a 
system oom condition, you need to change oom_score_adj of *all* processes 
attached to be oom disabled.  Then, you have a huge problem in memory 
cgroup oom conditions because nothing can be killed in that hierarchy 
itself.

> > The patchset compares memory cgroup size relative to sibling cgroups only, 
> > the same comparison for memory.oom_priority.  There is a guarantee 
> > provided on how cgroup size is compared in select_victim_memcg(), it 
> > hierarchically accumulates the "size" from leaf nodes up to the root memcg 
> > and then iterates the tree comparing sizes between sibling cgroups to 
> > choose a victim memcg.  That algorithm could be more elaborately described 
> > in the documentation, but we simply cannot change the implementation of 
> > select_victim_memcg() later even without oom priorities since users cannot 
> > get inconsistent results after opting into a feature between kernel 
> > versions.  I believe the selection criteria should be implemented to be 
> > deterministic, as select_victim_memcg() does, and the documentation should 
> > fully describe what the selection criteria is, and then allow the user to 
> > decide.
> 
> I wholeheartedly disagree. We have changed the behavior multiple times
> in the past. In fact, you have arguably done the most drastic changes
> to the algorithm since the OOM killer was first introduced. E.g.
> 
>   a63d83f427fb oom: badness heuristic rewrite
> 
> And that's completely fine. Because this thing is not a resource
> management tool for userspace, it's the kernel saving itself. At best
> in a manner that's not too surprising to userspace.
> 

When I did that, I had to add /proc/pid/oom_score_adj to allow userspace 
to influence selection.  We came up with /proc/pid/oom_score_adj when 
working with kde, openssh, chromium, and udev because they cared about the 
ability to influence the decisionmaking.  I'm perfectly happy with the new 
heuristic presented in this patchset, I simply want userspace to be able 
to influence it, if it desires.  Requiring userspace to set all processes 
to be oom disabled to protect a hierarchy is totally and completely 
broken.  It livelocks the memory cgroup if it is oom itself.

> To me, your argument behind the NAK still boils down to "this doesn't
> support my highly specialized usecase." But since it doesn't prohibit
> your usecase - which isn't even supported upstream, btw - this really
> doesn't carry much weight.
> 
> I'd say if you want configurability on top of Roman's code, please
> submit patches and push the case for these in a separate effort.
> 

Roman implemented memory.oom_priority himself, it has my Tested-by, and it 
allows users who want to protect high priority memory cgroups from using 
the size based comparison for all other cgroups that we very much desire.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-22 Thread David Rientjes
On Fri, 22 Sep 2017, Tejun Heo wrote:

> > It doesn't have anything to do with my particular usecase, but rather the 
> > ability of userspace to influence the decisions of the kernel.  Previous 
> > to this patchset, when selection is done based on process size, userspace 
> > has full control over selection.  After this patchset, userspace has no 
> > control other than setting all processes to be oom disabled if the largest 
> > memory consumer is to be protected.  Roman's memory.oom_priority provides 
> > a perfect solution for userspace to be able to influence this decision 
> > making and causes no change in behavior for users who choose not to tune 
> > memory.oom_priority.  The nack originates from the general need for 
> > userspace influence over oom victim selection and to avoid userspace 
> > needing to take the rather drastic measure of setting all processes to be 
> > oom disabled to prevent oom kill in kernels before oom priorities are 
> > introduced.
> 
> Overall, I think that OOM killing is the wrong place to implement
> sophisticated intelligence in.  It's too late to be smart - the
> workload already has suffered significantly and there's only very
> limited amount of computing which can be performed.  That said, if
> there's a useful and general enough mechanism to configure OOM killer
> behavior from userland, that can definitely be useful.
> 

What is under discussion is a new way to compare sibling cgroups when 
selecting a victim for oom kill.  It's a new heuristic based on a 
characteristic of the memory cgroup rather than the individual process.  
We want this behavior that the patchset implements.  The only desire is a 
way for userspace to influence that decision making in the same way that 
/proc/pid/oom_score_adj allows userspace to influence the current 
heuristic.

The current heuristic based on processes is coupled with per-process 
/proc/pid/oom_score_adj.  The proposed heuristic has no ability to be 
influenced by userspace, and it needs one.  
The proposed heuristic based on memory cgroups coupled with Roman's 
per-memcg memory.oom_priority is appropriate and needed.  It is not 
"sophisticated intelligence," it merely allows userspace to protect vital 
memory cgroups when opting into the new features (cgroups compared based 
on size and memory.oom_group) that we very much want.

> We even change the whole scheduling behaviors and try really hard to
> not get locked into specific implementation details which exclude
> future improvements.  Guaranteeing OOM killing selection would be
> crazy.  Why would we prevent ourselves from doing things better in the
> future?  We aren't talking about the semantics of read(2) here.  This
> is a kernel emergency mechanism to avoid deadlock at the last moment.
> 

We merely want to prefer other memory cgroups are oom killed on system oom 
conditions before important ones, regardless of whether the important one is using 
more memory than the others because of the new heuristic this patchset 
introduces.  This is exactly the same as /proc/pid/oom_score_adj for the 
current heuristic.

> Here's a really simple use case.  Imagine a system which hosts two
> containers of services and one is somewhat favored over the other and
> wants to set up cgroup hierarchy so that resources are split at the
> top level between the two containers.  oom_priority is set accordingly
> too.  Let's say a low priority maintenance job in higher priority
> container goes berserk, as they often do, and pushing the system into
> OOM.
> 
> With the proposed static oom_priority mechanism, the only
> configuration which can be expressed is "kill all of the lower top
> level subtree before any of the higher one", which is a silly
> restriction leading to silly behavior and a direct result of
> conflating resource distribution network with level-by-level OOM
> killing decsion.
> 

The problem you're describing is an issue with the top-level limits after 
this patchset is merged, not memory.oom_priority at all.

If they are truly split evenly, this patchset kills the largest process 
from the hierarchy with the most charged memory.  That's unchanged if the 
two priorities are equal.  By changing the priority to be more preferred 
for a hierarchy, you indeed prefer oom kills from the lower priority 
hierarchy.  You've opted in.  One hierarchy is more important than the 
other, regardless of any hypothetical low priority maintenance job going 
berserk.

If you have this low priority maintenance job charging memory to the high 
priority hierarchy, you're already misconfigured unless you adjust 
/proc/pid/oom_score_adj because it will oom kill any larger process than 
itself in today's kernels anyway.

A better configuration would be attach this hypothetical low priority 
maintenance job to its own sibling cgroup with its own memory limit to 
avoid exactly that problem: it going berserk and charging too much memory 
to the high priority container that results in one of its processes 

Re: [v8 0/4] cgroup-aware OOM killer

2017-09-21 Thread David Rientjes
On Thu, 21 Sep 2017, Johannes Weiner wrote:

> That's a ridiculous nak.
> 
> The fact that this patch series doesn't solve your particular problem
> is not a technical argument to *reject* somebody else's work to solve
> a different problem. It's not a regression when behavior is completely
> unchanged unless you explicitly opt into a new functionality.
> 
> So let's stay reasonable here.
> 

The issue is that if you opt-in to the new feature, then you are forced to 
set /proc/pid/oom_score_adj to oom disabled for all processes attached to a 
cgroup that you do not want oom killed based on size.  The kernel 
provides no other remedy without oom priorities since the new feature 
would otherwise disregard oom_score_adj.  In that case, userspace is 
racing in two ways: (1) attach of process to a memcg you want to protect 
from oom kill (first class, vital, large memory hog job) to set to oom 
disable and (2) adjustment of other cgroups to make them eligible after 
first oom kill.

It doesn't have anything to do with my particular usecase, but rather the 
ability of userspace to influence the decisions of the kernel.  Previous 
to this patchset, when selection is done based on process size, userspace 
has full control over selection.  After this patchset, userspace has no 
control other than setting all processes to be oom disabled if the largest 
memory consumer is to be protected.  Roman's memory.oom_priority provides 
a perfect solution for userspace to be able to influence this decision 
making and causes no change in behavior for users who choose not to tune 
memory.oom_priority.  The nack originates from the general need for 
userspace influence over oom victim selection and to avoid userspace 
needing to take the rather drastic measure of setting all processes to be 
oom disabled to prevent oom kill in kernels before oom priorities are 
introduced.

> The patch series has merit as it currently stands. It makes OOM
> killing in a cgrouped system fairer and less surprising. Whether you
> have the ability to influence this in a new way is an entirely
> separate discussion. It's one that involves ABI and user guarantees.
> 
> Right now Roman's patches make no guarantees on how the cgroup tree is
> descended. But once we define an interface for prioritization, it
> locks the victim algorithm into place to a certain extent.
> 

The patchset compares memory cgroup size relative to sibling cgroups only, 
the same comparison for memory.oom_priority.  There is a guarantee 
provided on how cgroup size is compared in select_victim_memcg(), it 
hierarchically accumulates the "size" from leaf nodes up to the root memcg 
and then iterates the tree comparing sizes between sibling cgroups to 
choose a victim memcg.  That algorithm could be more elaborately described 
in the documentation, but we simply cannot change the implementation of 
select_victim_memcg() later even without oom priorities since users cannot 
get inconsistent results after opting into a feature between kernel 
versions.  I believe the selection criteria should be implemented to be 
deterministic, as select_victim_memcg() does, and the documentation should 
fully describe what the selection criteria is, and then allow the user to 
decide.
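
A rough sketch of that accumulate-and-descend selection, only to illustrate
the shape of the algorithm being described; memcg_subtree_usage() and
for_each_memcg_child() are stand-ins here, not functions from the patchset:

static struct mem_cgroup *select_victim_memcg_sketch(struct mem_cgroup *root)
{
        struct mem_cgroup *victim = root;

        /* at each level, descend into the sibling with the largest
         * cumulative usage of its subtree */
        while (memcg_has_children(victim)) {
                struct mem_cgroup *child, *largest = NULL;
                unsigned long largest_size = 0;

                for_each_memcg_child(child, victim) {
                        unsigned long size = memcg_subtree_usage(child);

                        if (size > largest_size) {
                                largest_size = size;
                                largest = child;
                        }
                }

                if (!largest)
                        break;
                victim = largest;
        }

        return victim;
}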

> It also involves a discussion about how much control userspace should
> have over OOM killing in the first place. It's a last-minute effort to
> save the kernel from deadlocking on memory. Whether that is the time
> and place to have userspace make clever resource management decisions
> is an entirely different thing than what Roman is doing.
> 
> But this patch series doesn't prevent any such future discussion and
> implementations, and it's not useless without it. So let's not
> conflate these two things, and hold the priority patch for now.
> 

Roman is planning on introducing memory.oom_priority back into the 
patchset per https://marc.info/?l=linux-kernel=150574701126877 and I 
agree with the very clear semantic that it introduces: to have the 
size-based comparison use the same rules as the userspace priority 
comparison.  It's very powerful and I'm happy to ack the final version 
that he plans on posting.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-21 Thread David Rientjes
On Mon, 18 Sep 2017, Roman Gushchin wrote:

> > As said in other email. We can make priorities hierarchical (in the same
> > sense as hard limit or others) so that children cannot override their
> > parent.
> 
> You mean they can set the knob to any value, but parent's value is enforced,
> if it's greater than child's value?
> 
> If so, this sounds logical to me. Then we have size-based comparison and
> priority-based comparison with similar rules, and all use cases are covered.
> 
> Ok, can we stick with this design?
> Then I'll return oom_priorities in place, and post a (hopefully) final 
> version.
> 

I just want to make sure that we are going with your original 
implementation here: that oom_priority is only effective for compare 
sibling memory cgroups and nothing beyond that.  The value alone has no 
relationship to any ancestor.  We can't set oom_priority based on the 
priorities of any other memory cgroups other than our own siblings because 
we have no control over how those change.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-21 Thread David Rientjes
On Wed, 20 Sep 2017, Roman Gushchin wrote:

> > It's actually much more complex because in our environment we'd need an 
> > "activity manager" with CAP_SYS_RESOURCE to control oom priorities of user 
> > subcontainers when today it need only be concerned with top-level memory 
> > cgroups.  Users can create their own hierarchies with their own oom 
> > priorities at will, it doesn't alter the selection heuristic for another 
> > other user running on the same system and gives them full control over the 
> > selection in their own subtree.  We shouldn't need to have a system-wide 
> > daemon with CAP_SYS_RESOURCE be required to manage subcontainers when 
> > nothing else requires it.  I believe it's also much easier to document: 
> > oom_priority is considered for all sibling cgroups at each level of the 
> > hierarchy and the cgroup with the lowest priority value gets iterated.
> 
> I do agree actually. System-wide OOM priorities make no sense.
> 
> Always compare sibling cgroups, either by priority or size, seems to be
> simple, clear and powerful enough for all reasonable use cases. Am I right,
> that it's exactly what you've used internally? This is a perfect confirmation,
> I believe.
> 

We've used it for at least four years, I added my Tested-by to your patch, 
we would convert to your implementation if it is merged upstream, and I 
would enthusiastically support your patch if you would integrate it back 
into your series.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-19 Thread David Rientjes
On Fri, 15 Sep 2017, Roman Gushchin wrote:

> > > > But then you just enforce a structural restriction on your configuration
> > > > because
> > > >         root
> > > >        /    \
> > > >       A      D
> > > >      / \
> > > >     B   C
> > > > 
> > > > is a different thing than
> > > >         root
> > > >        / | \
> > > >       B  C  D
> > > >
> > > 
> > > I actually don't have a strong argument against an approach to select
> > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > no much difference.
> > > 
> > > The only real concern I have is that then we have to do the same with
> > > oom_priorities (select largest priority tree-wide), and this will limit
> > > an ability to enforce the priority by parent cgroup.
> > > 
> > 
> > Yes, oom_priority cannot select the largest priority tree-wide for exactly 
> > that reason.  We need the ability to control from which subtree the kill 
> > occurs in ancestor cgroups.  If multiple jobs are allocated their own 
> > cgroups and they can own memory.oom_priority for their own subcontainers, 
> > this becomes quite powerful so they can define their own oom priorities.   
> > Otherwise, they can easily override the oom priorities of other cgroups.
> 
> I believe, it's a solvable problem: we can require CAP_SYS_RESOURCE to set
> the oom_priority below parent's value, or something like this.
> 
> But it looks more complex, and I'm not sure there are real examples,
> when we have to compare memcgs, which are on different levels
> (or in different subtrees).
> 

It's actually much more complex because in our environment we'd need an 
"activity manager" with CAP_SYS_RESOURCE to control oom priorities of user 
subcontainers when today it need only be concerned with top-level memory 
cgroups.  Users can create their own hierarchies with their own oom 
priorities at will; it doesn't alter the selection heuristic for any 
other user running on the same system and gives them full control over the 
selection in their own subtree.  We shouldn't need to have a system-wide 
daemon with CAP_SYS_RESOURCE be required to manage subcontainers when 
nothing else requires it.  I believe it's also much easier to document: 
oom_priority is considered for all sibling cgroups at each level of the 
hierarchy and the cgroup with the lowest priority value gets iterated.
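
As a sketch of the per-level comparison being described (the fields here are
assumptions for illustration, following the convention just stated that the
lowest oom_priority value among siblings is the one iterated into):

static bool memcg_prefer_sibling(struct mem_cgroup *a, unsigned long size_a,
                                 struct mem_cgroup *b, unsigned long size_b)
{
        /* an explicit priority difference decides first */
        if (a->oom_priority != b->oom_priority)
                return a->oom_priority < b->oom_priority;

        /* equal priorities fall back to cumulative subtree size */
        return size_a > size_b;
}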


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-15 Thread David Rientjes
On Fri, 15 Sep 2017, Roman Gushchin wrote:

> > But then you just enforce a structural restriction on your configuration
> > because
> >         root
> >        /    \
> >       A      D
> >      / \
> >     B   C
> > 
> > is a different thing than
> >         root
> >        / | \
> >       B  C  D
> >
> 
> I actually don't have a strong argument against an approach to select
> largest leaf or kill-all-set memcg. I think, in practice there will be
> no much difference.
> 
> The only real concern I have is that then we have to do the same with
> oom_priorities (select largest priority tree-wide), and this will limit
> an ability to enforce the priority by parent cgroup.
> 

Yes, oom_priority cannot select the largest priority tree-wide for exactly 
that reason.  We need the ability to control from which subtree the kill 
occurs in ancestor cgroups.  If multiple jobs are allocated their own 
cgroups and they can own memory.oom_priority for their own subcontainers, 
this becomes quite powerful so they can define their own oom priorities.   
Otherwise, they can easily override the oom priorities of other cgroups.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-14 Thread David Rientjes
On Thu, 14 Sep 2017, Michal Hocko wrote:

> > It is certainly possible to add oom priorities on top before it is merged, 
> > but I don't see why it isn't part of the patchset.
> 
> Because the semantic of the priority for non-leaf memcgs is not fully
> clear and I would rather have the core of the functionality merged
> before this is sorted out.
> 

We can't merge the core of the feature before this is sorted out because 
then users start to depend on behavior and we must be backwards 
compatible.  We need a full patchset that introduces the new selection 
heuristic and a way for userspace to control it to either bias or prefer 
one cgroup over another.  The kill-all mechanism is a more orthogonal 
feature for the cgroup-aware oom killer than oom priorities.

I have a usecase for both the cgroup-aware oom killer and its oom 
priorities from previous versions of this patchset, I assume that Roman 
does as well, and would like to see it merged because there are real-world 
usecases for it rather than hypothetical usecases that would want to do 
something different.

> > We need it before its 
> > merged to avoid users playing with /proc/pid/oom_score_adj to prevent any 
> > killing in the most preferable memcg when they could have simply changed 
> > the oom priority.
> 
> I am sorry but I do not really understand your concern. Are you
> suggesting that users would start oom disable all tasks in a memcg to
> give it a higher priority? Even if that was the case why should such an
> abuse be a blocker for generic memcg aware oom killer being merged?

If users do not have any way to control victim selection because of a 
shortcoming in the kernel implementation, they will be required to oom 
disable processes and let that be inherited by children they fork in the 
memcg hierarchy to protect cgroups that they do not want to be oom killed, 
regardless of their size.  They simply are left with no other alternative 
if they want to use the cgroup-aware oom killer and/or the kill-all 
mechanism.


Re: [v8 2/4] mm, oom: cgroup-aware OOM killer

2017-09-13 Thread David Rientjes
On Mon, 11 Sep 2017, Roman Gushchin wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 15af3da5af02..da2b12ea4667 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2661,6 +2661,231 @@ static inline bool memcg_has_children(struct 
> mem_cgroup *memcg)
>   return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +   const nodemask_t *nodemask,
> +   unsigned long totalpages)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + /*
> +  * We don't have necessary stats for the root memcg,
> +  * so we define it's oom_score as the maximum oom_score
> +  * of the belonging tasks.
> +  */
> + if (memcg == root_mem_cgroup) {
> + struct css_task_iter it;
> + struct task_struct *task;
> + long score, max_score = 0;
> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + score = oom_badness(task, memcg, nodemask,
> + totalpages);
> + if (max_score > score)

score > max_score

> + max_score = score;
> + }
> + css_task_iter_end(&it);
> +
> + return max_score;
> + }
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> + pgdat = NODE_DATA(nid);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-13 Thread David Rientjes
On Wed, 13 Sep 2017, Michal Hocko wrote:

> > > This patchset makes the OOM killer cgroup-aware.
> > > 
> > > v8:
> > >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> > >   - Make the whole thing opt-in with cgroup mount option control
> > >   - Drop oom_priority for further discussions
> > 
> > Nack, we specifically require oom_priority for this to function correctly, 
> > otherwise we cannot prefer to kill from low priority leaf memcgs as 
> > required.
> 
> While I understand that your usecase might require priorities I do not
> think this part missing is a reason to nack the cgroup based selection
> and kill-all parts. This can be done on top. The only important part
> right now is the current selection semantic - only leaf memcgs vs. size
> of the hierarchy). I strongly believe that comparing only leaf memcgs
> is more straightforward and it doesn't lead to unexpected results as
> mentioned before (kill a small memcg which is a part of the larger
> sub-hierarchy).
> 

The problem is that we cannot enable the cgroup-aware oom killer and 
oom_group behavior because, without oom priorities, we have no ability to 
influence the cgroup that it chooses.  It is doing two things: providing 
more fairness amongst cgroups by selecting based on cumulative usage 
rather than single large process (good!), and effectively is removing all 
userspace control of oom selection (bad).  We want the former, but it 
needs to be coupled with support so that we can protect vital cgroups, 
regardless of their usage.

It is certainly possible to add oom priorities on top before it is merged, 
but I don't see why it isn't part of the patchset.  We need it before it's 
merged to avoid users playing with /proc/pid/oom_score_adj to prevent any 
killing in the most preferable memcg when they could have simply changed 
the oom priority.


Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-09-12 Thread David Rientjes
On Tue, 12 Sep 2017, Roman Gushchin wrote:

> > I can't imagine that Tejun would be happy with a new mount option, 
> > especially when it's not required.
> > 
> > OOM behavior does not need to be defined at mount time and for the entire 
> > hierarchy.  It's possible to very easily implement a tunable as part of 
> > mem cgroup that is propagated to descendants and controls the oom scoring 
> > behavior for that hierarchy.  It does not need to be system wide and 
> > affect scoring of all processes based on which mem cgroup they are 
> > attached to at any given time.
> 
> No, I don't think that mixing per-cgroup and per-process OOM selection
> algorithms is a good idea.
> 
> So, there are 3 reasonable options:
> 1) boot option
> 2) sysctl
> 3) cgroup mount option
> 
> I believe, 3) is better, because it allows changing the behavior dynamically,
> and explicitly depends on v2 (what sysctl lacks).
> 
> So, the only question is should it be opt-in or opt-out option.
> Personally, I would prefer opt-out, but Michal has a very strong opinion here.
> 

If it absolutely must be a mount option, then I would agree it should be 
opt-in so that it's known what is being changed rather than changing how 
selection was done in the past and requiring legacy users to now mount in 
a new way.

I'd be interested to hear Tejun's comments, however, about whether we want 
to add controller specific mount options like this instead of a tunable at 
the root level, for instance, that controls victim selection and would be 
isolated to the memory cgroup controller as opposed to polluting mount 
options.


Re: [v8 1/4] mm, oom: refactor the oom_kill_process() function

2017-09-11 Thread David Rientjes
On Mon, 11 Sep 2017, Roman Gushchin wrote:

> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> an intention to re-use the the second half: __oom_kill_process().
> 
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for the each task, as well as play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: David Rientjes <rient...@google.com>

https://marc.info/?l=linux-kernel&m=150274805412752


Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-09-11 Thread David Rientjes
On Mon, 11 Sep 2017, Roman Gushchin wrote:

> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
> 
> The behavior can be changed dynamically by remounting the cgroupfs.

I can't imagine that Tejun would be happy with a new mount option, 
especially when it's not required.

OOM behavior does not need to be defined at mount time and for the entire 
hierarchy.  It's possible to very easily implement a tunable as part of 
mem cgroup that is propagated to descendants and controls the oom scoring 
behavior for that hierarchy.  It does not need to be system wide and 
affect scoring of all processes based on which mem cgroup they are 
attached to at any given time.


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-11 Thread David Rientjes
On Mon, 11 Sep 2017, Roman Gushchin wrote:

> This patchset makes the OOM killer cgroup-aware.
> 
> v8:
>   - Do not kill tasks with OOM_SCORE_ADJ -1000
>   - Make the whole thing opt-in with cgroup mount option control
>   - Drop oom_priority for further discussions

Nack, we specifically require oom_priority for this to function correctly, 
otherwise we cannot prefer to kill from low priority leaf memcgs as 
required.  v8 appears to implement new functionality that we want, to 
compare two memcgs based on usage, but without the ability to influence 
that decision to protect important userspace, so now I'm in a position 
where (1) nothing has changed if I don't use the new mount option or (2) I 
get completely different oom kill selection with the new mount option but 
not the ability to influence it.  I was much happier with the direction 
that v7 was taking, but since v8 causes us to regress without the ability 
to change memcg priority, this has to be nacked.

>   - Kill the whole cgroup if oom_group is set and it's
> memory.max is reached
>   - Update docs and commit messages


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-09 Thread David Rientjes
On Fri, 8 Sep 2017, Christopher Lameter wrote:

> Ok. Certainly there were scalability issues (lots of them) and the sysctl
> may have helped there if set globally. But the ability to kill the
> allocating tasks was primarily used in cpusets for constrained allocation.
> 

I remember discussing it with him and he had some data with pretty extreme 
numbers for how long the tasklist iteration was taking.  Regardless, I 
agree it's not pertinent to the discussion if anybody is actively using 
the sysctl, just fun to try to remember the discussions from 10 years ago.  

The problem I'm having with the removal, though, is that the kernel source 
actually uses it itself in tools/testing/fault-injection/failcmd.sh.  
That, to me, suggests there are people outside the kernel source that 
probably also use it.  We use it as part of our unit testing, although we 
could convert away from it.

These are things that can probably be worked around, but I'm struggling to 
see the whole benefit of it.  It's only defined, there's generic sysctl 
handling, and there's a single conditional in the oom killer.  I wouldn't 
risk the potential userspace breakage.
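
For reference, the conditional in question is essentially the following
(paraphrased from memory rather than quoted verbatim from mm/oom_kill.c,
so treat it as a sketch):

	if (sysctl_oom_kill_allocating_task && current->mm &&
	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
		/* short-circuit: kill current, skip the tasklist scan */
		oc->chosen = current;
		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
		return true;
	}

so the cost of keeping it around is a handful of lines.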

> The issue of scaling is irrelevant in the context of deciding what to do
> about the sysctl. You can address the issue differently if it still
> exists. The systems with super high NUMA nodes (hundreds to a
> thousand) have somehow fallen out of fashion a bit. So I doubt that this
> is still an issue. And no one of the old stakeholders is speaking up.
> 
> What is the current approach for an OOM occuring in a cpuset or cgroup
> with a restricted numa node set?
> 

It's always been shaky, we simply exclude potential kill victims based on 
whether or not they share mempolicy nodes or cpuset mems with the 
allocating process.  Of course, this could result in no memory freeing 
because a potential victim being allowed to allocate on a particular node 
right now doesn't mean killing it will free memory on that node.  It's 
just more probable in practice.  Nobody has complained about that 
methodology, but we do have internal code that simply kills current for 
mempolicy ooms.  That is because we have priority based oom killing much 
like this patchset implements and then extends it even further to 
processes.
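
To make the exclusion described above concrete: it is conceptually just a
nodemask intersection.  A minimal userspace sketch, with toy types standing
in for the kernel's nodemask_t and cpuset/mempolicy helpers:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* toy stand-in for the kernel's nodemask handling */
typedef uint64_t nodemask_t;

struct task {
        const char *comm;
        nodemask_t mems_allowed;        /* cpuset mems or mempolicy nodes */
};

/*
 * A task is only eligible as a victim for a constrained oom if it may
 * allocate from at least one node the allocating task is restricted to;
 * killing it is then likely, but not guaranteed, to free memory on a
 * usable node.
 */
static bool oom_eligible(const struct task *victim,
                         const struct task *allocating)
{
        return (victim->mems_allowed & allocating->mems_allowed) != 0;
}

int main(void)
{
        struct task alloc = { "alloc", 0x3 };   /* nodes 0-1 */
        struct task a     = { "a",     0x2 };   /* node 1: eligible */
        struct task b     = { "b",     0xc };   /* nodes 2-3: excluded */

        printf("%s eligible: %d\n", a.comm, oom_eligible(&a, &alloc));
        printf("%s eligible: %d\n", b.comm, oom_eligible(&b, &alloc));
        return 0;
}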


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-07 Thread David Rientjes
On Thu, 7 Sep 2017, Christopher Lameter wrote:

> > SGI required it when it was introduced simply to avoid the very expensive
> > tasklist scan.  Adding Christoph Lameter to the cc since he was involved
> > back then.
> 
> Really? From what I know and worked on way back when: The reason was to be
> able to contain the affected application in a cpuset. Multiple apps may
> have been running in multiple cpusets on a large NUMA machine and the OOM
> condition in one cpuset should not affect the other. It also helped to
> isolate the application behavior causing the oom in numerous cases.
> 
> Doesnt this requirement transfer to cgroups in the same way?
> 
> Left SGI in 2008 so adding Dimitri who may know about the current
> situation. Robin Holt also left SGI as far as I know.
> 

It may have been Paul Jackson, but I remember the oom_kill_allocating_task 
knob being required due to very slow oom killer due to the very lengthy 
iteration of the tasklist.  It would be helpful if someone from SGI could 
confirm whether or not they actively use this sysctl.


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-06 Thread David Rientjes
On Wed, 6 Sep 2017, Roman Gushchin wrote:

> From f6e2339926a07500834d86548f3f116af7335d71 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <g...@fb.com>
> Date: Wed, 6 Sep 2017 17:43:44 +0100
> Subject: [PATCH] mm, oom: first step towards oom_kill_allocating_task
>  deprecation
> 
> The oom_kill_allocating_task sysctl which causes the OOM killer
> to simple kill the allocating task is useless. Killing the random
> task is not the best idea.
> 
> Nobody likes it, and hopefully nobody uses it.
> We want to completely deprecate it at some point.
> 

SGI required it when it was introduced simply to avoid the very expensive 
tasklist scan.  Adding Christoph Lameter to the cc since he was involved 
back then.

I attempted to deprecate the old /proc/pid/oom_adj in this same manner; we 
warned about it for over a year and then finally removed it, one person 
complained of breakage, and it was reverted with a strict policy that 
Linux doesn't break userspace.

Although it would be good to do, I'm not sure that this is possible unless 
it can be shown nobody is using it.  Talking to SGI would be the first 
step.

I'm not sure what this has to do with the overall patchset though :)

> To make a first step towards deprecation, let's warn potential
> users about deprecation plans.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: David Rientjes <rient...@google.com>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  kernel/sysctl.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 655686d546cb..9158f1980584 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -220,6 +220,17 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
> int write,
>  
>  #endif
>  
> +static int proc_oom_kill_allocating_tasks(struct ctl_table *table, int write,
> +void __user *buffer, size_t *lenp,
> +loff_t *ppos)
> +{
> + pr_warn_once("The oom_kill_allocating_task sysctl will be deprecated.\n"
> +  "If you're using it, please, report to "
> +  "linux...@kvack.kernel.org.\n");
> +
> + return proc_dointvec(table, write, buffer, lenp, ppos);
> +}
> +
>  static struct ctl_table kern_table[];
>  static struct ctl_table vm_table[];
>  static struct ctl_table fs_table[];
> @@ -1235,7 +1246,7 @@ static struct ctl_table vm_table[] = {
>   .data   = &sysctl_oom_kill_allocating_task,
>   .maxlen = sizeof(sysctl_oom_kill_allocating_task),
>   .mode   = 0644,
> - .proc_handler   = proc_dointvec,
> + .proc_handler   = proc_oom_kill_allocating_tasks,
>   },
>   {
>   .procname   = "oom_dump_tasks",


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-31 Thread David Rientjes
On Thu, 31 Aug 2017, Roman Gushchin wrote:

> So, it looks to me that we're close to an acceptable version,
> and the only remaining question is the default behavior
> (when oom_group is not set).
> 

Nit: without knowledge of the implementation, I still don't think I would 
know what an "out of memory group" is.  Out of memory doesn't necessarily 
imply a kill.  I suggest oom_kill_all or something that includes the verb.

> Michal suggests to ignore non-oom_group memcgs, and compare tasks with
> memcgs with oom_group set. This makes the whole thing completely opt-in,
> but then we probably need another knob (or value) to select between
> "select memcg, kill biggest task" and "select memcg, kill all tasks".

It seems like that would either bias toward or bias against cgroups that 
opt-in.  I suggest comparing memory cgroups at each level in the hierarchy 
based on your new badness heuristic, regardless of any tunables it has 
enabled.  Then kill either the largest process or all the processes 
attached depending on oom_group or oom_kill_all.

> Also, as the whole thing is based on comparison between processes and
> memcgs, we probably need oom_priority for processes.

I think with the constraints of cgroup v2 that a victim memcg must first 
be chosen, and then a victim process attached to that memcg must be chosen 
or all eligible processes attached to it be killed, depending on the 
tunable.

The simplest and most clear way to define this, in my opinion, is to 
implement a heuristic that compares sibling memcgs based on usage, as you 
have done.  This can be overridden by a memory.oom_priority that userspace 
defines, and is enough support for them to change victim selection (no 
mount option needed, just set memory.oom_priority).  Then kill the largest 
process or all eligible processes attached.  We only use per-process 
priority to override process selection compared to sibling memcgs, but 
with cgroup v2 process constraints it doesn't seem like that is within the 
scope of your patchset.
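
Roughly what I have in mind for the per-level comparison, as a standalone
sketch with toy types and illustrative names (following the patchset's
semantics, where a larger memory.oom_priority is preferred as the victim):

#include <stddef.h>
#include <stdio.h>

struct memcg {
        const char *name;
        int oom_priority;       /* userspace override, default 0 */
        unsigned long usage;    /* result of the badness heuristic */
        int oom_kill_all;       /* kill every attached task? */
};

/* pick the victim among the sibling memcgs at one level of the hierarchy */
static struct memcg *select_victim(struct memcg **siblings, size_t n)
{
        struct memcg *victim = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
                struct memcg *c = siblings[i];

                if (!victim ||
                    c->oom_priority > victim->oom_priority ||
                    (c->oom_priority == victim->oom_priority &&
                     c->usage > victim->usage))
                        victim = c;
        }
        return victim;
}

int main(void)
{
        struct memcg a = { "a", 0, 500, 0 };
        struct memcg b = { "b", 1, 100, 1 };    /* higher priority: preferred */
        struct memcg *level[] = { &a, &b };
        struct memcg *v = select_victim(level, 2);

        printf("victim: %s (%s)\n", v->name,
               v->oom_kill_all ? "kill all attached tasks" : "kill largest task");
        return 0;
}

The priority check runs first and the usage comparison only breaks ties,
which is exactly the "no mount option needed, just set memory.oom_priority"
behavior described above.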


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-30 Thread David Rientjes
On Wed, 30 Aug 2017, Roman Gushchin wrote:

> I've spent some time to implement such a version.
> 
> It really became shorter and more existing code were reused,
> howewer I've met a couple of serious issues:
> 
> 1) Simple summing of per-task oom_score doesn't make sense.
>First, we calculate oom_score per-task, while should sum per-process 
> values,
>or, better, per-mm struct. We can take only threa-group leader's score
>into account, but it's also not 100% accurate.
>And, again, we have a question what to do with per-task oom_score_adj,
>if we don't task the task's oom_score into account.
> 
>Using memcg stats still looks to me as a more accurate and consistent
>way of estimating memcg memory footprint.
> 

The patchset is introducing a new methodology for selecting oom victims so 
you can define how cgroups are compared vs other cgroups with your own 
"badness" calculation.  I think your implementation based heavily on anon 
and unevictable lrus and unreclaimable slab is fine and you can describe 
that detail in the documentation (along with the caveat that it is only 
calculated for nodes in the allocation's mempolicy).  With 
memory.oom_priority, the user has full ability to change that selection.  
Process selection heuristics have changed over time themselves, it's not 
something that must be backwards compatible and trying to sum the usage 
from each of the cgroup's mm_struct's and respect oom_score_adj is 
unnecessarily complex.


Re: [v6 3/4] mm, oom: introduce oom_priority for memory cgroups

2017-08-28 Thread David Rientjes
On Thu, 24 Aug 2017, Roman Gushchin wrote:

> > > Do you have an example, which can't be effectively handled by an approach
> > > I'm suggesting?
> > 
> > No, I do not have any which would be _explicitly_ requested but I do
> > envision new requirements will emerge. The most probable one would be
> > kill the youngest container because that would imply the least amount of
> > work wasted.
> 
> I agree, this a nice feature. It can be implemented in userspace
> by setting oom_priority.
> 

Yes, the "kill the newest memory cgroup as a tiebreak" is not strictly 
required in the kernel and no cgroup should depend on this implementation 
detail to avoid being killed if it shares the same memory.oom_priority as 
another cgroup.  As you mention, it can be effectively implemented by 
userspace itself.


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-23 Thread David Rientjes
On Wed, 23 Aug 2017, Roman Gushchin wrote:

> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> 3) Per-process oom_score_adj affects global OOM, so it's a breache
> in the isolation.
> 
> To address these issues, cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing corresponding task(s). The difference
> the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
> 
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.
> 
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer treat
> the whole cgroup as an indivisible memory consumer. In case if it's
> selected as on OOM victim, all belonging tasks will be killed.
> 

I'm very happy with the rest of the patchset, but I feel that I must renew 
my objection to memory.oom_kill_all_tasks being able to override the 
admin's setting of a process as oom disabled.  From my perspective, 
setting memory.oom_kill_all_tasks with an oom disabled process attached, 
so that it now becomes killable, either (1) overrides the CAP_SYS_RESOURCE 
oom disabled setting or (2) is lazy and doesn't modify 
/proc/pid/oom_score_adj itself.

I'm not sure what is objectionable about allowing 
memory.oom_kill_all_tasks to coexist with oom disabled processes.  Just 
kill everything else so that the oom disabled process can report the oom 
condition after notification, restart the task, etc.  If it's problematic, 
then whomever is declaring everything must be killed shall also modify 
/proc/pid/oom_score_adj of oom disabled processes.  If it doesn't have 
permission to change that, then I think there's a much larger concern.

> Tasks in the root cgroup are treated as independent memory consumers,
> and are compared with other memory consumers (e.g. leaf cgroups).
> The root cgroup doesn't support the oom_kill_all_tasks feature.


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-23 Thread David Rientjes
On Wed, 23 Aug 2017, Roman Gushchin wrote:

> > It's better to have newbies consult the documentation once than making
> > everybody deal with long and cumbersome names for the rest of time.
> > 
> > Like 'ls' being better than 'read_and_print_directory_contents'.
> 
> I don't think it's a good argument here: realistically, nobody will type
> the knob's name often. Your option is shorter only by 3 characters :)
> 
> Anyway, I'm ok with memory.oom_group too, if everybody else prefer it.
> Michal, David?
> What's your opinion?
> 

I'm probably the worst person in the world for succinctly naming stuff, 
but I at least think the knob should have the word "kill" in it to 
describe the behavior.  ("oom_group", out of memory group, what exactly is 
that?)


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-20 Thread David Rientjes
On Wed, 16 Aug 2017, Roman Gushchin wrote:

> It's natural to expect that inside a container there are their own sshd,
> "activity manager" or some other stuff, which can play with oom_score_adj.
> If it can override the upper cgroup-level settings, the whole delegation model
> is broken.
> 

I don't think any delegation model related to core cgroups or memory 
cgroup is broken, I think it's based on how memory.oom_kill_all_tasks is 
defined.  It could very well behave as memory.oom_kill_all_eligible_tasks 
when enacted upon.

> You can think about the oom_kill_all_tasks like the panic_on_oom,
> but on a cgroup level. It should _guarantee_, that in case of oom
> the whole cgroup will be destroyed completely, and will not remain
> in a non-consistent state.
> 

Only CAP_SYS_ADMIN has this ability to set /proc/pid/oom_score_adj to 
OOM_SCORE_ADJ_MIN, so it preserves the ability to change that setting, if 
needed, when it sets memory.oom_kill_all_tasks.  If a user gains 
permissions to change memory.oom_kill_all_tasks, I disagree it should 
override the CAP_SYS_ADMIN setting of /proc/pid/oom_score_adj.

I would prefer not to exclude oom disabled processes to their own sibling 
cgroups because they would require their own reservation with cgroup v2 
and it makes the single hierarchy model much more difficult to arrange 
alongside cpusets, for example.

> The model you're describing is based on a trust given to these oom-unkillable
> processes on system level. But we can't really trust some unknown processes
> inside a cgroup that they will be able to do some useful work and finish
> in a reasonable time; especially in case of a global memory shortage.

Yes, we prefer to panic instead of sshd, for example, being oom killed.  
We trust that sshd, as well as our own activity manager and security 
daemons, will do useful work and that we never want the kernel to 
do this.  I'm not sure why you are describing processes that CAP_SYS_ADMIN 
has set to be oom disabled as unknown processes.

I'd be interested in hearing the opinions of others related to a per-memcg 
knob being allowed to override the setting of the sysadmin.


Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

2017-08-20 Thread David Rientjes
On Thu, 17 Aug 2017, Roman Gushchin wrote:

> Hi David!
> 
> Please, find an updated version of docs patch below.
> 

Looks much better, thanks!  I think the only pending issue is discussing 
the relationship of memory.oom_kill_all_tasks with /proc/pid/oom_score_adj 
== OOM_SCORE_ADJ_MIN.


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-15 Thread David Rientjes
On Tue, 15 Aug 2017, Roman Gushchin wrote:

> > I'm curious about the decision made in this conditional and how 
> > oom_kill_memcg_member() ignores task->signal->oom_score_adj.  It means 
> > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it 
> > should otherwise be disabled.
> > 
> > It's undocumented in the changelog, but I'm questioning whether it's the 
> > right decision.  Doesn't it make sense to kill all tasks that are not oom 
> > disabled, and allow the user to still protect certain processes by their 
> > /proc/pid/oom_score_adj setting?  Otherwise, there's no way to do that 
> > protection without a sibling memcg and its own reservation of memory.  I'm 
> > thinking about a process that governs jobs inside the memcg and if there 
> > is an oom kill, it wants to do logging and any cleanup necessary before 
> > exiting itself.  It seems like a powerful combination if coupled with oom 
> > notification.
> 
> Good question!
> I think, that an ability to override any oom_score_adj value and get all tasks
> killed is more important, than an ability to kill all processes with some
> exceptions.
> 

I'm disagreeing because getting all tasks killed is not necessarily 
something that only the kernel can do.  If all processes are oom disabled, 
that's a configuration issue done by sysadmin and the kernel should decide 
to kill the next largest memory cgroup or lower priority memory cgroup.  
It's not killing things like sshd that intentionally oom disable 
themselves.

You could argue that having an oom disabled process attached to these 
memcgs in the first place is also a configuration issue, but the problem 
is that in cgroup v2, with its restriction that processes may only be attached 
at the leaf cgroups, there is no competition for memory in this case.  
I must assign memory resources to that sshd, or "Activity Manager" 
described by the cgroup v1 documentation, just to prevent it from being 
killed.

I think the latter of what you describe, killing all processes with some 
exceptions, is actually quite powerful.  I can guarantee that processes 
that set themselves to oom disabled are really oom disabled and I don't 
need to work around that in the cgroup hierarchy only because of this 
caveat.  I can also oom disable my Activity Manger that wants to wait on 
oom notification and collect the oom kill logs, raise notifications, and 
perhaps restart the process that it manages.

> In your example someone still needs to look after the remaining process,
> and kill it after some timeout, if it will not quit by itself, right?
> 

No, it can restart the process that was oom killed; or it can be sshd and 
I can still ssh into my machine.

> The special treatment of the -1000 value (without oom_kill_all_tasks)
> is required only to not to break the existing setups.
> 

I think as a general principle that allowing an oom disabled process to be 
oom killed is incorrect and if you really do want these to be killed, then 
(1) either your oom_score_adj is already wrong or (2) you can wait on oom 
notification and exit.


Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

2017-08-15 Thread David Rientjes
On Tue, 15 Aug 2017, Roman Gushchin wrote:

> > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > > index dec5afdaa36d..22108f31e09d 100644
> > > --- a/Documentation/cgroup-v2.txt
> > > +++ b/Documentation/cgroup-v2.txt
> > > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
> > > 5-2-1. Memory Interface Files
> > > 5-2-2. Usage Guidelines
> > > 5-2-3. Memory Ownership
> > > +   5-2-4. Cgroup-aware OOM Killer
> > 
> > Random curiousness, why cgroup-aware oom killer and not memcg-aware oom 
> > killer?
> 
> I don't think we use the term "memcg" somewhere in v2 docs.
> Do you think that "Memory cgroup-aware OOM killer" is better?
> 

I think it would be better to not describe it as its own entity, but 
rather a part of how the memory cgroup works, so simply describing it in 
section 5-2, perhaps as its own subsection, as how the oom killer works 
when using the memory cgroup is sufficient.  I wouldn't separate it out as 
a distinct cgroup feature in the documentation.

> > > + cgroups.  The default is "0".
> > > +
> > > + Defines whether the OOM killer should treat the cgroup
> > > + as a single entity during the victim selection.
> > 
> > Isn't this true independent of the memory.oom_kill_all_tasks setting?  
> > The cgroup aware oom killer will consider memcg's as logical units when 
> > deciding what to kill with or without memory.oom_kill_all_tasks, right?
> > 
> > I think you cover this fact in the cgroup aware oom killer section below 
> > so this might result in confusion if described alongside a setting of
> > memory.oom_kill_all_tasks.
> > 

I assume this is fixed so that it's documented that memory cgroups are 
considered logical units by the oom killer and that 
memory.oom_kill_all_tasks is separate?  The former defines the policy on 
how a memory cgroup is targeted and the latter defines the mechanism it 
uses to free memory.

> > > + If set, OOM killer will kill all belonging tasks in
> > > + corresponding cgroup is selected as an OOM victim.
> > 
> > Maybe
> > 
> > "If set, the OOM killer will kill all threads attached to the memcg if 
> > selected as an OOM victim."
> > 
> > is better?
> 
> Fixed to the following (to conform with core v2 concepts):
>   If set, OOM killer will kill all processes attached to the cgroup
>   if selected as an OOM victim.
> 

Thanks.

> > > +Cgroup-aware OOM Killer
> > > +~~~
> > > +
> > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > +It means that it treats memory cgroups as first class OOM entities.
> > > +
> > > +Under OOM conditions the memory controller tries to make the best
> > > +choise of a victim, hierarchically looking for the largest memory
> > > +consumer. By default, it will look for the biggest task in the
> > > +biggest leaf cgroup.
> > > +
> > > +Be default, all cgroups have oom_priority 0, and OOM killer will
> > > +chose the largest cgroup recursively on each level. For non-root
> > > +cgroups it's possible to change the oom_priority, and it will cause
> > > +the OOM killer to look athe the priority value first, and compare
> > > +sizes only of cgroups with equal priority.
> > 
> > Maybe some description of "largest" would be helpful here?  I think you 
> > could briefly describe what is accounted for in the decisionmaking.
> 
> I'm afraid that it's too implementation-defined to be described.
> Do you have an idea, how to describe it without going too much into details?
> 

The point is that "largest cgroup" is ambiguous here: largest in what 
sense?  The cgroup with the largest number of processes attached?  Using 
the largest amount of memory?

I think the documentation should clearly define that the oom killer 
selects the memory cgroup that has the most memory managed at each level.


Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

2017-08-14 Thread David Rientjes
On Mon, 14 Aug 2017, Roman Gushchin wrote:

> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index dec5afdaa36d..22108f31e09d 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
> 5-2-1. Memory Interface Files
> 5-2-2. Usage Guidelines
> 5-2-3. Memory Ownership
> +   5-2-4. Cgroup-aware OOM Killer

Random curiousness, why cgroup-aware oom killer and not memcg-aware oom 
killer?

>   5-3. IO
> 5-3-1. IO Interface Files
> 5-3-2. Writeback
> @@ -1002,6 +1003,37 @@ PAGE_SIZE multiple when read back.
>   high limit is used and monitored properly, this limit's
>   utility is limited to providing the final safety net.
>  
> +  memory.oom_kill_all_tasks
> +
> + A read-write single value file which exits on non-root

s/exits/exists/

> + cgroups.  The default is "0".
> +
> + Defines whether the OOM killer should treat the cgroup
> + as a single entity during the victim selection.

Isn't this true independent of the memory.oom_kill_all_tasks setting?  
The cgroup aware oom killer will consider memcg's as logical units when 
deciding what to kill with or without memory.oom_kill_all_tasks, right?

I think you cover this fact in the cgroup aware oom killer section below 
so this might result in confusion if described alongside a setting of
memory.oom_kill_all_tasks.

> +
> + If set, OOM killer will kill all belonging tasks in
> + corresponding cgroup is selected as an OOM victim.

Maybe

"If set, the OOM killer will kill all threads attached to the memcg if 
selected as an OOM victim."

is better?

> +
> + Be default, OOM killer respect /proc/pid/oom_score_adj value
> + -1000, and will never kill the task, unless oom_kill_all_tasks
> + is set.
> +
> +  memory.oom_priority
> +
> + A read-write single value file which exits on non-root

s/exits/exists/

> + cgroups.  The default is "0".
> +
> + An integer number within the [-1, 1] range,
> + which defines the order in which the OOM killer selects victim
> + memory cgroups.
> +
> + OOM killer prefers memory cgroups with larger priority if they
> + are populated with elegible tasks.

s/elegible/eligible/

> +
> + The oom_priority value is compared within sibling cgroups.
> +
> + The root cgroup has the oom_priority 0, which cannot be changed.
> +
>memory.events
>   A read-only flat-keyed file which exists on non-root cgroups.
>   The following entries are defined.  Unless specified
> @@ -1206,6 +1238,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of 
> memory areas
>  belonging to the affected files to ensure correct memory ownership.
>  
>  
> +Cgroup-aware OOM Killer
> +~~~
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats memory cgroups as first class OOM entities.
> +
> +Under OOM conditions the memory controller tries to make the best
> +choise of a victim, hierarchically looking for the largest memory
> +consumer. By default, it will look for the biggest task in the
> +biggest leaf cgroup.
> +
> +Be default, all cgroups have oom_priority 0, and OOM killer will
> +chose the largest cgroup recursively on each level. For non-root
> +cgroups it's possible to change the oom_priority, and it will cause
> +the OOM killer to look athe the priority value first, and compare
> +sizes only of cgroups with equal priority.

Maybe some description of "largest" would be helpful here?  I think you 
could briefly describe what is accounted for in the decisionmaking.

s/athe/at the/

Reading through this, it makes me wonder if doing s/cgroup/memcg/ over 
most of it would be better.

> +
> +But a user can change this behavior by enabling the per-cgroup
> +oom_kill_all_tasks option. If set, it causes the OOM killer treat
> +the whole cgroup as an indivisible memory consumer. In case if it's
> +selected as on OOM victim, all belonging tasks will be killed.
> +
> +Tasks in the root cgroup are treated as independent memory consumers,
> +and are compared with other memory consumers (e.g. leaf cgroups).
> +The root cgroup doesn't support the oom_kill_all_tasks feature.
> +
> +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> +the memory controller considers only cgroups belonging to the sub-tree
> +of the OOM'ing cgroup.
> +
>  IO
>  --
>  


Re: [v5 3/4] mm, oom: introduce oom_priority for memory cgroups

2017-08-14 Thread David Rientjes
On Mon, 14 Aug 2017, Roman Gushchin wrote:

> Introduce a per-memory-cgroup oom_priority setting: an integer number
> within the [-1, 1] range, which defines the order in which
> the OOM killer selects victim memory cgroups.
> 
> OOM killer prefers memory cgroups with larger priority if they are
> populated with elegible tasks.
> 
> The oom_priority value is compared within sibling cgroups.
> 
> The root cgroup has the oom_priority 0, which cannot be changed.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: David Rientjes <rient...@google.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Tested-by: David Rientjes <rient...@google.com>


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-14 Thread David Rientjes
On Mon, 14 Aug 2017, Roman Gushchin wrote:

> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 8a266e2be5a6..b7ec3bd441be 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -39,6 +39,7 @@ struct oom_control {
>   unsigned long totalpages;
>   struct task_struct *chosen;
>   unsigned long chosen_points;
> + struct mem_cgroup *chosen_memcg;
>  };
>  
>  extern struct mutex oom_lock;
> @@ -79,6 +80,8 @@ extern void oom_killer_enable(void);
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
>  /* sysctls */
>  extern int sysctl_oom_dump_tasks;
>  extern int sysctl_oom_kill_allocating_task;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index df6f63ee95d6..0b81dc55c6ac 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2639,6 +2639,181 @@ static inline bool memcg_has_children(struct 
> mem_cgroup *memcg)
>   return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +   const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}

I'm indifferent to the memcg evaluation criteria used to determine which 
memcg should be selected over others with the same priority, others may 
feel differently.

> +
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +const nodemask_t *nodemask)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int elegible = 0;
> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + /*
> +  * If there are no tasks, or all tasks have oom_score_adj set
> +  * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> +  * don't select this memory cgroup.
> +  */
> + if (!elegible &&
> + (memcg->oom_kill_all_tasks ||
> +  task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> + elegible = 1;

I'm curious about the decision made in this conditional and how 
oom_kill_memcg_member() ignores task->signal->oom_score_adj.  It means 
that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it 
should otherwise be disabled.

It's undocumented in the changelog, but I'm questioning whether it's the 
right decision.  Doesn't it make sense to kill all tasks that are not oom 
disabled, and allow the user to still protect certain processes by their 
/proc/pid/oom_score_adj setting?  Otherwise, there's no way to do that 
protection without a sibling memcg and its own reservation of memory.  I'm 
thinking about a process that governs jobs inside the memcg and if there 
is an oom kill, it wants to do logging and any cleanup necessary before 
exiting itself.  It seems like a powerful combination if coupled with oom 
notification.
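
Concretely, the alternative I have in mind would look something like this
against the quoted hunk (untested, only to illustrate the semantics):

	/*
	 * The memcg is eligible as long as it has at least one attached
	 * task that is not oom disabled; oom_kill_all_tasks no longer
	 * overrides OOM_SCORE_ADJ_MIN, it only changes what is killed
	 * once the memcg has been selected.
	 */
	if (!elegible &&
	    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
		elegible = 1;

with oom_kill_memcg_member() then skipping OOM_SCORE_ADJ_MIN tasks when it
delivers SIGKILL.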

Also, s/elegible/eligible/

Otherwise, looks good!


Re: [v5 1/4] mm, oom: refactor the oom_kill_process() function

2017-08-14 Thread David Rientjes
On Mon, 14 Aug 2017, Roman Gushchin wrote:

> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> an intention to re-use the the second half: __oom_kill_process().
> 
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for the each task, as well as play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: David Rientjes <rient...@google.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: David Rientjes <rient...@google.com>


Re: [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

2017-08-08 Thread David Rientjes
On Wed, 26 Jul 2017, Roman Gushchin wrote:

> +Cgroup-aware OOM Killer
> +~~~
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats memory cgroups as first class OOM entities.
> +
> +Under OOM conditions the memory controller tries to make the best
> +choise of a victim, hierarchically looking for the largest memory
> +consumer. By default, it will look for the biggest task in the
> +biggest leaf cgroup.
> +
> +Be default, all cgroups have oom_priority 0, and OOM killer will
> +chose the largest cgroup recursively on each level. For non-root
> +cgroups it's possible to change the oom_priority, and it will cause
> +the OOM killer to look athe the priority value first, and compare
> +sizes only of cgroups with equal priority.
> +
> +But a user can change this behavior by enabling the per-cgroup
> +oom_kill_all_tasks option. If set, it causes the OOM killer treat
> +the whole cgroup as an indivisible memory consumer. In case if it's
> +selected as on OOM victim, all belonging tasks will be killed.
> +
> +Tasks in the root cgroup are treated as independent memory consumers,
> +and are compared with other memory consumers (e.g. leaf cgroups).
> +The root cgroup doesn't support the oom_kill_all_tasks feature.
> +
> +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> +the memory controller considers only cgroups belonging to the sub-tree
> +of the OOM'ing cgroup.
> +
>  IO
>  --

Thanks very much for following through with this.

As described in http://marc.info/?l=linux-kernel&m=149980660611610 this is 
very similar to what we do for priority based oom killing.

I'm wondering your comments on extending it one step further, however: 
include process priority as part of the selection rather than simply memcg 
priority.

memory.oom_priority will dictate which memcg the kill will originate from, 
but processes have no ability to specify that they should actually be 
killed as opposed to a leaf memcg.  I'm not sure how important this is for 
your usecase, but we have found it useful to be able to specify process 
priority as part of the decisionmaking.

At each level of consideration, we simply kill a process with lower 
/proc/pid/oom_priority if there are no memcgs with a lower 
memory.oom_priority.  This allows us to define the exact process that will 
be oom killed, absent oom_kill_all_tasks, and not require that the process 
be attached to a leaf memcg.  Most notably these are processes that are best 
effort: stats collection, logging, etc.

Do you think it would be helpful to introduce per-process oom priority as 
well?


Re: [v4 3/4] mm, oom: introduce oom_priority for memory cgroups

2017-08-08 Thread David Rientjes
On Wed, 26 Jul 2017, Roman Gushchin wrote:

> Introduce a per-memory-cgroup oom_priority setting: an integer number
> within the [-1, 1] range, which defines the order in which
> the OOM killer selects victim memory cgroups.
> 
> OOM killer prefers memory cgroups with larger priority if they are
> populated with elegible tasks.
> 
> The oom_priority value is compared within sibling cgroups.
> 
> The root cgroup has the oom_priority 0, which cannot be changed.
> 

Awesome!  Very excited to see that you implemented this suggestion and it 
is similar to priority based oom killing that we have done.  I think this 
kind of support is long overdue in the oom killer.

Comment inline.

> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Vladimir Davydov <vdavydov@gmail.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: David Rientjes <rient...@google.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org
> ---
>  include/linux/memcontrol.h |  3 +++
>  mm/memcontrol.c| 55 
> --
>  2 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b21bbb0edc72..d31ac58e08ad 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -206,6 +206,9 @@ struct mem_cgroup {
>   /* cached OOM score */
>   long oom_score;
>  
> + /* OOM killer priority */
> + short oom_priority;
> +
>   /* handle for "memory.events" */
>   struct cgroup_file events_file;
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ba72d1cf73d0..2c1566995077 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2710,12 +2710,21 @@ static void select_victim_memcg(struct mem_cgroup 
> *root, struct oom_control *oc)
>   for (;;) {
>   struct cgroup_subsys_state *css;
>   struct mem_cgroup *memcg = NULL;
> + short prio = SHRT_MIN;
>   long score = LONG_MIN;
>  
>   css_for_each_child(css, &root->css) {
>   struct mem_cgroup *iter = mem_cgroup_from_css(css);
>  
> - if (iter->oom_score > score) {
> + if (iter->oom_score == 0)
> + continue;
> +
> + if (iter->oom_priority > prio) {
> + memcg = iter;
> + prio = iter->oom_priority;
> + score = iter->oom_score;
> + } else if (iter->oom_priority == prio &&
> +iter->oom_score > score) {
>   memcg = iter;
>   score = iter->oom_score;
>   }

Your tiebreaking is done based on iter->oom_score, which I suppose makes 
sense given that the oom killer traditionally tries to kill from the 
largest memory hogging process.

We actually tiebreak on a timestamp of memcg creation and prefer to kill 
from the newer memcg when iter->oom_priority is the same.  The reasoning 
is that we schedule jobs on a machine that have an inherent priority but 
is unaware of other jobs running at the same priority and so the kill 
decision, if based on iter->oom_score, may differ based on current memory 
usage.

I'm not necessarily arguing against using iter->oom_score, but was 
wondering if you would also find that tiebreaking based on a timestamp 
when priorities are the same is a more clear semantic to describe?  It's 
similar to how the system oom killer broke ties based on which task_struct 
appeared later in the tasklist when memory usage was the same.

Your approach makes oom killing less likely in the near term since it 
kills a more memory hogging memcg, but has the potential to lose less 
work.  A timestamp based approach loses the least amount of work by 
preferring to kill newer memcgs but oom killing may be more frequent if 
smaller child memcgs are killed.  I would argue the former is the 
responsibility of the user for using the same priority.
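
As a toy sketch of that comparison (illustrative types only, following the
patch's semantics where a larger oom_priority is preferred as the victim):

#include <stdbool.h>
#include <stdio.h>

struct memcg {
        const char *name;
        int oom_priority;
        unsigned long long created;     /* monotonic creation timestamp */
};

/* true if @a should be preferred over @b as the oom victim */
static bool prefer_victim(const struct memcg *a, const struct memcg *b)
{
        if (a->oom_priority != b->oom_priority)
                return a->oom_priority > b->oom_priority;
        /* equal priority: prefer the newer memcg to lose the least work */
        return a->created > b->created;
}

int main(void)
{
        struct memcg older = { "batch-old", 0, 100 };
        struct memcg newer = { "batch-new", 0, 200 };

        printf("prefer %s\n",
               prefer_victim(&newer, &older) ? newer.name : older.name);
        return 0;
}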

> @@ -2782,7 +2791,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control 
> *oc)
>* For system-wide OOMs we should consider tasks in the root cgroup
>* with oom_score larger than oc->chosen_points.
>*/
> - if (!oc->memcg) {
> + if (!oc->memcg && !(oc->chosen_memcg &&

Re: [v4 2/4] mm, oom: cgroup-aware OOM killer

2017-08-08 Thread David Rientjes
On Tue, 1 Aug 2017, Roman Gushchin wrote:

> > To the rest of the patch. I have to say I do not quite like how it is
> > implemented. I was hoping for something much simpler which would hook
> > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > then we would update the cumulative memcg badness (more specifically the
> > badness of the topmost parent with kill-all flag). Memcg will then
> > compete with existing self contained tasks (oom_badness will have to
> > tell whether points belong to a task or a memcg to allow the caller to
> > deal with it). But it shouldn't be much more complex than that.
> 
> I'm not sure, it will be any simpler. Basically I'm doing the same:
> the difference is that you want to iterate over tasks and for each
> task traverse the memcg tree, update per-cgroup oom score and find
> the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> traverse the cgroup tree, and for each leaf cgroup iterate over processes.
> 
> Also, please note, that even without the kill-all flag the decision is made
> on per-cgroup level (except tasks in the root cgroup).
> 

I think your implementation is preferred and is actually quite simple to 
follow, and I would encourage you to follow through with it.  It has a 
similar implementation to what we have done for years to kill a process 
from a leaf memcg.

I did notice that oom_kill_memcg_victim() calls directly into 
__oom_kill_process(), however, so we lack the traditional oom killer 
output that shows memcg usage and potential tasklist.  I think we should 
still be dumping this information to the kernel log so that we can see a 
breakdown of charged memory.


Re: [v3 2/6] mm, oom: cgroup-aware OOM killer

2017-07-12 Thread David Rientjes
On Wed, 12 Jul 2017, Roman Gushchin wrote:

> > It's a no-op if nobody sets up priorities or the system-wide sysctl is 
> > disabled.  Presumably, as in our model, the Activity Manager sets the 
> > sysctl and is responsible for configuring the priorities if present.  All 
> > memcgs at the sibling level or subcontainer level remain the default if 
> > not defined by the chown'd user, so this falls back to an rss model for 
> > backwards compatibility.
> 
> Hm, this is interesting...
> 
> What I'm thinking about, is that we can introduce the following model:
> each memory cgroup has an integer oom priority value, 0 be default.
> Root cgroup priority is always 0, other cgroups can have both positive
> or negative priorities.
> 

For our purposes we use a range of [0, 1] for the per-process oom 
priority; 1 implies the process is not oom killable, 5000 is the 
default.  We use a range of [0, ] for the per-memcg oom priority since 
memcgs cannot disable themselves from oom killing (although they could oom 
disable all attached processes).  We can obviously remap our priorities to 
whatever we decide here, but I think we should give ourselves more room 
and provide 1 priorities at the minimum (we have 5000 true priorities 
plus overlimit bias).  I'm not sure that negative priorities make sense in 
this model, is there a strong reason to prefer [-5000, 5000] over 
[0, 1]?

And, yes, the root memcg remains a constant oom priority and is never 
actually checked.

> During OOM victim selection we compare cgroups on each hierarchy level
> based on priority and size, if there are several cgroups with equal priority.
> Per-task oom_score_adj will affect task selection inside a cgroup if
> oom_kill_all_tasks is not set. -1000 special value will also completely
> protect a task from being killed, if only oom_kill_all_tasks is not set.
> 

If there are several cgroups of equal priority, we prefer the one that was 
created the most recently just to avoid losing work that has been done for 
a long period of time.  But the key in this proposal is that we _always_ 
continue to iterate the memcg hierarchy until we find a process attached 
to a memcg with the lowest priority relative to sibling cgroups, if any.

To adapt your model to this proposal, memory.oom_kill_all_tasks would only 
be effective if there are no descendant memcgs.  In that case, iteration 
stops anyway and in my model we kill the process with the lowest 
per-process priority.  This could trivially check 
memory.oom_kill_all_tasks and kill everything, and I'm happy to support 
that feature since we have had a need for it in the past as well.

We should talk about when this priority-based scoring becomes effective.  
We enable it by default in our kernel, but it could be guarded with a VM 
sysctl if necessary to enact a system-wide policy.


Re: [v3 2/6] mm, oom: cgroup-aware OOM killer

2017-07-11 Thread David Rientjes
On Tue, 11 Jul 2017, Roman Gushchin wrote:

> > Yes, the original motivation was to limit killing to a single process, if 
> > possible.  To do that, we kill the process with the largest rss to free 
> > the most memory and rely on the user to configure /proc/pid/oom_score_adj 
> > if something else should be prioritized.
> > 
> > With containerization and overcommit of system memory, we concur that 
> > killing the single largest process isn't always preferable and neglects 
> > the priority of its memcg.  Your motivation seems to be to provide 
> > fairness between one memcg with a large process and one memcg with a large 
> > number of small processes; I'm curious if you are concerned about the 
> > priority of a memcg hierarchy (how important that "job" is) or whether you 
> > are strictly concerned with "largeness" of memcgs relative to each other.
> 
> I'm pretty sure we should provide some way to prioritize some cgroups
> over other (in terms of oom killer preferences), but I'm not 100% sure yet,
> what's the best way to do it. I've suggested something similar to the existing
> oom_score_adj for tasks, mostly to folow the existing design.
> 
> One of the questions to answer in priority-based model is
> how to compare tasks in the root cgroup with cgroups?
> 

We do this with an alternate scoring mechanism that is purely priority 
based and tiebreaks on largest rss.  An additional tunable is added 
for each process, under /proc/pid, and also to the memcg hierarchy, and is 
enabled via a system-wide sysctl.  A way to mesh the two scoring 
mechanisms together would be helpful, but for our purposes we don't use 
oom_score_adj at all, other than converting OOM_SCORE_ADJ_MIN to still be 
oom disabled when written by third party apps.
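
The conversion itself is tiny, something like this in the 
/proc/pid/oom_score_adj write path (oom_prio and OOM_PRIO_MAX are our 
illustrative names, not upstream):

	/* a legacy oom disable must stay an oom disable in the new model */
	if (oom_score_adj == OOM_SCORE_ADJ_MIN)
		task->signal->oom_prio = OOM_PRIO_MAX;	/* i.e. not oom killable */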

For memcg oom conditions, iteration of the hierarchy begins at the oom 
memcg.  For system oom conditions, this is the root memcg.

All processes attached to the oom memcg have their priority based value 
and this is compared to all child memcg's priority value at that level.  
If a process has the lowest priority, it is killed and we're done; we 
could implement a "kill all" mechanism for this memcg that is checked 
before the process is killed.

If a memcg has the lowest priority compared to attached processes, it is 
iterated as well, and so on throughout the memcg hierarchy until we find 
the lowest priority process in the lowest priority leaf memcg.  This way, 
we can fully control which process is killed for both system and memcg oom 
conditions.  I can easily post patches for this, we have used it for 
years.

> > These are two different things, right?  We can adjust how the system oom 
> > killer chooses victims when memcg hierarchies overcommit the system to not 
> > strictly prefer the single process with the largest rss without killing 
> > everything attached to the memcg.
> 
> They are different, and I thought about providing two independent knobs.
> But after all I haven't found enough real life examples, where it can be 
> useful.
> Can you provide something here?
> 

Yes, we have users who we chown their memcg hierarchy to and have full 
control over setting up their hierarchy however we want.  Our "Activity 
Manager", using Documentation/cgroup-v1/memory.txt terminology, only is 
aware of the top level memcg that was chown'd to the user.  That user runs 
a series of batch jobs that are submitted to it and each job is 
represented as a subcontainer to enforce strict limits on the amount of 
memory that job can use.  When it becomes oom, we have found that it is 
preferable to oom kill the entire batch job rather than leave it in an 
inconsistent state, so enabling such a knob here would be helpful.

Other top-level jobs are fine with individual processes being oom killed.  
It can be a low priority process for which they have full control over 
defining the priority through the new per-process and per-memcg value 
described above.  Easy example is scraping logs periodically or other 
best-effort tasks like cleanup.  They can happily be oom killed and 
rescheduled without taking down the entire first-class job.

> Also, they are different only for non-leaf cgroups; leaf cgroups
> are always treated as indivisible memory consumers during victim selection.
> 
> I assume, that containerized systems will always set oom_kill_all_tasks for
> top-level container memory cgroups. By default it's turned off
> to provide backward compatibility with current behavior and avoid
> excessive kills and support oom_score_adj==-1000 (I've added this to v4,
> will post soon).
> 

We certainly would not be enabling it for top-level memcgs, there would be 
no way that we could because we have best-effort processes, but we would 
like to enable it for small batch jobs that are run on behalf of a user in 
their own subcontainer.  We have had this usecase for ~3 years and solely 
because of the problem that you pointed out earlier: it is often much more 
reliable for the kernel to kill everything attached to the memcg than to 
leave the job in an inconsistent state.

Re: [v3 2/6] mm, oom: cgroup-aware OOM killer

2017-07-10 Thread David Rientjes
On Wed, 21 Jun 2017, Roman Gushchin wrote:

> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers. There are two main issues:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with a huge
> number of small processes.
> 

Yes, the original motivation was to limit killing to a single process, if 
possible.  To do that, we kill the process with the largest rss to free 
the most memory and rely on the user to configure /proc/pid/oom_score_adj 
if something else should be prioritized.

With containerization and overcommit of system memory, we concur that 
killing the single largest process isn't always preferable and neglects 
the priority of its memcg.  Your motivation seems to be to provide 
fairness between one memcg with a large process and one memcg with a large 
number of small processes; I'm curious if you are concerned about the 
priority of a memcg hierarchy (how important that "job" is) or whether you 
are strictly concerned with "largeness" of memcgs relative to each other.

> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much more safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 

We agree.

> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> in the isolation.
> 

This should only be a consequence of overcommiting memcgs at the top level 
so the system oom killer is actually ever invoked, otherwise per-process 
oom_score_adj works well for memcg oom killing.

> To address these issues, cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing corresponding task(s). The difference from
> the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
> 
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.
> 
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it's selected
> as an OOM victim, all belonging tasks will be killed.
> 

These are two different things, right?  We can adjust how the system oom 
killer chooses victims when memcg hierarchies overcommit the system to not 
strictly prefer the single process with the largest rss without killing 
everything attached to the memcg.

Separately: do you not intend to support memcg priorities at all, but 
rather strictly consider the "largeness" of a memcg versus other memcgs?

In our methodology, each memcg is assigned a priority value and the 
iteration of the hierarchy simply compares and visits the memcg with the 
lowest priority at each level and then selects the largest process to 
kill.  This could also support a "kill-all" knob.

	struct mem_cgroup *memcg = root_mem_cgroup;
	struct mem_cgroup *low_memcg, *child;
	unsigned long low_priority;

next:
	low_memcg = NULL;
	low_priority = ULONG_MAX;
	/* pseudocode: find the lowest priority child memcg at this level */
	for_each_child_of_memcg(memcg, child) {
		unsigned long prio = memcg_oom_priority(child);

		if (prio < low_priority) {
			low_memcg = child;
			low_priority = prio;
		}
	}
	if (low_memcg) {
		/* descend into the lowest priority subtree and repeat */
		memcg = low_memcg;
		goto next;
	}
	/* leaf of the priority walk: kill the largest process in this memcg */
	oom_kill_process_from_memcg(memcg);

So this is a priority based model that is different than your aggregate 
usage model but I think it allows userspace to define a more powerful 
policy.  We certainly may want to kill from a memcg with a single large 
process, or we may want to kill from a memcg with several small processes, 
it depends on the importance of that job.


Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument

2017-06-06 Thread David Rientjes
On Tue, 6 Jun 2017, Roman Gushchin wrote:

> Hi David!
> 
> Thank you for sharing this!
> 
> It's very interesting, and it looks like,
> it's not that far from what I've suggested.
> 
> So we definitily need to come up with some common solution.
> 

Hi Roman,

Yes, definitely.  I could post a series of patches to do everything that 
was listed in my email sans the fully inclusive kmem accounting, which may 
be pursued at a later date, if it would be helpful to see where there is 
common ground?

Another question is what you think about userspace oom handling?  We 
implement our own oom kill policies in userspace for both the system and 
for user-controlled memcg hierarchies because it often does not match the 
kernel implementation and there is some action that can be taken other 
than killing a process.  Have you tried to implement functionality to do 
userspace oom handling, or are you considering it?  This is the main 
motivation behind allowing an oom delay to be configured.


Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument

2017-06-04 Thread David Rientjes
We use a heavily modified system and memcg oom killer and I'm wondering
if there is some opportunity for collaboration because we may have some
shared goals.

I can summarize how we currently use the oom killer at a high level so
that it is not overwhelming with implementation details and give some
rationale for why we have converged onto this strategy over the period of
a few years.

For victim selection, we have strict priority based oom killing both at
the memcg level and the process level.

Each process has its own "badness" value that is independent of
oom_score_adj, although some conversion is done if a third-party thread
chooses to disable itself from oom killing for backwards compatibility.
Lower values are more preferable victims, but that detail should not
matter significantly.  If two processes share the same badness value,
tiebreaks are done by selecting the largest rss.
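
As a sketch, the victim comparison in the selection loop looks roughly 
like this (oom_badness here is our per-process field, not upstream):

	/* lower badness means a more preferable victim; ties go to larger rss */
	if (p->signal->oom_badness < chosen->signal->oom_badness ||
	    (p->signal->oom_badness == chosen->signal->oom_badness &&
	     get_mm_rss(p->mm) > get_mm_rss(chosen->mm)))
		chosen = p;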

Each memcg in a hierarchy also has its own badness value which
semantically means the same as the per-process value, although it
considers the entire memcg as a unit, similar to your approach, when
iterating the hierarchy to choose a process.  The benefit of the
per-memcg and per-process approach is that you can kill the lowest
priority process from the lowest priority memcg.

The above scoring is enabled with a VM sysctl for the system and is used
for both system (global) and memcg oom kills.  For system overcommit,
this means we can kill the lowest priority job on the system; for memcg,
we can allow users to define their oom kill priorities at each level of
their own hierarchy.

When the system or root of an oom memcg hierarchy encounters its limit,
we iterate each level of the memcg hierarchy to find the lowest priority
job.  This is done by comparing the badness of the sibling memcgs at
each level, finding the lowest, and iterating that subtree.  If there are
lower priority processes per the per-process badness value compared to
all sibling memcgs, that process is killed.

We also have complete userspace oom handling support.  This complements
the existing memory.oom_control notification when a memcg is oom with a
separate notifier that notifies when the kernel has oom killed a process.
It is possible to delay the oom killer from killing a process for memcg
oom kills with a configurable, per-memcg, oom delay.  If set, the kernel
will wait for userspace to respond to its oom notification and effect its
own policy decisions until memory is uncharged to that memcg hierarchy,
or the oom delay expires.  If the oom delay expires, the kernel oom
killer kills a process based on badness.
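
The delay is just a check in the memcg oom path before falling back to 
the kernel kill, roughly (oom_delay_ms and oom_delay_expire are our 
fields, not upstream):

	/* give the userspace handler a window to act on its notification */
	if (memcg->oom_delay_ms && time_before(jiffies, memcg->oom_delay_expire)) {
		schedule_timeout_killable(HZ / 10);
		return false;	/* retry the charge instead of killing */
	}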

Our oom kill notification file used to get an fd to register with
cgroup.event_control also provides oom kill statistics based on system,
memcg, local, hierarchical, and user-induced oom kills when read().
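
The registration side is the standard cgroup-v1 eventfd mechanism from 
Documentation/cgroup-v1/memory.txt; a minimal userspace sketch (mount 
point and cgroup path assumed, needs <sys/eventfd.h>, <fcntl.h>, 
<unistd.h>, <stdio.h>, <string.h>):

	int efd = eventfd(0, 0);
	int nfd = open("/sys/fs/cgroup/memory/job/memory.oom_control", O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/job/cgroup.event_control", O_WRONLY);
	char buf[32];

	snprintf(buf, sizeof(buf), "%d %d", efd, nfd);
	write(cfd, buf, strlen(buf));
	/* read(efd, ...) now blocks until the next oom event fires */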

We also have a convenient "kill all" knob that userspace can write when
handling oom conditions to iterate all threads attached to a particular
memcg and kill them.  This is merely to prevent races where userspace
does the oom killing itself, which is not problematic in itself, but
additional tasks continue to be attached to an oom memcg.

A caveat here is that we also support fully inclusive kmem accounting to
memcg hierarchies, so we call the oom killer as part of the memcg charge
path rather than only upon returning from fault with VM_FAULT_OOM.  We
have our own oom killer livelock detection that isn't necessarily
important in this thread, but we haven't encountered a situation where we
livelock by calling the oom killer during charge, and this is a
requirement for memcg charging as part of slab allocation.

I could post many patches to implement all of this functionality that we
have used for a few years, but I first wanted to send this email to see
if there is any common ground or to better understand your methodology
for using the kernel oom killer for both system and memcg oom kills.

Otherwise, very interesting stuff!


Re: [PATCH 0/2] memory_hotplug: introduce config and command line options to set the default onlining policy

2016-04-20 Thread David Rientjes
On Tue, 19 Apr 2016, Vitaly Kuznetsov wrote:

> > I'd personally disagree that we need more and more config options to take 
> > care of something that an initscript can easily do and most distros 
> > already have their own initscripts that this can be added to.  I don't see 
> > anything that the config option adds.
> 
> Yes, but why does every distro need to solve the exact same issue by 
> a distro-specific init script when we can allow setting a reasonable
> default in the kernel?
> 

No, only distros that want to change the long-standing default which is 
"offline" since they apparently aren't worried about breaking existing 
userspace.

Changing defaults is always risky business in the kernel, especially when 
it's long standing.  If the default behavior is changeable, userspace 
needs to start testing for that and acting accordingly if it actually 
wants to default to offline (and there are existing tools that assume the 
long-standing default).  The end result is that the kernel default doesn't 
matter anymore, we've just pushed it to userspace to either online or 
offline at the time of hotplug.
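
For what it's worth, the userspace side of this policy is a one-line udev 
rule (or the equivalent initscript loop over /sys/devices/system/memory/), 
e.g.:

	SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"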

> If the config option itself is a problem (though I don't understand why)
> we can get rid of it making the default 'online' and keeping the command
> line parameter to disable it for cases when something goes wrong but why
> not leave an option for those who want it the other way around?
> 

That could break existing userspace that assumes the default is offline; 
if users are currently hotadding memory and then onlining it when needed 
rather than immediately, they break.  So that's not a possibility.

> Other than the above, let's imagine a 'unikernel' scenario when there
> are no initscripts and we're in a virtualized environment. We may want to
> have memory hotplug there too, but where would we put the 'onlining'
> logic? In every userspace we want to run? This doesn't sound right.
> 

Nobody is resisting hotplug notifiers.

