Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-11-20 Thread Michal Koutný
Hi.

On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt wrote:
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment.
The switch can happen only when a singular (i.e. root-only) hierarchy
exists. (Or it could if rebind_subsystems() waited until all memcgs are
completely freed.)

> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat.
So with the single memcg, it may be possible to reconstruct the
necessary counters in both directions from the statistics (or some
complementary ones, without removing the fine-grained counters).

I didn't check all the charging/uncharging places; these are just my two
cents on the issue.

Michal



Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-11-10 Thread Roman Gushchin
On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt wrote:
> On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin  wrote:
> >
> > On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > > If we want these counter to function properly, then we should go 
> > > > > > into the opposite
> > > > > > direction and remove the special handling of the root memory cgroup 
> > > > > > in many places.
> > > > >
> > > > > I suspect this is also by far the most robust solution from a code and
> > > > > maintenance POV.
> > > > >
> > > > > I don't recall the page counter at the root level having been a
> > > > > concern in recent years, even though it's widely used in production
> > > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > > batched, which means it isn't actually part of the memcg hotpath.
> > > >
> > > >
> > > > I agree.
> > > >
> > > > Here is my first attempt. Comments are welcome!
> > > >
> > > > It doesn't solve the original problem though (use_hierarchy == false and
> > > > objcg reparenting), I'll send a separate patch for that.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > >
> > > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > > From: Roman Gushchin 
> > > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > > >  specially
> > > >
> > > > Currently the root memory cgroup is treated in a special way:
> > > > it's not charged and uncharged directly (only indirectly with their
> > > > descendants), processes belonging to the root memory cgroup are exempt
> > > > from the kernel- and the socket memory accounting.
> > > >
> > > > At the same time some of root level statistics and data are available
> > > > to a user:
> > > >   - cgroup v2: memory.stat
> > > >   - cgroup v1: memory.stat, memory.usage_in_bytes, 
> > > > memory.memsw.usage_in_bytes,
> > > >memory.kmem.usage_in_bytes and 
> > > > memory.kmem.tcp.usage_in_bytes
> > > >
> > > > Historically the reason for a special treatment was an avoidance
> > > > of extra performance cost, however now it's unlikely a good reason:
> > > > over years there was a significant improvement in the performance
> > > > of the memory cgroup code. Also on a modern system actively using
> > > > cgroups (e.g. managed by systemd) there are usually no (significant)
> > > > processes in the root memory cgroup.
> > > >
> > > > The special treatment of the root memory cgroups creates a number of
> > > > issues visible to a user:
> > > > 1) slab stats on the root level do not include the slab memory
> > > >consumed by processes in the root memory cgroup
> > > > 2) non-slab kernel memory consumed by processes in the root memory 
> > > > cgroup
> > > >is not included into memory.kmem.usage_in_bytes
> > > > 3) socket memory consumed by processes in the root memory cgroup
> > > >is not included into memory.kmem.tcp.usage_in_bytes
> > > >
> > > > It complicates the code and increases a risk of new bugs.
> > > >
> > > > This patch removes a number of exceptions related to the handling of
> > > > the root memory cgroup. With this patch applied the root memory cgroup
> > > > is treated uniformly to other cgroups in the following cases:
> > > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > > >and cancel_charge() do not return immediately if the root memory
> > > >cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > > >do not handle the root memory cgroup specially.
> > > > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > > >cgroup
> > > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > > >by processes belonging to the root memory cgroup
> > > > 5) if a socket is opened by a process in the root memory cgroup,
> > > >the socket memory is accounted
> > > > 6) root cgroup is charged for the used swap memory.
> > > >
> > > > Signed-off-by: Roman Gushchin 
> > > > Suggested-by: Johannes Weiner 
> > >
> > > This looks great.
> > >
> > > The try_charge(), cancel_charge() etc. paths are relatively
> > > straight-forward and look correct to me.
> > >
> > > The swap counters too.
> > >
> > > Slab is a bit trickier, but it also looks correct to me.
> > >
> > > I'm having some trouble with the shrinkers. Currently, tracked objects
> > > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > > the root cgroup, as well as untracked objects, live in a global pool.
> > > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > > case the root_mem_cgroup and redirect to the global pool.
> > >
> 

Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-11-10 Thread Shakeel Butt
On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin  wrote:
>
> On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > If we want these counter to function properly, then we should go into 
> > > > > the opposite
> > > > > direction and remove the special handling of the root memory cgroup 
> > > > > in many places.
> > > >
> > > > I suspect this is also by far the most robust solution from a code and
> > > > maintenance POV.
> > > >
> > > > I don't recall the page counter at the root level having been a
> > > > concern in recent years, even though it's widely used in production
> > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > batched, which means it isn't actually part of the memcg hotpath.
> > >
> > >
> > > I agree.
> > >
> > > Here is my first attempt. Comments are welcome!
> > >
> > > It doesn't solve the original problem though (use_hierarchy == false and
> > > objcg reparenting), I'll send a separate patch for that.
> > >
> > > Thanks!
> > >
> > > --
> > >
> > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > From: Roman Gushchin 
> > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > >  specially
> > >
> > > Currently the root memory cgroup is treated in a special way:
> > > it's not charged and uncharged directly (only indirectly with their
> > > descendants), processes belonging to the root memory cgroup are exempt
> > > from the kernel- and the socket memory accounting.
> > >
> > > At the same time some of root level statistics and data are available
> > > to a user:
> > >   - cgroup v2: memory.stat
> > >   - cgroup v1: memory.stat, memory.usage_in_bytes, 
> > > memory.memsw.usage_in_bytes,
> > >memory.kmem.usage_in_bytes and 
> > > memory.kmem.tcp.usage_in_bytes
> > >
> > > Historically the reason for a special treatment was an avoidance
> > > of extra performance cost, however now it's unlikely a good reason:
> > > over years there was a significant improvement in the performance
> > > of the memory cgroup code. Also on a modern system actively using
> > > cgroups (e.g. managed by systemd) there are usually no (significant)
> > > processes in the root memory cgroup.
> > >
> > > The special treatment of the root memory cgroups creates a number of
> > > issues visible to a user:
> > > 1) slab stats on the root level do not include the slab memory
> > >consumed by processes in the root memory cgroup
> > > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > >is not included into memory.kmem.usage_in_bytes
> > > 3) socket memory consumed by processes in the root memory cgroup
> > >is not included into memory.kmem.tcp.usage_in_bytes
> > >
> > > It complicates the code and increases a risk of new bugs.
> > >
> > > This patch removes a number of exceptions related to the handling of
> > > the root memory cgroup. With this patch applied the root memory cgroup
> > > is treated uniformly to other cgroups in the following cases:
> > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > >and cancel_charge() do not return immediately if the root memory
> > >cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > >do not handle the root memory cgroup specially.
> > > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > >cgroup
> > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > >by processes belonging to the root memory cgroup
> > > 5) if a socket is opened by a process in the root memory cgroup,
> > >the socket memory is accounted
> > > 6) root cgroup is charged for the used swap memory.
> > >
> > > Signed-off-by: Roman Gushchin 
> > > Suggested-by: Johannes Weiner 
> >
> > This looks great.
> >
> > The try_charge(), cancel_charge() etc. paths are relatively
> > straight-forward and look correct to me.
> >
> > The swap counters too.
> >
> > Slab is a bit trickier, but it also looks correct to me.
> >
> > I'm having some trouble with the shrinkers. Currently, tracked objects
> > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > the root cgroup, as well as untracked objects, live in a global pool.
> > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > case the root_mem_cgroup and redirect to the global pool.
> >
> > After your patch we have tracked objects allocated in the root cgroup
> > actually live in the root cgroup. Removing the shrinker special case
> > is correct in order to shrink those - but it removes the call to
> > shrink the global pool of untracked allocation classes.
> 

Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-11-09 Thread Roman Gushchin
On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > If we want these counter to function properly, then we should go into 
> > > > the opposite
> > > > direction and remove the special handling of the root memory cgroup in 
> > > > many places.
> > > 
> > > I suspect this is also by far the most robust solution from a code and
> > > maintenance POV.
> > > 
> > > I don't recall the page counter at the root level having been a
> > > concern in recent years, even though it's widely used in production
> > > environments. It's lockless and cache compact. It's also per-cpu
> > > batched, which means it isn't actually part of the memcg hotpath.
> > 
> > 
> > I agree.
> > 
> > Here is my first attempt. Comments are welcome!
> > 
> > It doesn't solve the original problem though (use_hierarchy == false and
> > objcg reparenting), I'll send a separate patch for that.
> > 
> > Thanks!
> > 
> > --
> > 
> > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin 
> > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> >  specially
> > 
> > Currently the root memory cgroup is treated in a special way:
> > it's not charged and uncharged directly (only indirectly with their
> > descendants), processes belonging to the root memory cgroup are exempt
> > from the kernel- and the socket memory accounting.
> > 
> > At the same time some of root level statistics and data are available
> > to a user:
> >   - cgroup v2: memory.stat
> >   - cgroup v1: memory.stat, memory.usage_in_bytes, 
> > memory.memsw.usage_in_bytes,
> >memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > 
> > Historically the reason for a special treatment was an avoidance
> > of extra performance cost, however now it's unlikely a good reason:
> > over years there was a significant improvement in the performance
> > of the memory cgroup code. Also on a modern system actively using
> > cgroups (e.g. managed by systemd) there are usually no (significant)
> > processes in the root memory cgroup.
> > 
> > The special treatment of the root memory cgroups creates a number of
> > issues visible to a user:
> > 1) slab stats on the root level do not include the slab memory
> >consumed by processes in the root memory cgroup
> > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> >is not included into memory.kmem.usage_in_bytes
> > 3) socket memory consumed by processes in the root memory cgroup
> >is not included into memory.kmem.tcp.usage_in_bytes
> > 
> > It complicates the code and increases a risk of new bugs.
> > 
> > This patch removes a number of exceptions related to the handling of
> > the root memory cgroup. With this patch applied the root memory cgroup
> > is treated uniformly to other cgroups in the following cases:
> > 1) root memory cgroup is charged and uncharged directly, try_charge()
> >and cancel_charge() do not return immediately if the root memory
> >cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> >do not handle the root memory cgroup specially.
> > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > 3) shrinkers infra treats the root memory cgroup as any other memory
> >cgroup
> > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> >by processes belonging to the root memory cgroup
> > 5) if a socket is opened by a process in the root memory cgroup,
> >the socket memory is accounted
> > 6) root cgroup is charged for the used swap memory.
> > 
> > Signed-off-by: Roman Gushchin 
> > Suggested-by: Johannes Weiner 
> 
> This looks great.
> 
> The try_charge(), cancel_charge() etc. paths are relatively
> straight-forward and look correct to me.
> 
> The swap counters too.
> 
> Slab is a bit trickier, but it also looks correct to me.
> 
> I'm having some trouble with the shrinkers. Currently, tracked objects
> allocated in non-root cgroups live in that cgroup. Tracked objects in
> the root cgroup, as well as untracked objects, live in a global pool.
> When reclaim iterates all memcgs and calls shrink_slab(), we special
> case the root_mem_cgroup and redirect to the global pool.
> 
> After your patch we have tracked objects allocated in the root cgroup
> actually live in the root cgroup. Removing the shrinker special case
> is correct in order to shrink those - but it removes the call to
> shrink the global pool of untracked allocation classes.
> 
> I think we need to restore the double call to shrink_slab() we had
> prior to this:
> 
> commit aeed1d325d429ac9699c4bf62d17156d60905519
> Author: Vladimir Davydov 
> Date:   Fri Aug 17 15:48:17 2018 -0700
> 
> mm/vmscan.c: generalize 

Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-11-03 Thread Roman Gushchin
On Tue, Nov 03, 2020 at 02:22:21PM +0100, Michal Hocko wrote:
> Hi,
> I am sorry but I was quite busy at the time this has been discussed.
> Now that I got to this thread finally I am wondering whether this
> resp. other root cgroup exceptions have any follow up.

I'll address the feedback I got from Johannes and will post an updated
version soon.

> 
> I have seen the issue is fixed in Linus tree by 8de15e920dc8 ("mm:
> memcg: link page counters to root if use_hierarchy is false"). Thanks
> for taking care of that! I do agree that this approach is the most
> reasonable immediate fix.

Thanks!

> 
> Btw. we have been carrying a warning about non hierarchical hierarchies
> for years in our distribution kernels and haven't heard of any actual
> bug reports (except for LTP driven ones). So we might be really close to
> simply drop this functionality completely. This would simplify the code
> and prevent from future surprises.

Just sent an rfc patchset. Your feedback will be greatly appreciated!

Thanks!


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-11-03 Thread Michal Hocko
Hi,
I am sorry, but I was quite busy at the time this was discussed.
Now that I have finally got to this thread, I am wondering whether this
and the other root cgroup exceptions have any follow-up.

I have seen the issue is fixed in Linus' tree by 8de15e920dc8 ("mm:
memcg: link page counters to root if use_hierarchy is false"). Thanks
for taking care of that! I do agree that this approach is the most
reasonable immediate fix.

Btw. we have been carrying a warning about non-hierarchical hierarchies
for years in our distribution kernels and haven't heard of any actual
bug reports (except for LTP-driven ones). So we might be really close to
simply dropping this functionality completely. This would simplify the
code and prevent future surprises.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-23 Thread Johannes Weiner
On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > If we want these counter to function properly, then we should go into the 
> > > opposite
> > > direction and remove the special handling of the root memory cgroup in 
> > > many places.
> > 
> > I suspect this is also by far the most robust solution from a code and
> > maintenance POV.
> > 
> > I don't recall the page counter at the root level having been a
> > concern in recent years, even though it's widely used in production
> > environments. It's lockless and cache compact. It's also per-cpu
> > batched, which means it isn't actually part of the memcg hotpath.
> 
> 
> I agree.
> 
> Here is my first attempt. Comments are welcome!
> 
> It doesn't solve the original problem though (use_hierarchy == false and
> objcg reparenting), I'll send a separate patch for that.
> 
> Thanks!
> 
> --
> 
> From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin 
> Date: Tue, 20 Oct 2020 18:05:43 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
>  specially
> 
> Currently the root memory cgroup is treated in a special way:
> it's not charged and uncharged directly (only indirectly with their
> descendants), processes belonging to the root memory cgroup are exempt
> from the kernel- and the socket memory accounting.
> 
> At the same time some of root level statistics and data are available
> to a user:
>   - cgroup v2: memory.stat
>   - cgroup v1: memory.stat, memory.usage_in_bytes, 
> memory.memsw.usage_in_bytes,
>memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> 
> Historically the reason for a special treatment was an avoidance
> of extra performance cost, however now it's unlikely a good reason:
> over years there was a significant improvement in the performance
> of the memory cgroup code. Also on a modern system actively using
> cgroups (e.g. managed by systemd) there are usually no (significant)
> processes in the root memory cgroup.
> 
> The special treatment of the root memory cgroups creates a number of
> issues visible to a user:
> 1) slab stats on the root level do not include the slab memory
>consumed by processes in the root memory cgroup
> 2) non-slab kernel memory consumed by processes in the root memory cgroup
>is not included into memory.kmem.usage_in_bytes
> 3) socket memory consumed by processes in the root memory cgroup
>is not included into memory.kmem.tcp.usage_in_bytes
> 
> It complicates the code and increases a risk of new bugs.
> 
> This patch removes a number of exceptions related to the handling of
> the root memory cgroup. With this patch applied the root memory cgroup
> is treated uniformly to other cgroups in the following cases:
> 1) root memory cgroup is charged and uncharged directly, try_charge()
>and cancel_charge() do not return immediately if the root memory
>cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
>do not handle the root memory cgroup specially.
> 2) per-memcg slab statistics is gathered for the root memory cgroup
> 3) shrinkers infra treats the root memory cgroup as any other memory
>cgroup
> 4) non-slab kernel memory accounting doesn't exclude pages allocated
>by processes belonging to the root memory cgroup
> 5) if a socket is opened by a process in the root memory cgroup,
>the socket memory is accounted
> 6) root cgroup is charged for the used swap memory.
> 
> Signed-off-by: Roman Gushchin 
> Suggested-by: Johannes Weiner 

This looks great.

The try_charge(), cancel_charge() etc. paths are relatively
straightforward and look correct to me.

The swap counters too.

Slab is a bit trickier, but it also looks correct to me.

I'm having some trouble with the shrinkers. Currently, tracked objects
allocated in non-root cgroups live in that cgroup. Tracked objects in
the root cgroup, as well as untracked objects, live in a global pool.
When reclaim iterates all memcgs and calls shrink_slab(), we special
case the root_mem_cgroup and redirect to the global pool.

After your patch, tracked objects allocated in the root cgroup
actually do live in the root cgroup. Removing the shrinker special case
is correct in order to shrink those - but it removes the call to
shrink the global pool of untracked allocation classes.

I think we need to restore the double call to shrink_slab() we had
prior to this:

commit aeed1d325d429ac9699c4bf62d17156d60905519
Author: Vladimir Davydov 
Date:   Fri Aug 17 15:48:17 2018 -0700

mm/vmscan.c: generalize shrink_slab() calls in shrink_node()

The patch makes shrink_slab() be called for root_mem_cgroup in the same
way as it's called for the rest of cgroups.  This simplifies the logic
and improves the readability.

where we iterate through all cgroups, including the root, to 

Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-21 Thread Roman Gushchin
On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > If we want these counter to function properly, then we should go into the 
> > opposite
> > direction and remove the special handling of the root memory cgroup in many 
> > places.
> 
> I suspect this is also by far the most robust solution from a code and
> maintenance POV.
> 
> I don't recall the page counter at the root level having been a
> concern in recent years, even though it's widely used in production
> environments. It's lockless and cache compact. It's also per-cpu
> batched, which means it isn't actually part of the memcg hotpath.


I agree.

Here is my first attempt. Comments are welcome!

It doesn't solve the original problem though (use_hierarchy == false and
objcg reparenting), I'll send a separate patch for that.

Thanks!

--

From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
From: Roman Gushchin 
Date: Tue, 20 Oct 2020 18:05:43 -0700
Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
 specially

Currently the root memory cgroup is treated in a special way:
it's not charged and uncharged directly (only indirectly via its
descendants), and processes belonging to the root memory cgroup are
exempt from kernel and socket memory accounting.

At the same time some of root level statistics and data are available
to a user:
  - cgroup v2: memory.stat
  - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
   memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes

Historically the reason for the special treatment was to avoid extra
performance cost, but that is no longer a good reason: over the years
the performance of the memory cgroup code has improved significantly.
Also, on a modern system actively using cgroups (e.g. managed by
systemd) there are usually no (significant) processes in the root
memory cgroup.

The special treatment of the root memory cgroup creates a number of
issues visible to a user:
1) slab stats on the root level do not include the slab memory
   consumed by processes in the root memory cgroup
2) non-slab kernel memory consumed by processes in the root memory cgroup
   is not included into memory.kmem.usage_in_bytes
3) socket memory consumed by processes in the root memory cgroup
   is not included into memory.kmem.tcp.usage_in_bytes

It complicates the code and increases the risk of new bugs.

This patch removes a number of exceptions related to the handling of
the root memory cgroup. With this patch applied, the root memory cgroup
is treated uniformly with other cgroups in the following cases:
1) the root memory cgroup is charged and uncharged directly; try_charge()
   and cancel_charge() do not return immediately if the root memory
   cgroup is passed; uncharge_batch() and __mem_cgroup_clear_mc()
   do not handle the root memory cgroup specially
2) per-memcg slab statistics are gathered for the root memory cgroup
3) the shrinker infrastructure treats the root memory cgroup as any
   other memory cgroup
4) non-slab kernel memory accounting doesn't exclude pages allocated
   by processes belonging to the root memory cgroup
5) if a socket is opened by a process in the root memory cgroup,
   the socket memory is accounted
6) the root cgroup is charged for the used swap memory.

Signed-off-by: Roman Gushchin 
Suggested-by: Johannes Weiner 
---
 include/linux/memcontrol.h |  3 +-
 mm/memcontrol.c| 82 ++
 mm/vmscan.c|  9 +
 3 files changed, 31 insertions(+), 63 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e391e3c56de5..d3653eb5d1b2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -416,8 +416,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
 {
/*
-* The root memcg doesn't account charges, and doesn't support
-* protection.
+* The root memcg doesn't support memory protection.
 */
return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..a8bdca0f58f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -438,9 +438,6 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
struct memcg_shrinker_map *map;
int nid;
 
-   if (mem_cgroup_is_root(memcg))
-   return;
-
for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
map = rcu_dereference_protected(pn->shrinker_map, true);
@@ -455,9 +452,6 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
struct memcg_shrinker_map *map;
int nid, size, ret = 0;
 
-   if (mem_cgroup_is_root(memcg))
-   return 0;
-

Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Johannes Weiner
On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> If we want these counters to function properly, then we should go into
> the opposite direction and remove the special handling of the root
> memory cgroup in many places.

I suspect this is also by far the most robust solution from a code and
maintenance POV.

I don't recall the page counter at the root level having been a
concern in recent years, even though it's widely used in production
environments. It's lockless and cache compact. It's also per-cpu
batched, which means it isn't actually part of the memcg hotpath.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Roman Gushchin
On Tue, Oct 20, 2020 at 09:55:38AM -0700, Shakeel Butt wrote:
> On Mon, Oct 19, 2020 at 3:28 PM Roman Gushchin  wrote:
> >
> > On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> > > On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> > > > The central try_charge() function charges recursively all the way up
> > > > to and including the root.
> > > Except for use_hiearchy=0 (which is the case here as Richard
> > > wrote). The reparenting is hence somewhat incompatible with
> > > new_parent.use_hiearchy=0 :-/
> > >
> > > > We should clean this up one way or another: either charge the root or
> > > > don't, but do it consistently.
> > > I agree this'd be good to unify. One upside of excluding root memcg from
> > > charging is that users are spared from the charging overhead when memcg
> > > tree is not created.  (Actually, I thought that was the reason for this
> > > exception.)
> >
> > Yeah, I'm completely on the same page. Moving a process to the root memory
> > cgroup is currently a good way to estimate the memory cgroup overhead.
> >
> > How about the patch below, which consistently avoids charging the root
> > memory cgroup? It seems like it doesn't add too many checks.
> >
> > Thanks!
> >
> > --
> >
> > From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin 
> > Date: Mon, 19 Oct 2020 14:37:35 -0700
> > Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
> >
> > Currently the root memory cgroup is never charged directly, but
> > if an ancestor cgroup is charged, the charge is propagated up to the
> > root memory cgroup. The root memory cgroup doesn't show the charge
> > to a user, neither it does allow to set any limits/protections.
> > So the information about the current charge is completely useless.
> >
> > Avoiding to charge the root memory cgroup allows to:
> > 1) simplify the model and the code, so, hopefully, fewer bugs will
> >be introduced in the future;
> > 2) avoid unnecessary atomic operations, which are used to (un)charge
> >corresponding root page counters.
> >
> > In the default hierarchy case or if use_hiearchy == true, it's very
> > straightforward: when the page counters tree is traversed to the root,
> > the root page counter (the one with parent == NULL), should be
> > skipped. To avoid multiple identical checks over the page counters
> > code, for_each_nonroot_ancestor() macro is introduced.
> >
> > To handle the use_hierarchy == false case without adding custom
> > checks, let's make page counters of all non-root memory cgroup
> > direct ascendants of the corresponding root memory cgroup's page
> > counters. In this case for_each_nonroot_ancestor() will work correctly
> > as well.
> >
> > Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> > However, it's not based on page counters (refer to mem_cgroup_usage()).
> >
> > Signed-off-by: Roman Gushchin 
> 
> This patch is only doing the page counter part of the cleanup (i.e. to
not update root's page counters when a descendant's page counter is
> updated) but not the stats part.
> 
> For the user memory, we do update the stats for the root memcg and do
> set page->mem_cgroup = root_mem_cgroup. However this is not done for
the kmem/obj. I thought this is what Johannes was asking for in the
cleanup.

Yes, it's not the whole story, of course.

Actually I missed that we do export root kmem and tcpmem counters
on cgroup v1 (thanks, Michal, for pointing it out!). If we want them to function
properly, we have to go in the opposite direction and start charging
the root cgroup for all kinds of kernel memory allocations.

We also have the same problem with root MEMCG_SOCK stats, which seems
to be broken now.

I'll master a patch.

Thanks!


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Roman Gushchin
On Tue, Oct 20, 2020 at 06:27:14PM +0200, Michal Koutny wrote:
> Hi.
> 
> On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin  wrote:
> > Currently the root memory cgroup is never charged directly, but
> > if an ancestor cgroup is charged, the charge is propagated up to the
> s/ancestor/descendant/

Oops, will fix, thanks!

> 
> > The root memory cgroup doesn't show the charge to a user, neither it
> > does allow to set any limits/protections.
> An appealing claim, I'd like this to be true...
> 
> > Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> > However, it's not based on page counters (refer to mem_cgroup_usage()).
> ...and it almost is. But there are still exposed kmem and tcpmem counters.

Hm, I wonder what they show, given that we never set sk->sk_memcg
to the root_mem_cgroup (see mem_cgroup_sk_alloc()) and we never charge
the root_mem_cgroup for !slab kmem allocations (see __memcg_kmem_charge_page()).

So yeah, it's quite a mess now, and it looks like it has been broken
in multiple places and for a while.

If we want these counters to function properly, then we should go in the
opposite direction and remove the special handling of the root memory cgroup in many places.

> > To avoid multiple identical checks over the page counters
> > code, for_each_nonroot_ancestor() macro is introduced.
> If the assumptions behind this patch's idea were true, I think the
> implementation would be simpler by merely (not)connecting the root
> counters and keep the traversal as is.

We use some fields in root page counters to calculate protections:
see propagate_protected_usage().

Thanks!


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Shakeel Butt
On Mon, Oct 19, 2020 at 3:28 PM Roman Gushchin  wrote:
>
> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> > On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> > > The central try_charge() function charges recursively all the way up
> > > to and including the root.
> > Except for use_hiearchy=0 (which is the case here as Richard
> > wrote). The reparenting is hence somewhat incompatible with
> > new_parent.use_hiearchy=0 :-/
> >
> > > We should clean this up one way or another: either charge the root or
> > > don't, but do it consistently.
> > I agree this'd be good to unify. One upside of excluding root memcg from
> > charging is that users are spared from the charging overhead when memcg
> > tree is not created.  (Actually, I thought that was the reason for this
> > exception.)
>
> Yeah, I'm completely on the same page. Moving a process to the root memory
> cgroup is currently a good way to estimate the memory cgroup overhead.
>
> How about the patch below, which consistently avoids charging the root
> memory cgroup? It seems like it doesn't add too many checks.
>
> Thanks!
>
> --
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin 
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin 

This patch is only doing the page counter part of the cleanup (i.e. to
not update root's page counters when a descendant's page counter is
updated) but not the stats part.

For the user memory, we do update the stats for the root memcg and do
set page->mem_cgroup = root_mem_cgroup. However this is not done for
the kmem/obj. I thought this is what Johannes was asking for in the
cleanup.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Michal Koutný
Hi.

On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin  wrote:
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
s/ancestor/descendant/

> The root memory cgroup doesn't show the charge to a user, neither it
> does allow to set any limits/protections.
An appealing claim, I'd like this to be true...

> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
...and it almost is. But there are still exposed kmem and tcpmem counters.


> To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
If the assumptions behind this patch's idea were true, I think the
implementation would be simpler by merely (not)connecting the root
counters and keep the traversal as is.

> direct ascendants of the corresponding root memory cgroup's page
s/asc/desc/ ;-)

Michal




Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Richard Palethorpe
Hello Roman,

Roman Gushchin  writes:

> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin 
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin 
> ---
>  mm/memcontrol.c   | 21 -
>  mm/page_counter.c | 21 -
>  2 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2636f8bad908..34cac7522e74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>   memcg->swappiness = mem_cgroup_swappiness(parent);
>   memcg->oom_kill_disable = parent->oom_kill_disable;
>   }
> - if (parent && parent->use_hierarchy) {
> + if (!parent) {
> + /* root memory cgroup */
> + page_counter_init(&memcg->memory, NULL);
> + page_counter_init(&memcg->swap, NULL);
> + page_counter_init(&memcg->kmem, NULL);
> + page_counter_init(&memcg->tcpmem, NULL);
> + } else if (parent->use_hierarchy) {
>   memcg->use_hierarchy = true;
>   page_counter_init(&memcg->memory, &parent->memory);
>   page_counter_init(&memcg->swap, &parent->swap);
>   page_counter_init(&memcg->kmem, &parent->kmem);
>   page_counter_init(&memcg->tcpmem, &parent->tcpmem);
>   } else {
> - page_counter_init(&memcg->memory, NULL);
> - page_counter_init(&memcg->swap, NULL);
> - page_counter_init(&memcg->kmem, NULL);
> - page_counter_init(&memcg->tcpmem, NULL);
> + /*
> +  * If use_hierarchy == false, consider all page counters direct
> +  * descendants of the corresponding root level counters.
> +  */
> + page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
> + page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
> + page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
> + page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
> +
>   /*
>* Deeper hierachy with use_hierarchy == false doesn't make
>* much sense so let cgroup subsystem know about this
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b24a60b28bb0..8901b184b9d5 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -13,6 +13,9 @@
>  #include 
>  #include 
>  
> +#define for_each_nonroot_ancestor(c, counter) \
> + for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
> +
>  static void propagate_protected_usage(struct page_counter *c,
> unsigned long usage)
>  {
> @@ -20,9 +23,6 @@ static void propagate_protected_usage(struct page_counter *c,
>   unsigned long low, min;
>   long delta;
>  
> - if (!c->parent)
> - return;
> -
>   min = READ_ONCE(c->min);
>   if (min || atomic_long_read(&c->min_usage)) {
>   protected = min(usage, min);
> @@ -68,7 +68,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  {
>   struct page_counter *c;
>  
> - for (c = counter; c; c = c->parent) {
> + for_each_nonroot_ancestor(c, counter) {
>   long new;
>  
>   new = atomic_long_add_return(nr_pages, &c->usage);
> @@ -97,7 +97,7 @@ bool page_counter_try_charge(struct page_counter *counter,
>  {
>   struct page_counter *c;
>  
> - for (c = counter; c; c = c->parent) {
> + 

Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Richard Palethorpe
Hello,

Richard Palethorpe  writes:

> Hello Roman,
>
> Roman Gushchin  writes:
>
>> -page_counter_init(&memcg->memory, NULL);
>> -page_counter_init(&memcg->swap, NULL);
>> -page_counter_init(&memcg->kmem, NULL);
>> -page_counter_init(&memcg->tcpmem, NULL);
>> +/*
>> + * If use_hierarchy == false, consider all page counters direct
>> + * descendants of the corresponding root level counters.
>> + */
>> +page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
>> +page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
>> +page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
>> +page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
>> +
>>  /*
>>   * Deeper hierachy with use_hierarchy == false doesn't make
>>   * much sense so let cgroup subsystem know about this
>
> Perhaps in this case, where the hierarchy is broken, objcgs should also
> be reparented directly to root? Otherwise it will still be possible to
> underflow the counter in a descendant of root which has use_hierarchy=0,
> but also has children.

Sorry ignore me, parent_mem_cgroup already selects root. So in the case
of a broken hierarchy objcgs are reparented directly to root.

-- 
Thank you,
Richard.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-20 Thread Richard Palethorpe
Hello Roman,

Roman Gushchin  writes:

> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
>> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
>> > The central try_charge() function charges recursively all the way up
>> > to and including the root.
>> Except for use_hiearchy=0 (which is the case here as Richard
>> wrote). The reparenting is hence somewhat incompatible with
>> new_parent.use_hiearchy=0 :-/
>> 
>> > We should clean this up one way or another: either charge the root or
>> > don't, but do it consistently.
>> I agree this'd be good to unify. One upside of excluding root memcg from
>> charging is that users are spared from the charging overhead when memcg
>> tree is not created.  (Actually, I thought that was the reason for this
>> exception.)
>
> Yeah, I'm completely on the same page. Moving a process to the root memory
> cgroup is currently a good way to estimate the memory cgroup overhead.
>
> How about the patch below, which consistently avoids charging the root
> memory cgroup? It seems like it doesn't add too many checks.
>
> Thanks!
>
> --
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin 
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin 
> ---
>  mm/memcontrol.c   | 21 -
>  mm/page_counter.c | 21 -
>  2 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2636f8bad908..34cac7522e74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>   memcg->swappiness = mem_cgroup_swappiness(parent);
>   memcg->oom_kill_disable = parent->oom_kill_disable;
>   }
> - if (parent && parent->use_hierarchy) {
> + if (!parent) {
> + /* root memory cgroup */
> + page_counter_init(&memcg->memory, NULL);
> + page_counter_init(&memcg->swap, NULL);
> + page_counter_init(&memcg->kmem, NULL);
> + page_counter_init(&memcg->tcpmem, NULL);
> + } else if (parent->use_hierarchy) {
>   memcg->use_hierarchy = true;
>   page_counter_init(&memcg->memory, &parent->memory);
>   page_counter_init(&memcg->swap, &parent->swap);
>   page_counter_init(&memcg->kmem, &parent->kmem);
>   page_counter_init(&memcg->tcpmem, &parent->tcpmem);
>   } else {
> - page_counter_init(&memcg->memory, NULL);
> - page_counter_init(&memcg->swap, NULL);
> - page_counter_init(&memcg->kmem, NULL);
> - page_counter_init(&memcg->tcpmem, NULL);
> + /*
> +  * If use_hierarchy == false, consider all page counters direct
> +  * descendants of the corresponding root level counters.
> +  */
> + page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
> + page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
> + page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
> + page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
> +
>   /*
>* Deeper hierachy with use_hierarchy == false doesn't make
>* much sense so let cgroup subsystem know about this

Perhaps in this case, where the hierarchy is broken, objcgs should also
be reparented directly to root? Otherwise it will still be possible to
underflow the counter in a descendant of root which has use_hierarchy=0,
but also has children.

> diff --git a/mm/page_counter.c b/mm/page_counter.c
> 

Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-19 Thread Roman Gushchin
On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> > The central try_charge() function charges recursively all the way up
> > to and including the root.
> Except for use_hiearchy=0 (which is the case here as Richard
> wrote). The reparenting is hence somewhat incompatible with
> new_parent.use_hiearchy=0 :-/
> 
> > We should clean this up one way or another: either charge the root or
> > don't, but do it consistently.
> I agree this'd be good to unify. One upside of excluding root memcg from
> charging is that users are spared from the charging overhead when memcg
> tree is not created.  (Actually, I thought that was the reason for this
> exception.)

Yeah, I'm completely on the same page. Moving a process to the root memory
cgroup is currently a good way to estimate the memory cgroup overhead.

How about the patch below, which consistently avoids charging the root
memory cgroup? It seems like it doesn't add too many checks.

Thanks!

--

From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
From: Roman Gushchin 
Date: Mon, 19 Oct 2020 14:37:35 -0700
Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup

Currently the root memory cgroup is never charged directly, but
if an ancestor cgroup is charged, the charge is propagated up to the
root memory cgroup. The root memory cgroup doesn't show the charge
to a user, neither it does allow to set any limits/protections.
So the information about the current charge is completely useless.

Avoiding to charge the root memory cgroup allows to:
1) simplify the model and the code, so, hopefully, fewer bugs will
   be introduced in the future;
2) avoid unnecessary atomic operations, which are used to (un)charge
   corresponding root page counters.

In the default hierarchy case or if use_hiearchy == true, it's very
straightforward: when the page counters tree is traversed to the root,
the root page counter (the one with parent == NULL), should be
skipped. To avoid multiple identical checks over the page counters
code, for_each_nonroot_ancestor() macro is introduced.

To handle the use_hierarchy == false case without adding custom
checks, let's make page counters of all non-root memory cgroup
direct ascendants of the corresponding root memory cgroup's page
counters. In this case for_each_nonroot_ancestor() will work correctly
as well.

Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
However, it's not based on page counters (refer to mem_cgroup_usage()).

Signed-off-by: Roman Gushchin 
---
 mm/memcontrol.c   | 21 -
 mm/page_counter.c | 21 -
 2 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..34cac7522e74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
}
-   if (parent && parent->use_hierarchy) {
+   if (!parent) {
+   /* root memory cgroup */
+   page_counter_init(&memcg->memory, NULL);
+   page_counter_init(&memcg->swap, NULL);
+   page_counter_init(&memcg->kmem, NULL);
+   page_counter_init(&memcg->tcpmem, NULL);
+   } else if (parent->use_hierarchy) {
memcg->use_hierarchy = true;
page_counter_init(&memcg->memory, &parent->memory);
page_counter_init(&memcg->swap, &parent->swap);
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
} else {
-   page_counter_init(&memcg->memory, NULL);
-   page_counter_init(&memcg->swap, NULL);
-   page_counter_init(&memcg->kmem, NULL);
-   page_counter_init(&memcg->tcpmem, NULL);
+   /*
+* If use_hierarchy == false, consider all page counters direct
+* descendants of the corresponding root level counters.
+*/
+   page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+   page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+   page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+   page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
+
/*
 * Deeper hierachy with use_hierarchy == false doesn't make
 * much sense so let cgroup subsystem know about this
diff --git a/mm/page_counter.c b/mm/page_counter.c
index b24a60b28bb0..8901b184b9d5 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 
+#define for_each_nonroot_ancestor(c, counter) \
+   for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
+
 static void propagate_protected_usage(struct page_counter *c,
  unsigned long usage)
 {
@@ -20,9 +23,6 @@ 

Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-19 Thread Richard Palethorpe
Hello,

Michal Koutný  writes:

> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
>> The central try_charge() function charges recursively all the way up
>> to and including the root.
> Except for use_hiearchy=0 (which is the case here as Richard
> wrote). The reparenting is hence somewhat incompatible with
> new_parent.use_hiearchy=0 :-/
>

Yes, and it also seems

new_parent.use_hierarchy=0 -> new_child.use_hierarchy=0

and

new_parent.use_hierarchy=0 -> new_child.use_hierarchy=1

are considered valid on cgroups v1. The kernel will also allow more
descendants on new_child.use_hierarchy=0, but sets
broken_hierarchy=1. However this will not stop the stack trace occurring
(AFAICT) when the reparenting happens between two descendants.

>> We should clean this up one way or another: either charge the root or
>> don't, but do it consistently.
> I agree this'd be good to unify. One upside of excluding root memcg from
> charging is that users are spared from the charging overhead when memcg
> tree is not created.  (Actually, I thought that was the reason for this
> exception.)
>
> Michal


-- 
Thank you,
Richard.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Michal Koutný
On Fri, Oct 16, 2020 at 04:05:21PM +0100, Richard Palethorpe wrote:
> I don't know if that could happen without reparenting. I suppose if
> use_hierarchy=1 then actually this patch will result in root being
> overcharged, so perhaps it should also check for use_hierarchy?
Right, you'd need to distinguish whether the uncharged objcg was
originally (not)charged in the root memcg or it was only reparented to
it. (I originally considered only the genuine root objcgs.)

In this light, homogeneous charging to the root memcg looks much simpler, but
I wonder what other surprises it brings about.

Michal




Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Michal Koutný
On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> The central try_charge() function charges recursively all the way up
> to and including the root.
Except for use_hiearchy=0 (which is the case here as Richard
wrote). The reparenting is hence somewhat incompatible with
new_parent.use_hiearchy=0 :-/

> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.
I agree this'd be good to unify. One upside of excluding root memcg from
charging is that users are spared from the charging overhead when memcg
tree is not created.  (Actually, I thought that was the reason for this
exception.)

Michal




Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Roman Gushchin
On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> On Fri, Oct 16, 2020 at 11:47:02AM +0200, Michal Koutný wrote:
> > Hello.
> > 
> > On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe wrote:
> > > SLAB objects which outlive their memcg are moved to their parent
> > > memcg where they may be uncharged. However if they are moved to the
> > > root memcg, uncharging will result in negative page counter values as
> > > root has no page counters.
> > Why do you think those are reparented objects? If those are originally
> > charged in a non-root cgroup, then the charge value should be propagated up the
> > hierarchy, including root memcg, so if they're later uncharged in root
> > after reparenting, it should still break even. (Or did I miss some stock
> > imbalance?)
> 
> Looking a bit closer at this code, it's kind of a mess right now.
> 
> The central try_charge() function charges recursively all the way up
> to and including the root. But not if it's called directly on the
> root, in which case it bails and does nothing.
> 
> kmem and objcg use try_charge(), so they have the same
> behavior. get_obj_cgroup_from_current() does its own redundant
> filtering for root_mem_cgroup, whereas get_mem_cgroup_from_current()
> does not, but its callsite __memcg_kmem_charge_page() does.
> 
> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.

+1

> 
> Since we export memory.stat at the root now, we should probably just
> always charge the root instead of special-casing it all over the place
> and risking bugs.

Hm, we export memory.stat but not memory.current. Charging the root memcg
seems to be an extra atomic operation, which can be avoided.

I wonder if we can handle it in page_counter.c, so there will be a single
place where we do the check.

> 
> Indeed, it looks like there is at least one bug where the root-level
> memory.stat shows non-root slab objects, but not root ones, whereas it
> shows all anon and cache pages, root or no root.

I'll take a look.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Richard Palethorpe
Hello,

Richard Palethorpe  writes:

> Hello Michal,
>
> Michal Koutný  writes:
>
>> Hello.
>>
>> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe wrote:
>>> SLAB objects which outlive their memcg are moved to their parent
>>> memcg where they may be uncharged. However if they are moved to the
>>> root memcg, uncharging will result in negative page counter values as
>>> root has no page counters.
>> Why do you think those are reparented objects? If those are originally
>> charged in a non-root cgroup, then the charge value should be propagated up the
>> hierarchy, including root memcg, so if they're later uncharged in root
>> after reparenting, it should still break even. (Or did I miss some stock
>> imbalance?)
>
> I traced it and can see they are reparented objects and that the root
> group's counters are zero (or negative if I run madvise06 multiple times)
> before a drain takes place. I'm guessing this is because the root group
> has 'use_hierarchy' set to false so that the children's page_counter parents
> are set to NULL. However I will check, because I'm not sure about
> either.

Yes, it appears that use_hierarchy=0, which is probably because the test
mounts cgroup v1, creates a child group within that, and does not set
use_hierarchy on the root. On v2, root always has use_hierarchy enabled.

>
>>
>> (But the patch seems justifiable to me as objects (not)charged directly to
>> root memcg may be incorrectly uncharged.)
>>
>> Thanks,
>> Michal

I don't know if that could happen without reparenting. I suppose if
use_hierarchy=1 then actually this patch will result in root being
overcharged, so perhaps it should also check for use_hierarchy?

-- 
Thank you,
Richard.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Johannes Weiner
On Fri, Oct 16, 2020 at 11:47:02AM +0200, Michal Koutný wrote:
> Hello.
> 
> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe wrote:
> > SLAB objects which outlive their memcg are moved to their parent
> > memcg where they may be uncharged. However if they are moved to the
> > root memcg, uncharging will result in negative page counter values as
> > root has no page counters.
> Why do you think those are reparented objects? If those are originally
> charged in a non-root cgroup, then the charge value should be propagated up the
> hierarchy, including root memcg, so if they're later uncharged in root
> after reparenting, it should still break even. (Or did I miss some stock
> imbalance?)

Looking a bit closer at this code, it's kind of a mess right now.

The central try_charge() function charges recursively all the way up
to and including the root. But not if it's called directly on the
root, in which case it bails and does nothing.

kmem and objcg use try_charge(), so they have the same
behavior. get_obj_cgroup_from_current() does its own redundant
filtering for root_mem_cgroup, whereas get_mem_cgroup_from_current()
does not, but its callsite __memcg_kmem_charge_page() does.

We should clean this up one way or another: either charge the root or
don't, but do it consistently.

Since we export memory.stat at the root now, we should probably just
always charge the root instead of special-casing it all over the place
and risking bugs.

Indeed, it looks like there is at least one bug where the root-level
memory.stat shows non-root slab objects, but not root ones, whereas it
shows all anon and cache pages, root or no root.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Richard Palethorpe
Hello Michal,

Michal Koutný  writes:

> Hello.
>
> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe 
>  wrote:
>> SLAB objects which outlive their memcg are moved to their parent
>> memcg where they may be uncharged. However if they are moved to the
>> root memcg, uncharging will result in negative page counter values as
>> root has no page counters.
> Why do you think those are reparented objects? If those are originally
> charged in a non-root cgroup, then the charge value should be propagated up the
> hierarchy, including root memcg, so if they're later uncharged in root
> after reparenting, it should still break even. (Or did I miss some stock
> imbalance?)

I traced it and can see they are reparented objects and that the root
group's counters are zero (or negative if I run madvise06 multiple times)
before a drain takes place. I'm guessing this is because the root group
has 'use_hierarchy' set to false so that the children's page_counter parents
are set to NULL. However I will check, because I'm not sure about
either.

>
> (But the patch seems justifiable to me as objects (not)charged directly to
> root memcg may be incorrectly uncharged.)
>
> Thanks,
> Michal


-- 
Thank you,
Richard.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-16 Thread Michal Koutný
Hello.

On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe 
 wrote:
> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
Why do you think those are reparented objects? If those are originally
charged in a non-root cgroup, then the charge value should be propagated up the
hierarchy, including root memcg, so if they're later uncharged in root
after reparenting, it should still break even. (Or did I miss some stock
imbalance?)

(But the patch seems justifiable to me as objects (not)charged directly to
root memcg may be incorrectly uncharged.)

Thanks,
Michal




Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-15 Thread Richard Palethorpe
Hello Roman,

Roman Gushchin  writes:

> Hi Richard!
>
>> SLAB objects which outlive their memcg are moved to their parent
>> memcg where they may be uncharged. However if they are moved to the
>> root memcg, uncharging will result in negative page counter values as
>> root has no page counters.
>> 
>> To prevent this, we check whether we are about to uncharge the root
>> memcg and skip it if we are. Possibly instead; the obj_cgroups should
>> be removed from their slabs and any per cpu stocks instead of
>> reparenting them to root?
>
> It would be really complex. I think your fix is totally fine.
> We have similar checks in cancel_charge(), uncharge_batch(),
> mem_cgroup_swapout(), mem_cgroup_uncharge_swap() etc.
>
>
> Acked-by: Roman Gushchin 
>
> Thanks!

Great, I will respin.

-- 
Thank you,
Richard.


Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-14 Thread Roman Gushchin
Hi Richard!

> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
> 
> To prevent this, we check whether we are about to uncharge the root
> memcg and skip it if we are. Possibly instead; the obj_cgroups should
> be removed from their slabs and any per cpu stocks instead of
> reparenting them to root?

It would be really complex. I think your fix is totally fine.
We have similar checks in cancel_charge(), uncharge_batch(),
mem_cgroup_swapout(), mem_cgroup_uncharge_swap() etc.

> 
> The warning can be, unreliably, reproduced with the LTP test
> madvise06 if the entire patch series
> https://lore.kernel.org/linux-mm/20200623174037.3951353-1-g...@fb.com/
> is present. Although the listed commit in 'fixes' appears to introduce
> the bug, I can not reproduce it with just that commit and bisecting
> runs into other bugs.
> 
> [   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 
> page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 
> mm/page_counter.c:156)
> [   12.029539] Modules linked in:
> [   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 
> 5.9.0-rc7-22-default #76
> [   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 
> mm/page_counter.c:50 mm/page_counter.c:156)
> [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe 
> ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b 
> eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> [   12.030258] RSP: 0018:a5d8000efd08 EFLAGS: 00010086
> [   12.030344] RAX:  RBX:  RCX: 
> 0009
> [   12.030455] RDX: 000b RSI:  RDI: 
> 8ef8c7d2b248
> [   12.030561] RBP: 8ef8c7d2b248 R08: 8ef8c78b19c8 R09: 
> 0001
> [   12.030672] R10:  R11: 8ef8c780e0d0 R12: 
> 0001
> [   12.030784] R13:  R14: 8ef9478b19c8 R15: 
> 
> [   12.030895] FS:  () GS:8ef8fbc8() 
> knlGS:
> [   12.031017] CS:  0010 DS:  ES:  CR0: 80050033
> [   12.031104] CR2: 7f72c0af93ec CR3: 5c40a000 CR4: 
> 06e0
> [   12.031209] Call Trace:
> [   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 
> mm/memcontrol.c:3114)
> [   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> [   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 
> ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> [   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> [   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 
> kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> ...
> Reported-By: l...@lists.linux.it
> Signed-off-by: Richard Palethorpe 
> Cc: Johannes Weiner 
> Cc: Roman Gushchin 
> Cc: Andrew Morton 
> Cc: Shakeel Butt 
> Cc: Christoph Lameter 
> Cc: Michal Hocko 
> Cc: Tejun Heo 
> Cc: Vlastimil Babka 
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")

Acked-by: Roman Gushchin 

Thanks!


[RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root

2020-10-14 Thread Richard Palethorpe
SLAB objects which outlive their memcg are moved to their parent
memcg where they may be uncharged. However if they are moved to the
root memcg, uncharging will result in negative page counter values as
root has no page counters.

To prevent this, we check whether we are about to uncharge the root
memcg and skip it if we are. Alternatively, should the obj_cgroups be
removed from their slabs and any per-CPU stocks instead of being
reparented to root?

The warning can be, unreliably, reproduced with the LTP test
madvise06 if the entire patch series
https://lore.kernel.org/linux-mm/20200623174037.3951353-1-g...@fb.com/
is present. Although the commit listed in 'Fixes' appears to introduce
the bug, I cannot reproduce it with just that commit, and bisecting
runs into other bugs.

[   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 
page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 
mm/page_counter.c:156)
[   12.029539] Modules linked in:
[   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 
5.9.0-rc7-22-default #76
[   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 
mm/page_counter.c:50 mm/page_counter.c:156)
[ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff 
ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 
90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
[   12.030258] RSP: 0018:a5d8000efd08 EFLAGS: 00010086
[   12.030344] RAX:  RBX:  RCX: 0009
[   12.030455] RDX: 000b RSI:  RDI: 8ef8c7d2b248
[   12.030561] RBP: 8ef8c7d2b248 R08: 8ef8c78b19c8 R09: 0001
[   12.030672] R10:  R11: 8ef8c780e0d0 R12: 0001
[   12.030784] R13:  R14: 8ef9478b19c8 R15: 
[   12.030895] FS:  () GS:8ef8fbc8() 
knlGS:
[   12.031017] CS:  0010 DS:  ES:  CR0: 80050033
[   12.031104] CR2: 7f72c0af93ec CR3: 5c40a000 CR4: 06e0
[   12.031209] Call Trace:
[   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 
mm/memcontrol.c:3114)
[   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
[   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 
./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
[   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
[   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 
kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
[   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 
./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 
kernel/softirq.c:299)
[   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 
./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
[   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
[   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
[   12.033059] kthread (kernel/kthread.c:292)
[   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
[   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
[   12.033357] ---[ end trace 961dbfc01c109d1f ]---

[9.841552] [ cut here ]
[9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 
page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 
mm/page_counter.c:156)
[9.841982] Modules linked in:
[9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 
5.9.0-rc7-22-default #77
[9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[9.842571] Workqueue: events drain_local_stock
[9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 
mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff 
ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 
90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[9.843438] RSP: 0018:b1c18006be28 EFLAGS: 00010086
[9.843585] RAX:  RBX:  RCX: 94803bc2cae0
[9.843806] RDX: 0001 RSI:  RDI: 948007d2b248
[9.844026] RBP: 948007d2b248 R08: 948007c58eb0 R09: 948007da05ac
[9.844248] R10: 0018 R11: 0018 R12: 0001
[9.844477] R13:  R14:  R15: 94803bc2cac0
[9.844696] FS:  () GS:94803bc0() 
knlGS:
[9.844915] CS:  0010 DS:  ES:  CR0: 80050033
[9.845096] CR2: