Re: [RFC PATCH -mm] memcg: reparent only LRUs during mem_cgroup_css_offline

2014-03-07 Thread Michal Hocko
On Wed 26-02-14 18:49:10, Hugh Dickins wrote:
> On Wed, 19 Feb 2014, Michal Hocko wrote:
> 
> > css_offline callback exported by the cgroup core is not intended to get
> > rid of all the charges but rather to get rid of cached charges for the
> > soon destruction. For the memory controller we have 2 different types of
> > "cached" charges which prevent from the memcg destruction (because they
> > pin memcg by css reference). Swapped out pages (when swap accounting is
> > enabled) and kmem charges. None of them are dealt with in the current
> > code.
> > 
> > What we do instead is that we are reducing res counter charges (reduced
> > by kmem charges) to 0. And this hard down-to-0 requirement has led to
> > several issues in the past when the css_offline loops without any way
> > out e.g. memcg: reparent charges of children before processing parent.
> > 
> > The important thing is that we actually do not have to drop all the
> > charges. Instead we want to reduce LRU pages (which do not pin memcg) as
> > much as possible because they are not reachable by memcg iterators after
> > css_offline code returns, thus they are not reclaimable anymore.
> 
> That worries me.
> 
> > 
> > This patch simply extracts LRU reparenting into mem_cgroup_reparent_lrus
> > which doesn't care about charges and it is called from css_offline
> > callback and the original mem_cgroup_reparent_charges stays in
> > css_free callback. The original workaround for the endless loop is no
> > longer necessary because child vs. parent ordering is no longer an
> > issue. The only requirement is that the parent has to be still online at
> > the time of css_offline.
> 
> But isn't that precisely what we just found is not guaranteed?

OK, this implicitly relies on cgroup_mutex, and later, once cgroup_mutex
goes away, we would need our own lock around reparenting.

> And in fact your patch has the necessary loop up to find the
> first ancestor it can successfully css_tryget.  Maybe you meant
> to say "still there" rather than "still online".

I meant online because we have to make sure that the reparented pages
are reachable by iterators.
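
The difference between "still there" and "still online" can be made concrete with a small user-space model. This is only a sketch, not kernel code: struct group, group_tryget() and first_online_ancestor() are hypothetical names, and it assumes that a tryget fails as soon as a group has gone offline, so the walk skips ancestors that exist but are no longer visible to iterators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct group {
	struct group *parent;	/* NULL for the root */
	int refs;		/* 0 means the last reference is gone */
	bool online;		/* cleared once offlining has started */
};

/* Assumption: a tryget succeeds only while the group is online and referenced. */
bool group_tryget(struct group *g)
{
	if (!g->online || g->refs == 0)
		return false;
	g->refs++;
	return true;
}

/*
 * Walk up from @child and return the first ancestor that can still be
 * pinned. If the chain has no usable ancestor, fall back to @root, which
 * this model assumes never goes offline.
 */
struct group *first_online_ancestor(struct group *child, struct group *root)
{
	struct group *g;

	for (g = child->parent; g != NULL; g = g->parent)
		if (group_tryget(g))
			return g;

	assert(root->online);
	root->refs++;
	return root;
}
```

With a leaf whose direct parent is already offline, first_online_ancestor() skips that parent and pins the next online group further up, so pages moved there stay visible to anything that only iterates over online groups.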

> (Tangential, I don't think you rely on this any more than we do
> at present, and I may be wrong to suggest any problem: but I would
> feel more comfortable if kernel/cgroup.c's css_free_work_fn() did
> parent = css->parent; css->ss->css_free(css); css_put(parent);
> instead of putting the parent before freeing the child.)

That makes sense to me.
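
The ordering Hugh suggests can be sketched with a toy reference count. Everything here is hypothetical (struct node, node_put(), node_free_child_first()); it only models "remember the parent, free the child, then put the parent" and does not claim to match the real css_free_work_fn().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted node; stands in for a css for illustration only. */
struct node {
	struct node *parent;
	int refs;
	int freed;	/* 0 while alive, otherwise the order in which it was released */
};

static int free_clock;	/* bumped on every release to record the ordering */

/* Drop one reference; the node is released when the count reaches zero. */
void node_put(struct node *n)
{
	assert(n->refs > 0);
	if (--n->refs == 0)
		n->freed = ++free_clock;
}

/* Free the child first, then drop the reference it held on its parent. */
void node_free_child_first(struct node *child)
{
	struct node *parent = child->parent;	/* save before the child goes away */

	child->freed = ++free_clock;		/* models css->ss->css_free(css) */
	if (parent)
		node_put(parent);		/* models css_put(parent) */
}
```

Because the parent is put only after the child has been released, the child's free path can still rely on its parent existing, which is the property the suggested reordering is meant to provide.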

> > mem_cgroup_reparent_charges also doesn't have to exclude kmem charges
> > because there shouldn't be any at the css_free stage. Let's add BUG_ON
> > to make sure we haven't screwed anything.
> > 
> > mem_cgroup_reparent_lrus is racy but this is tolerable as the inflight
> > pages which will eventually get back to the memcg's LRU shouldn't
> > constitute a lot of memory.
> > 
> > Signed-off-by: Michal Hocko 
> > ---
> > This is on top of 
> > memcg-reparent-charges-of-children-before-processing-parent.patch
> > and I am not suggesting to replace it (I think Filipe's patch is more
> > appropriate for the stable tree).
> > Nevertheless I find this approach slightly better because it makes
> > semantical difference between offline and free more obvious and we can
> > build on top of it later (when offlining is no longer synchronized by
> > cgroup_mutex). But if you think that it is not worth touching this area
> > until we find a good way to reparent swapped out and kmem pages then I
> > am OK with it and stay with Filipe's patch.
> 
> I'm ambivalent about it.  I like it, and I like very much that the loop
> waiting for RES_USAGE to go down to 0 is without cgroup_mutex held; but
> I dislike that any pages temporarily off LRU at the time of css_offline's
> list_empty check, will then go AWOL (unreachable by reclaim), until
> css_free later gets around to reparenting them.

Yes it is not nice but my impression is that we are not talking about
too many pages. Maybe I am underestimating this.
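
To make the concern about in-flight pages concrete, here is a deliberately simplified user-space model. The names (struct page_model, reparent_lrus(), putback_all(), count_unreachable()) are hypothetical and the model ignores locking, zones and per-LRU lists; it only shows that a page isolated from the LRU at offline time keeps pointing at the offlined group until a later pass moves it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum owner { CHILD, PARENT };

/* Hypothetical page descriptor: who it is charged to and whether it is on an LRU. */
struct page_model {
	enum owner owner;
	bool on_lru;	/* false while the page is temporarily isolated */
};

/* Offline step: move only the pages currently sitting on the child's LRU. */
size_t reparent_lrus(struct page_model *pages, size_t n)
{
	size_t moved = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].owner == CHILD && pages[i].on_lru) {
			pages[i].owner = PARENT;
			moved++;
		}
	}
	return moved;
}

/* Isolated pages eventually return to the LRU of the group that owns them. */
void putback_all(struct page_model *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].on_lru = true;
}

/* Pages still owned by the offlined child cannot be found by reclaim. */
size_t count_unreachable(const struct page_model *pages, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (pages[i].owner == CHILD)
			count++;
	return count;
}
```

In this model a page that was off the LRU during the first reparent_lrus() pass stays unreachable even after putback_all(), and only a second pass (standing in for the work done at css_free) makes it reclaimable again.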

> It's conceivable that some code could be added to mem_cgroup_page_lruvec()
> (near my "Surreptitiously" comment), to reparent when they're put back on
> LRU; but more probably not, that's already tricky, and probably bad to
> make it any trickier, even if it turned out to be possible.

That would work, but as you write, it would make this code even
trickier.

> So I'm inclined to wait until the swap and kmem situation is sorted out

Vladimir Davydov has already posted a kmem reparenting patchset but I
haven't read through it yet. Swap reparenting has already been posted
by you and Johannes.

I am not planning to push this patch; it was more an example of what I
was referring to earlier in the discussion.

> (when the delay between offline and free should become much briefer);
> but would be happy if you found a good way to make the missing pages
> reclaimable in the meantime.
> 
> A couple of un-comments below.
[...]
> > @@ -6613,13 +6614,20 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> >  	kmem_cgroup_css_offline(memcg);

Re: [RFC PATCH -mm] memcg: reparent only LRUs during mem_cgroup_css_offline

2014-02-26 Thread Hugh Dickins
On Wed, 19 Feb 2014, Michal Hocko wrote:

> css_offline callback exported by the cgroup core is not intended to get
> rid of all the charges but rather to get rid of cached charges for the
> soon destruction. For the memory controller we have 2 different types of
> "cached" charges which prevent from the memcg destruction (because they
> pin memcg by css reference). Swapped out pages (when swap accounting is
> enabled) and kmem charges. None of them are dealt with in the current
> code.
> 
> What we do instead is that we are reducing res counter charges (reduced
> by kmem charges) to 0. And this hard down-to-0 requirement has led to
> several issues in the past when the css_offline loops without any way
> out e.g. memcg: reparent charges of children before processing parent.
> 
> The important thing is that we actually do not have to drop all the
> charges. Instead we want to reduce LRU pages (which do not pin memcg) as
> much as possible because they are not reachable by memcg iterators after
> css_offline code returns, thus they are not reclaimable anymore.

That worries me.

> 
> This patch simply extracts LRU reparenting into mem_cgroup_reparent_lrus
> which doesn't care about charges and it is called from css_offline
> callback and the original mem_cgroup_reparent_charges stays in
> css_free callback. The original workaround for the endless loop is no
> longer necessary because child vs. parent ordering is no longer an
> issue. The only requirement is that the parent has to be still online at
> the time of css_offline.

But isn't that precisely what we just found is not guaranteed?
And in fact your patch has the necessary loop up to find the
first ancestor it can successfully css_tryget.  Maybe you meant
to say "still there" rather than "still online".

(Tangential, I don't think you rely on this any more than we do
at present, and I may be wrong to suggest any problem: but I would
feel more comfortable if kernel/cgroup.c's css_free_work_fn() did
parent = css->parent; css->ss->css_free(css); css_put(parent);
instead of putting the parent before freeing the child.)

> mem_cgroup_reparent_charges also doesn't have to exclude kmem charges
> because there shouldn't be any at the css_free stage. Let's add BUG_ON
> to make sure we haven't screwed anything.
> 
> mem_cgroup_reparent_lrus is racy but this is tolerable as the inflight
> pages which will eventually get back to the memcg's LRU shouldn't
> constitute a lot of memory.
> 
> Signed-off-by: Michal Hocko 
> ---
> This is on top of 
> memcg-reparent-charges-of-children-before-processing-parent.patch
> and I am not suggesting to replace it (I think Filipe's patch is more
> appropriate for the stable tree).
> Nevertheless I find this approach slightly better because it makes
> semantical difference between offline and free more obvious and we can
> build on top of it later (when offlining is no longer synchronized by
> cgroup_mutex). But if you think that it is not worth touching this area
> until we find a good way to reparent swapped out and kmem pages then I
> am OK with it and stay with Filipe's patch.

I'm ambivalent about it.  I like it, and I like very much that the loop
waiting for RES_USAGE to go down to 0 is without cgroup_mutex held; but
I dislike that any pages temporarily off LRU at the time of css_offline's
list_empty check, will then go AWOL (unreachable by reclaim), until
css_free later gets around to reparenting them.

It's conceivable that some code could be added to mem_cgroup_page_lruvec()
(near my "Surreptitiously" comment), to reparent when they're put back on
LRU; but more probably not, that's already tricky, and probably bad to
make it any trickier, even if it turned out to be possible.

So I'm inclined to wait until the swap and kmem situation is sorted out
(when the delay between offline and free should become much briefer);
but would be happy if you found a good way to make the missing pages
reclaimable in the meantime.

A couple of un-comments below.

Hugh

> 
>  mm/memcontrol.c | 102 ++--
>  1 file changed, 55 insertions(+), 47 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 45c2a50954ac..9f8e54333b60 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3870,6 +3870,7 @@ out:
>   * @page: the page to move
>   * @pc: page_cgroup of the page
>   * @child: page's cgroup
> + * @parent: parent where to reparent
>   *
>   * move charges to its parent or the root cgroup if the group has no
>   * parent (aka use_hierarchy==0).
> @@ -3888,9 +3889,9 @@ out:
>   */
>  static int mem_cgroup_move_parent(struct page *page,
>  				   struct page_cgroup *pc,
> -				   struct mem_cgroup *child)
> +				   struct mem_cgroup *child,
> +				   struct mem_cgroup *parent)
>  {
> -	struct mem_cgroup *parent;
>  	unsigned int nr_pages;
>   unsigned long 

[RFC PATCH -mm] memcg: reparent only LRUs during mem_cgroup_css_offline

2014-02-19 Thread Michal Hocko
The css_offline callback exported by the cgroup core is not intended to
get rid of all the charges but rather to get rid of cached charges ahead
of the upcoming destruction. For the memory controller we have 2 different
types of "cached" charges which prevent the memcg destruction (because
they pin memcg by css reference): swapped out pages (when swap accounting
is enabled) and kmem charges. Neither of them is dealt with in the current
code.

What we do instead is that we are reducing res counter charges (reduced
by kmem charges) to 0. And this hard down-to-0 requirement has led to
several issues in the past when the css_offline loops without any way
out e.g. memcg: reparent charges of children before processing parent.

The important thing is that we actually do not have to drop all the
charges. Instead we want to reduce LRU pages (which do not pin memcg) as
much as possible because they are not reachable by memcg iterators after
css_offline code returns, thus they are not reclaimable anymore.

This patch simply extracts LRU reparenting into mem_cgroup_reparent_lrus
which doesn't care about charges and it is called from css_offline
callback and the original mem_cgroup_reparent_charges stays in
css_free callback. The original workaround for the endless loop is no
longer necessary because child vs. parent ordering is no longer an
issue. The only requirement is that the parent has to be still online at
the time of css_offline.
mem_cgroup_reparent_charges also doesn't have to exclude kmem charges
because there shouldn't be any at the css_free stage. Let's add BUG_ON
to make sure we haven't screwed anything.

mem_cgroup_reparent_lrus is racy but this is tolerable as the inflight
pages which will eventually get back to the memcg's LRU shouldn't
constitute a lot of memory.

Signed-off-by: Michal Hocko 
---
This is on top of 
memcg-reparent-charges-of-children-before-processing-parent.patch
and I am not suggesting to replace it (I think Filipe's patch is more
appropriate for the stable tree).
Nevertheless I find this approach slightly better because it makes
semantical difference between offline and free more obvious and we can
build on top of it later (when offlining is no longer synchronized by
cgroup_mutex). But if you think that it is not worth touching this area
until we find a good way to reparent swapped out and kmem pages then I
am OK with it and stay with Filipe's patch.
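
The split between the two callbacks described above can be sketched as a tiny counter model. This is an illustration under stated assumptions, not the patch itself: struct memcg_model, model_css_offline() and model_css_free() are hypothetical, the counters ignore hierarchy, and assert() stands in for the BUG_ON mentioned in the description.

```c
#include <assert.h>

/* Hypothetical counters for one group; not the kernel's res_counter. */
struct memcg_model {
	unsigned long usage;	/* all pages still charged to the group */
	unsigned long kmem;	/* kernel-memory part of the charge */
	unsigned long on_lru;	/* charged pages currently on the group's LRUs */
};

/*
 * Offline: hand over whatever is on the LRUs right now and return.
 * There is no loop waiting for usage to drop to zero.
 */
unsigned long model_css_offline(struct memcg_model *cg)
{
	unsigned long moved = cg->on_lru;

	cg->usage -= moved;
	cg->on_lru = 0;
	return moved;
}

/*
 * Free: nothing can charge to the group any more, so move the rest.
 * Assumption taken from the description: no kmem charges remain here.
 */
unsigned long model_css_free(struct memcg_model *cg)
{
	unsigned long moved;

	assert(cg->kmem == 0);	/* stands in for the BUG_ON */
	moved = cg->usage;
	cg->usage = 0;
	return moved;
}
```

In this model the offline step never blocks on pages that are in flight; the free step is the only place that requires the remaining usage to reach zero.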

 mm/memcontrol.c | 102 ++--
 1 file changed, 55 insertions(+), 47 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 45c2a50954ac..9f8e54333b60 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3870,6 +3870,7 @@ out:
  * @page: the page to move
  * @pc: page_cgroup of the page
  * @child: page's cgroup
+ * @parent: parent where to reparent
  *
  * move charges to its parent or the root cgroup if the group has no
  * parent (aka use_hierarchy==0).
@@ -3888,9 +3889,9 @@ out:
  */
 static int mem_cgroup_move_parent(struct page *page,
 				   struct page_cgroup *pc,
-				   struct mem_cgroup *child)
+				   struct mem_cgroup *child,
+				   struct mem_cgroup *parent)
 {
-	struct mem_cgroup *parent;
 	unsigned int nr_pages;
 	unsigned long uninitialized_var(flags);
 	int ret;
@@ -3905,13 +3906,6 @@ static int mem_cgroup_move_parent(struct page *page,
 
 	nr_pages = hpage_nr_pages(page);
 
-	parent = parent_mem_cgroup(child);
-	/*
-	 * If no parent, move charges to root cgroup.
-	 */
-	if (!parent)
-		parent = root_mem_cgroup;
-
 	if (nr_pages > 1) {
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		flags = compound_lock_irqsave(page);
@@ -4867,6 +4861,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 /**
  * mem_cgroup_force_empty_list - clears LRU of a group
  * @memcg: group to clear
+ * @parent: parent group where to reparent
  * @node: NUMA node
  * @zid: zone id
  * @lru: lru to to clear
@@ -4876,6 +4871,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
  * group.
  */
 static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
+				struct mem_cgroup *parent,
 				int node, int zid, enum lru_list lru)
 {
 	struct lruvec *lruvec;
@@ -4909,7 +4905,7 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 
 		pc = lookup_page_cgroup(page);
 
-		if (mem_cgroup_move_parent(page, pc, memcg)) {
+		if (mem_cgroup_move_parent(page, pc, memcg, parent)) {
 			/* found lock contention or "pc" is obsolete. */
 			busy = page;
 			cond_resched();
@@ -4918,6 +4914,28 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
} 
