Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-15 Thread Michal Hocko
On Sun 11-01-15 15:55:43, Johannes Weiner wrote:
> From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <han...@cmpxchg.org>
> Date: Sun, 11 Jan 2015 10:29:05 -0500
> Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
>  unbind
> 
> Cgroup core assumes that any outstanding css references after
> offlining are temporary in nature, and e.g. mount waits for them to
> disappear and release the root cgroup.  But leftover page cache and
> swapout records in an offlined memcg are only dropped when the pages
> get reclaimed under pressure or the swapped out pages get faulted in
> from other cgroups, and so those cgroup operations can hang forever.
> 
> Implement the ->unbind() callback to actively get rid of outstanding
> references when cgroup core wants them gone.  Swap out records are
> deleted, such that the swap-in path will charge those pages to the
> faulting task. 

OK, that makes sense to me.

> Page cache pages are moved to the root memory cgroup.

OK, this is better than reclaiming them.

[...]
> +static void unbind_lru_list(struct mem_cgroup *memcg,
> +                            struct zone *zone, enum lru_list lru)
> +{
> +        struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +        struct list_head *list = &lruvec->lists[lru];
> +
> +        while (!list_empty(list)) {
> +                unsigned int nr_pages;
> +                unsigned long flags;
> +                struct page *page;
> +
> +                spin_lock_irqsave(&zone->lru_lock, flags);
> +                if (list_empty(list)) {
> +                        spin_unlock_irqrestore(&zone->lru_lock, flags);
> +                        break;
> +                }
> +                page = list_last_entry(list, struct page, lru);

Taking the lru_lock for each page is asking for trouble; the lock would
bounce like crazy. It shouldn't be a big problem to list_move the pages
to a local list and then work on that one without the lock. Those pages
wouldn't be visible to reclaim, but only temporarily. If that is not
acceptable, then batch at least some number of pages (the popular
SWAP_CLUSTER_MAX).

> +                if (!get_page_unless_zero(page)) {
> +                        list_move(&page->lru, list);
> +                        spin_unlock_irqrestore(&zone->lru_lock, flags);
> +                        continue;
> +                }
> +                BUG_ON(!PageLRU(page));
> +                ClearPageLRU(page);
> +                del_page_from_lru_list(page, lruvec, lru);
> +                spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +                compound_lock(page);
> +                nr_pages = hpage_nr_pages(page);
> +
> +                if (!mem_cgroup_move_account(page, nr_pages,
> +                                             memcg, root_mem_cgroup)) {
> +                        /*
> +                         * root_mem_cgroup page counters are not used,
> +                         * otherwise we'd have to charge them here.
> +                         */
> +                        page_counter_uncharge(&memcg->memory, nr_pages);
> +                        if (do_swap_account)
> +                                page_counter_uncharge(&memcg->memsw, nr_pages);
> +                        css_put_many(&memcg->css, nr_pages);
> +                }
> +
> +                compound_unlock(page);
> +
> +                putback_lru_page(page);
> +        }
> +}
> +
> +static void unbind_work_fn(struct work_struct *work)
> +{
> +        struct cgroup_subsys_state *css;
> +retry:
> +        drain_all_stock(root_mem_cgroup);
> +
> +        rcu_read_lock();
> +        css_for_each_child(css, &root_mem_cgroup->css) {
> +                struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +                /* Drop references from swap-out records */
> +                if (do_swap_account) {
> +                        long zapped;
> +
> +                        zapped = swap_cgroup_zap_records(memcg->css.id);
> +                        page_counter_uncharge(&memcg->memsw, zapped);
> +                        css_put_many(&memcg->css, zapped);
> +                }
> +
> +                /* Drop references from leftover LRU pages */
> +                css_get(css);
> +                rcu_read_unlock();
> +                atomic_inc(&memcg->moving_account);
> +                synchronize_rcu();

Why do we need this? Who can migrate to/from offline memcgs? 

> +                while (page_counter_read(&memcg->memory) -
> +                       page_counter_read(&memcg->kmem) > 0) {
> +                        struct zone *zone;
> +                        enum lru_list lru;
> +
> +                        lru_add_drain_all();
> +
> +                        for_each_zone(zone)
> +                                for_each_lru(lru)
> +                                        unbind_lru_list(memcg, zone, lru);
> +
> +                        cond_resched();
> +                }
> +                atomic_dec(&memcg->moving_account);
> +                rcu_read_lock();
> +                css_put(css);
> +        }
> +        rcu_read_unlock();
> +        /*
> +         * Swap-in is racy:
> +         *
> +         * #0                                #1
> +         *                                   lookup_swap_cgroup_id()

Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-15 Thread Michal Hocko
On Sat 10-01-15 16:43:16, Tejun Heo wrote:
> Currently, if a hierarchy doesn't have any live children when it's
> unmounted, the hierarchy starts dying by killing its refcnt.  The
> expectation is that even if there are lingering dead children which
> are lingering due to remaining references, they'll be put in a finite
> amount of time.  When the children are finally released, the hierarchy
> is destroyed and all controllers bound to it also are released.
> 
> However, for memcg, the premise that the lingering refs will be put in
> a finite amount of time is not true.  In the absence of memory pressure,
> dead memcg's may hang around indefinitely pinned by its pages.  This
> unfortunately may lead to indefinite hang on the next mount attempt
> involving memcg as the mount logic waits for it to get released.
> 
> While we can change hierarchy destruction logic such that a hierarchy
> is only destroyed when it's not mounted anywhere and all its children,
> live or dead, are gone, this makes whether the hierarchy gets
> destroyed or not to be determined by factors opaque to userland.
> Userland may or may not get a new hierarchy on the next mount attempt.
> Worse, if it explicitly wants to create a new hierarchy with different
> options or controller compositions involving memcg, it will fail in an
> essentially arbitrary manner.
> 
> We want to guarantee that a hierarchy is destroyed once the
> conditions, unmounted and no visible children, are met.  To aid it,
> this patch introduces a new callback cgroup_subsys->unbind() which is
> invoked right before the hierarchy a subsystem is bound to starts
> dying.  memcg can implement this callback and initiate draining of
> remaining refs so that the hierarchy can eventually be released in a
> finite amount of time.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Cc: Li Zefan <lize...@huawei.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Michal Hocko <mho...@suse.cz>
> Cc: Vladimir Davydov <vdavy...@parallels.com>

Ohh, I have missed this one as I wasn't on the CC list.

FWIW this approach makes sense to me. I just think that we should have a
way to fail. E.g. kmem pages are impossible to reclaim because there
might be some objects lingering somewhere not bound to a task context
and reparenting is hard as Vladimir has pointed out several times
already.
Normal LRU pages should be reclaimable or reparented to the root easily.

I cannot judge the implementation but I agree with the fact that memcg
controller should be the one to take an action.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-14 Thread Suzuki K. Poulose

On 11/01/15 20:55, Johannes Weiner wrote:

On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:

Currently, if a hierarchy doesn't have any live children when it's
unmounted, the hierarchy starts dying by killing its refcnt.  The
expectation is that even if there are lingering dead children which
are lingering due to remaining references, they'll be put in a finite
amount of time.  When the children are finally released, the hierarchy
is destroyed and all controllers bound to it also are released.

However, for memcg, the premise that the lingering refs will be put in
a finite amount of time is not true.  In the absence of memory pressure,
dead memcg's may hang around indefinitely pinned by its pages.  This
unfortunately may lead to indefinite hang on the next mount attempt
involving memcg as the mount logic waits for it to get released.

While we can change hierarchy destruction logic such that a hierarchy
is only destroyed when it's not mounted anywhere and all its children,
live or dead, are gone, this makes whether the hierarchy gets
destroyed or not to be determined by factors opaque to userland.
Userland may or may not get a new hierarchy on the next mount attempt.
Worse, if it explicitly wants to create a new hierarchy with different
options or controller compositions involving memcg, it will fail in an
essentially arbitrary manner.

We want to guarantee that a hierarchy is destroyed once the
conditions, unmounted and no visible children, are met.  To aid it,
this patch introduces a new callback cgroup_subsys->unbind() which is
invoked right before the hierarchy a subsystem is bound to starts
dying.  memcg can implement this callback and initiate draining of
remaining refs so that the hierarchy can eventually be released in a
finite amount of time.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Li Zefan <lize...@huawei.com>
Cc: Johannes Weiner <han...@cmpxchg.org>
Cc: Michal Hocko <mho...@suse.cz>
Cc: Vladimir Davydov <vdavy...@parallels.com>
---
Hello,


May be, we should kill the ref counter to the memory controller root in
cgroup_kill_sb only if there is no children at all, neither online nor
offline.


Ah, thanks for the analysis, but I really wanna avoid making hierarchy
destruction conditions opaque to userland.  This is userland visible
behavior.  It shouldn't be determined by kernel internals invisible
outside.  This patch adds ss->unbind() which memcg can hook into to
kick off draining of residual refs.  If this would work, I'll add this
patch to cgroup/for-3.19-fixes, possibly with stable cc'd.


How about this ->unbind() for memcg?

 From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <han...@cmpxchg.org>
Date: Sun, 11 Jan 2015 10:29:05 -0500
Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
  unbind

This patch doesn't cleanly apply on 3.19-rc4 for me (hunks in 
mm/memcontrol.c). I have manually applied it.


With these two patches in, I am still getting the failure. Also, the 
kworker thread is taking up 100% CPU time (unbind_work) and continues 
to do so even after 6 minutes.


Is there something I missed ?

Thanks
Suzuki

Cgroup core assumes that any outstanding css references after
offlining are temporary in nature, and e.g. mount waits for them to
disappear and release the root cgroup.  But leftover page cache and
swapout records in an offlined memcg are only dropped when the pages
get reclaimed under pressure or the swapped out pages get faulted in
from other cgroups, and so those cgroup operations can hang forever.

Implement the ->unbind() callback to actively get rid of outstanding
references when cgroup core wants them gone.  Swap out records are
deleted, such that the swap-in path will charge those pages to the
faulting task.  Page cache pages are moved to the root memory cgroup.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
  include/linux/swap_cgroup.h |   6 +++
  mm/memcontrol.c | 126 
  mm/swap_cgroup.c|  38 +
  3 files changed, 170 insertions(+)

diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index 145306bdc92f..ffe0866d2997 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
unsigned short old, unsigned short new);
  extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
  extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern unsigned long swap_cgroup_zap_records(unsigned short id);
  extern int swap_cgroup_swapon(int type, unsigned long max_pages);
  extern void swap_cgroup_swapoff(int type);

@@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
return 0;
  }

+static inline unsigned long swap_cgroup_zap_records(unsigned short id)
+{
+   return 0;
+}
+
  static inline int
  swap_cgroup_swapon(int type, unsigned long max_pages)
  {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 692e96407627..40c426add613 

Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-12 Thread Tejun Heo
On Mon, Jan 12, 2015 at 03:59:56PM +0300, Vladimir Davydov wrote:
> I haven't dug deep into the cgroup core, but may be we could detach the
> old root in cgroup_kill_sb() and leave it dangling until the last
> reference to it has gone?

The root isn't the problem here.  Individual controllers are as
there's only one copy of each and we most likely don't want to carry
over child csses from one hierarchy to the next as the controller may
operate under a different set of rules.

> BTW, IIRC the problem always existed for kmem-active memory cgroups,
> because we never had kmem reparenting. May be, we could therefore just
> document somewhere that kmem accounting is highly discouraged to be used
> in the legacy hierarchy and merge these two patches as is to handle page
> cache and swap charges? We won't break anything, because it was always
> broken :-)

If we're going that route, I think it'd be better to declare hierarchy
lifetime rules as essentially opaque to userland and destroy
hierarchies only when all its children, dead or alive, are gone.

Thanks.

-- 
tejun


Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-12 Thread Vladimir Davydov
On Mon, Jan 12, 2015 at 06:28:45AM -0500, Tejun Heo wrote:
> On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote:
> > Come to think of it, I wonder how many users actually want to mount
> > different controllers subset after unmount. Because we could allow
> 
> It wouldn't be a common use case but, on the face of it, we still
> support it.  If we collecctively decide that once a sub cgroup is
> created for any controller no further hierarchy configuration for that
> controller is allowed, that'd work too, but one way or the other, the
> behavior, I believe, should be well-defined.  As it currently stands,
> the conditions and failure mode are opaque to userland, which is never
> a good thing.
> 
> > mounting the same subset perfectly well, even if it includes memcg. BTW,
> > AFAIU in the unified hierarchy we won't have this problem at all,
> > because by definition it mounts all controllers IIRC, so do we need to
> > bother fixing this in such a complicated manner at all for the setup
> > that's going to be deprecated anyway?
> 
> There will likely be a quite long transition period and if and when
> the old things can be removed, this added cleanup logic can go away
> with it.  It depends on how complex the implementation would get but
> as long as it isn't too much and stays mostly isolated from the saner
> paths, I think it's probably the right thing to do.

We can't just move kmem objects from a per-memcg kmem_cache to the
global one fixing page counters, because in contrast to page cache and
swap we don't even track all kmem allocations. So we have to keep all
per-memcg kmem_cache's somewhere after unmount until they can finally be
destroyed, but the whole logic behind per-memcg kmem_cache's destruction
is currently tightly interwoven with that of css's (we destroy
kmem_cache's from css_free), and there won't be any css's after unmount.

So it isn't possible to add a couple of isolated functions which will
live their own lives and can be easily removed once we've switched to
the unified hierarchy. Quite the contrary: implementing kmem
reparenting would make me rethink and complicate kmemcg code all over
the place. That's why I'm rather reluctant to do it.

I haven't dug deep into the cgroup core, but may be we could detach the
old root in cgroup_kill_sb() and leave it dangling until the last
reference to it has gone?

BTW, IIRC the problem always existed for kmem-active memory cgroups,
because we never had kmem reparenting. May be, we could therefore just
document somewhere that kmem accounting is highly discouraged to be used
in the legacy hierarchy and merge these two patches as is to handle page
cache and swap charges? We won't break anything, because it was always
broken :-)

Thanks,
Vladimir


Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-12 Thread Tejun Heo
Hello, Vladimir.

On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote:
> Come to think of it, I wonder how many users actually want to mount
> different controllers subset after unmount. Because we could allow

It wouldn't be a common use case but, on the face of it, we still
support it.  If we collectively decide that once a sub cgroup is
created for any controller no further hierarchy configuration for that
controller is allowed, that'd work too, but one way or the other, the
behavior, I believe, should be well-defined.  As it currently stands,
the conditions and failure mode are opaque to userland, which is never
a good thing.

> mounting the same subset perfectly well, even if it includes memcg. BTW,
> AFAIU in the unified hierarchy we won't have this problem at all,
> because by definition it mounts all controllers IIRC, so do we need to
> bother fixing this in such a complicated manner at all for the setup
> that's going to be deprecated anyway?

There will likely be a quite long transition period and if and when
the old things can be removed, this added cleanup logic can go away
with it.  It depends on how complex the implementation would get but
as long as it isn't too much and stays mostly isolated from the saner
paths, I think it's probably the right thing to do.

Thanks.

-- 
tejun


Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-12 Thread Vladimir Davydov
On Sun, Jan 11, 2015 at 03:55:43PM -0500, Johannes Weiner wrote:
> On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:
> > > May be, we should kill the ref counter to the memory controller root in
> > > cgroup_kill_sb only if there is no children at all, neither online nor
> > > offline.
> > 
> > Ah, thanks for the analysis, but I really wanna avoid making hierarchy
> > destruction conditions opaque to userland.  This is userland visible
> > behavior.  It shouldn't be determined by kernel internals invisible
> > outside.  This patch adds ss->unbind() which memcg can hook into to
> > kick off draining of residual refs.  If this would work, I'll add this
> > patch to cgroup/for-3.19-fixes, possibly with stable cc'd.
> 
> How about this ->unbind() for memcg?
> 
> From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <han...@cmpxchg.org>
> Date: Sun, 11 Jan 2015 10:29:05 -0500
> Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
>  unbind
> 
> Cgroup core assumes that any outstanding css references after
> offlining are temporary in nature, and e.g. mount waits for them to
> disappear and release the root cgroup.  But leftover page cache and
> swapout records in an offlined memcg are only dropped when the pages
> get reclaimed under pressure or the swapped out pages get faulted in
> from other cgroups, and so those cgroup operations can hang forever.
> 
> Implement the ->unbind() callback to actively get rid of outstanding
> references when cgroup core wants them gone.  Swap out records are
> deleted, such that the swap-in path will charge those pages to the
> faulting task.  Page cache pages are moved to the root memory cgroup.

... and kmem pages are ignored. I reckon we could reparent them (I
submitted the patch set some time ago), but that's going to be tricky
and will complicate regular kmem charge/uncharge paths, as well as
list_lru_add/del. I don't think we can put up with it, provided we only
want reparenting on unmount, do we not?

Come to think of it, I wonder how many users actually want to mount
different controllers subset after unmount. Because we could allow
mounting the same subset perfectly well, even if it includes memcg. BTW,
AFAIU in the unified hierarchy we won't have this problem at all,
because by definition it mounts all controllers IIRC, so do we need to
bother fixing this in such a complicated manner at all for the setup
that's going to be deprecated anyway?

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-12 Thread Tejun Heo
On Mon, Jan 12, 2015 at 03:59:56PM +0300, Vladimir Davydov wrote:
> I haven't dug deep into the cgroup core, but may be we could detach the
> old root in cgroup_kill_sb() and leave it dangling until the last
> reference to it has gone?

The root isn't the problem here.  Individual controllers are, as
there's only one copy of each and we most likely don't want to carry
over child csses from one hierarchy to the next as the controller may
operate under a different set of rules.

> BTW, IIRC the problem always existed for kmem-active memory cgroups,
> because we never had kmem reparenting. May be, we could therefore just
> document somewhere that kmem accounting is highly discouraged to be used
> in the legacy hierarchy and merge these two patches as is to handle page
> cache and swap charges? We won't break anything, because it was always
> broken :-)

If we're going that route, I think it'd be better to declare hierarchy
lifetime rules as essentially opaque to userland and destroy a
hierarchy only when all its children, dead or alive, are gone.

Thanks.

-- 
tejun


Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-12 Thread Tejun Heo
Hello, Vladimir.

On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote:
> Come to think of it, I wonder how many users actually want to mount
> different controllers subset after unmount. Because we could allow

It wouldn't be a common use case but, on the face of it, we still
support it.  If we collectively decide that once a sub cgroup is
created for any controller no further hierarchy configuration for that
controller is allowed, that'd work too, but one way or the other, the
behavior, I believe, should be well-defined.  As it currently stands,
the conditions and failure mode are opaque to userland, which is never
a good thing.

> mounting the same subset perfectly well, even if it includes memcg. BTW,
> AFAIU in the unified hierarchy we won't have this problem at all,
> because by definition it mounts all controllers IIRC, so do we need to
> bother fixing this in such a complicated manner at all for the setup
> that's going to be deprecated anyway?

There will likely be a quite long transition period and if and when
the old things can be removed, this added cleanup logic can go away
with it.  It depends on how complex the implementation would get but
as long as it isn't too much and stays mostly isolated from the saner
paths, I think it's probably the right thing to do.

Thanks.

-- 
tejun


Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-12 Thread Vladimir Davydov
On Mon, Jan 12, 2015 at 06:28:45AM -0500, Tejun Heo wrote:
> On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote:
> > Come to think of it, I wonder how many users actually want to mount
> > different controllers subset after unmount. Because we could allow
>
> It wouldn't be a common use case but, on the face of it, we still
> support it.  If we collectively decide that once a sub cgroup is
> created for any controller no further hierarchy configuration for that
> controller is allowed, that'd work too, but one way or the other, the
> behavior, I believe, should be well-defined.  As it currently stands,
> the conditions and failure mode are opaque to userland, which is never
> a good thing.
>
> > mounting the same subset perfectly well, even if it includes memcg. BTW,
> > AFAIU in the unified hierarchy we won't have this problem at all,
> > because by definition it mounts all controllers IIRC, so do we need to
> > bother fixing this in such a complicated manner at all for the setup
> > that's going to be deprecated anyway?
>
> There will likely be a quite long transition period and if and when
> the old things can be removed, this added cleanup logic can go away
> with it.  It depends on how complex the implementation would get but
> as long as it isn't too much and stays mostly isolated from the saner
> paths, I think it's probably the right thing to do.

We can't just move kmem objects from a per-memcg kmem_cache to the
global one fixing page counters, because in contrast to page cache and
swap we don't even track all kmem allocations. So we have to keep all
per-memcg kmem_cache's somewhere after unmount until they can finally be
destroyed, but the whole logic behind per-memcg kmem_cache's destruction
is currently tightly interwoven with that of css's (we destroy
kmem_cache's from css_free), and there won't be any css's after unmount.

That said, it isn't possible to add just a couple of isolated functions
which would live their own lives and could be easily removed once we've
switched to the unified hierarchy. Quite the contrary, implementing
kmem reparenting would make me rethink and complicate kmemcg code all
over the place. That's why I'm rather reluctant to do it.

I haven't dug deep into the cgroup core, but may be we could detach the
old root in cgroup_kill_sb() and leave it dangling until the last
reference to it has gone?

BTW, IIRC the problem always existed for kmem-active memory cgroups,
because we never had kmem reparenting. May be, we could therefore just
document somewhere that kmem accounting is highly discouraged to be used
in the legacy hierarchy and merge these two patches as is to handle page
cache and swap charges? We won't break anything, because it was always
broken :-)

Thanks,
Vladimir


Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-11 Thread Johannes Weiner
On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote:
> Currently, if a hierarchy doesn't have any live children when it's
> unmounted, the hierarchy starts dying by killing its refcnt.  The
> expectation is that even if there are dead children lingering due to
> remaining references, they'll be put in a finite
> amount of time.  When the children are finally released, the hierarchy
> is destroyed and all controllers bound to it also are released.
> 
> However, for memcg, the premise that the lingering refs will be put in
> a finite amount of time is not true.  In the absence of memory pressure,
> dead memcg's may hang around indefinitely pinned by its pages.  This
> unfortunately may lead to indefinite hang on the next mount attempt
> involving memcg as the mount logic waits for it to get released.
> 
> While we can change hierarchy destruction logic such that a hierarchy
> is only destroyed when it's not mounted anywhere and all its children,
> live or dead, are gone, this makes whether the hierarchy gets
> destroyed or not depend on factors opaque to userland.
> Userland may or may not get a new hierarchy on the next mount attempt.
> Worse, if it explicitly wants to create a new hierarchy with different
> options or controller compositions involving memcg, it will fail in an
> essentially arbitrary manner.
> 
> We want to guarantee that a hierarchy is destroyed once the
> conditions, unmounted and no visible children, are met.  To aid it,
> this patch introduces a new callback cgroup_subsys->unbind() which is
> invoked right before the hierarchy a subsystem is bound to starts
> dying.  memcg can implement this callback and initiate draining of
> remaining refs so that the hierarchy can eventually be released in a
> finite amount of time.
> 
> Signed-off-by: Tejun Heo 
> Cc: Li Zefan 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 
> ---
> Hello,
> 
> > May be, we should kill the ref counter to the memory controller root in
> > cgroup_kill_sb only if there is no children at all, neither online nor
> > offline.
> 
> Ah, thanks for the analysis, but I really wanna avoid making hierarchy
> destruction conditions opaque to userland.  This is userland visible
> behavior.  It shouldn't be determined by kernel internals invisible
> outside.  This patch adds ss->unbind() which memcg can hook into to
> kick off draining of residual refs.  If this would work, I'll add this
> patch to cgroup/for-3.19-fixes, possibly with stable cc'd.

How about this ->unbind() for memcg?

From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Sun, 11 Jan 2015 10:29:05 -0500
Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during
 unbind

Cgroup core assumes that any outstanding css references after
offlining are temporary in nature, and e.g. mount waits for them to
disappear and release the root cgroup.  But leftover page cache and
swapout records in an offlined memcg are only dropped when the pages
get reclaimed under pressure or the swapped out pages get faulted in
from other cgroups, and so those cgroup operations can hang forever.

Implement the ->unbind() callback to actively get rid of outstanding
references when cgroup core wants them gone.  Swap out records are
deleted, such that the swap-in path will charge those pages to the
faulting task.  Page cache pages are moved to the root memory cgroup.

Signed-off-by: Johannes Weiner 
---
 include/linux/swap_cgroup.h |   6 +++
 mm/memcontrol.c             | 126 
 mm/swap_cgroup.c            |  38 +
 3 files changed, 170 insertions(+)

diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index 145306bdc92f..ffe0866d2997 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern unsigned long swap_cgroup_zap_records(unsigned short id);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 
@@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
return 0;
 }
 
+static inline unsigned long swap_cgroup_zap_records(unsigned short id)
+{
+   return 0;
+}
+
 static inline int
 swap_cgroup_swapon(int type, unsigned long max_pages)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 692e96407627..40c426add613 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5197,6 +5197,131 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
 	mem_cgroup_from_css(root_css)->use_hierarchy = true;
 }
 
+static void unbind_lru_list(struct mem_cgroup *memcg,
+			    struct zone *zone, enum lru_list lru)


[PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback

2015-01-10 Thread Tejun Heo
Currently, if a hierarchy doesn't have any live children when it's
unmounted, the hierarchy starts dying by killing its refcnt.  The
expectation is that even if there are dead children lingering due to
remaining references, they'll be put in a finite
amount of time.  When the children are finally released, the hierarchy
is destroyed and all controllers bound to it also are released.

However, for memcg, the premise that the lingering refs will be put in
a finite amount of time is not true.  In the absence of memory pressure,
dead memcg's may hang around indefinitely pinned by its pages.  This
unfortunately may lead to indefinite hang on the next mount attempt
involving memcg as the mount logic waits for it to get released.

While we can change hierarchy destruction logic such that a hierarchy
is only destroyed when it's not mounted anywhere and all its children,
live or dead, are gone, this makes whether the hierarchy gets
destroyed or not depend on factors opaque to userland.
Userland may or may not get a new hierarchy on the next mount attempt.
Worse, if it explicitly wants to create a new hierarchy with different
options or controller compositions involving memcg, it will fail in an
essentially arbitrary manner.

We want to guarantee that a hierarchy is destroyed once the
conditions, unmounted and no visible children, are met.  To aid it,
this patch introduces a new callback cgroup_subsys->unbind() which is
invoked right before the hierarchy a subsystem is bound to starts
dying.  memcg can implement this callback and initiate draining of
remaining refs so that the hierarchy can eventually be released in a
finite amount of time.

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
---
Hello,

> May be, we should kill the ref counter to the memory controller root in
> cgroup_kill_sb only if there is no children at all, neither online nor
> offline.

Ah, thanks for the analysis, but I really wanna avoid making hierarchy
destruction conditions opaque to userland.  This is userland visible
behavior.  It shouldn't be determined by kernel internals invisible
outside.  This patch adds ss->unbind() which memcg can hook into to
kick off draining of residual refs.  If this would work, I'll add this
patch to cgroup/for-3.19-fixes, possibly with stable cc'd.

Thanks.

 Documentation/cgroups/cgroups.txt |   12 +++-
 include/linux/cgroup.h|1 +
 kernel/cgroup.c   |   14 --
 3 files changed, 24 insertions(+), 3 deletions(-)

--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -637,7 +637,7 @@ void exit(struct task_struct *task)
 
 Called during task exit.
 
-void bind(struct cgroup *root)
+void bind(struct cgroup_subsys_state *root_css)
 (cgroup_mutex held by caller)
 
 Called when a cgroup subsystem is rebound to a different hierarchy
@@ -645,6 +645,16 @@ and root cgroup. Currently this will onl
 the default hierarchy (which never has sub-cgroups) and a hierarchy
 that is being created/destroyed (and hence has no sub-cgroups).
 
+void unbind(struct cgroup_subsys_state *root_css)
+(cgroup_mutex held by caller)
+
+Called when a cgroup subsys is unbound from its current hierarchy. The
+subsystem must guarantee that all offline cgroups are going to be
+released in a finite amount of time after this function is called.
+Such draining can be asynchronous. The following binding of the
+subsystem to a hierarchy will be delayed till the draining is
+complete.
+
 4. Extended attribute usage
 ===
 
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -654,6 +654,7 @@ struct cgroup_subsys {
 struct cgroup_subsys_state *old_css,
 struct task_struct *task);
void (*bind)(struct cgroup_subsys_state *root_css);
+   void (*unbind)(struct cgroup_subsys_state *root_css);
 
int disabled;
int early_init;
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1910,10 +1910,20 @@ static void cgroup_kill_sb(struct super_block *sb)
 	 * And don't kill the default root.
 	 */
 	if (css_has_online_children(&root->cgrp.self) ||
-	    root == &cgrp_dfl_root)
+	    root == &cgrp_dfl_root) {
 		cgroup_put(&root->cgrp);
-	else
+	} else {
+		struct cgroup_subsys *ss;
+		int i;
+
+		mutex_lock(&cgroup_mutex);
+		for_each_subsys(ss, i)
+			if ((root->subsys_mask & (1UL << i)) && ss->unbind)
+				ss->unbind(cgroup_css(&root->cgrp, ss));
+		mutex_unlock(&cgroup_mutex);
+
 		percpu_ref_kill(&root->cgrp.self.refcnt);
+	}
 
 	kernfs_kill_sb(sb);
 }
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  
