Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
On Sun 11-01-15 15:55:43, Johannes Weiner wrote: > From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Sun, 11 Jan 2015 10:29:05 -0500 > Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during > unbind > > Cgroup core assumes that any outstanding css references after > offlining are temporary in nature, and e.g. mount waits for them to > disappear and release the root cgroup. But leftover page cache and > swapout records in an offlined memcg are only dropped when the pages > get reclaimed under pressure or the swapped out pages get faulted in > from other cgroups, and so those cgroup operations can hang forever. > > Implement the ->unbind() callback to actively get rid of outstanding > references when cgroup core wants them gone. Swap out records are > deleted, such that the swap-in path will charge those pages to the > faulting task. OK, that makes sense to me. > Page cache pages are moved to the root memory cgroup. OK, this is better than reclaiming them. [...] > +static void unbind_lru_list(struct mem_cgroup *memcg, > + struct zone *zone, enum lru_list lru) > +{ > + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); > + struct list_head *list = &lruvec->lists[lru]; > + > + while (!list_empty(list)) { > + unsigned int nr_pages; > + unsigned long flags; > + struct page *page; > + > + spin_lock_irqsave(&zone->lru_lock, flags); > + if (list_empty(list)) { > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + break; > + } > + page = list_last_entry(list, struct page, lru); taking lru_lock for each page calls for troubles. The lock would bounce like crazy. It shouldn't be a big problem to list_move to a local list and then work on that one without the lock. Those pages wouldn't be visible for the reclaim but that would be only temporary. Or if that is not acceptable then just batch at least some number of pages (popular SWAP_CLUSTER_MAX). > + if (!get_page_unless_zero(page)) { > + list_move(&page->lru, list); > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + continue; > + } > + BUG_ON(!PageLRU(page)); > + ClearPageLRU(page); > + del_page_from_lru_list(page, lruvec, lru); > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + > + compound_lock(page); > + nr_pages = hpage_nr_pages(page); > + > + if (!mem_cgroup_move_account(page, nr_pages, > + memcg, root_mem_cgroup)) { > + /* > + * root_mem_cgroup page counters are not used, > + * otherwise we'd have to charge them here. > + */ > + page_counter_uncharge(&memcg->memory, nr_pages); > + if (do_swap_account) > + page_counter_uncharge(&memcg->memsw, nr_pages); > + css_put_many(&memcg->css, nr_pages); > + } > + > + compound_unlock(page); > + > + putback_lru_page(page); > + } > +} > + > +static void unbind_work_fn(struct work_struct *work) > +{ > + struct cgroup_subsys_state *css; > +retry: > + drain_all_stock(root_mem_cgroup); > + > + rcu_read_lock(); > + css_for_each_child(css, &root_mem_cgroup->css) { > + struct mem_cgroup *memcg = mem_cgroup_from_css(css); > + > + /* Drop references from swap-out records */ > + if (do_swap_account) { > + long zapped; > + > + zapped = swap_cgroup_zap_records(memcg->css.id); > + page_counter_uncharge(&memcg->memsw, zapped); > + css_put_many(&memcg->css, zapped); > + } > + > + /* Drop references from leftover LRU pages */ > + css_get(css); > + rcu_read_unlock(); > + atomic_inc(&memcg->moving_account); > + synchronize_rcu(); Why do we need this? Who can migrate to/from offline memcgs? > + while (page_counter_read(&memcg->memory) - > +page_counter_read(&memcg->kmem) > 0) { > + struct zone *zone; > + enum lru_list lru; > + > + lru_add_drain_all(); > + > + for_each_zone(zone) > + for_each_lru(lru) > + unbind_lru_list(memcg, zone, lru); > + > + cond_resched(); > + } > + atomic_dec(&memcg->moving_account); > + rcu_read_lock(); > + css_put(css); > + } > + rcu_read_unlock(); > + /* > + * Swap-in is racy: > +
Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
On Sat 10-01-15 16:43:16, Tejun Heo wrote: > Currently, if a hierarchy doesn't have any live children when it's > unmounted, the hierarchy starts dying by killing its refcnt. The > expectation is that even if there are lingering dead children which > are lingering due to remaining references, they'll be put in a finite > amount of time. When the children are finally released, the hierarchy > is destroyed and all controllers bound to it also are released. > > However, for memcg, the premise that the lingering refs will be put in > a finite amount time is not true. In the absense of memory pressure, > dead memcg's may hang around indefinitely pinned by its pages. This > unfortunately may lead to indefinite hang on the next mount attempt > involving memcg as the mount logic waits for it to get released. > > While we can change hierarchy destruction logic such that a hierarchy > is only destroyed when it's not mounted anywhere and all its children, > live or dead, are gone, this makes whether the hierarchy gets > destroyed or not to be determined by factors opaque to userland. > Userland may or may not get a new hierarchy on the next mount attempt. > Worse, if it explicitly wants to create a new hierarchy with different > options or controller compositions involving memcg, it will fail in an > essentially arbitrary manner. > > We want to guarantee that a hierarchy is destroyed once the > conditions, unmounted and no visible children, are met. To aid it, > this patch introduces a new callback cgroup_subsys->unbind() which is > invoked right before the hierarchy a subsystem is bound to starts > dying. memcg can implement this callback and initiate draining of > remaining refs so that the hierarchy can eventually be released in a > finite amount of time. > > Signed-off-by: Tejun Heo > Cc: Li Zefan > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vladimir Davydov Ohh, I have missed this one as I wasn't on the CC list. FWIW this approach makes sense to me. I just think that we should have a way to fail. E.g. kmem pages are impossible to reclaim because there might be some objects lingering somewhere not bound to a task context and reparenting is hard as Vladimir has pointed out several times already. Normal LRU pages should be reclaimable or reparented to the root easily. I cannot judge the implementation but I agree with the fact that memcg controller should be the one to take an action. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
On 11/01/15 20:55, Johannes Weiner wrote: On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: Currently, if a hierarchy doesn't have any live children when it's unmounted, the hierarchy starts dying by killing its refcnt. The expectation is that even if there are lingering dead children which are lingering due to remaining references, they'll be put in a finite amount of time. When the children are finally released, the hierarchy is destroyed and all controllers bound to it also are released. However, for memcg, the premise that the lingering refs will be put in a finite amount time is not true. In the absense of memory pressure, dead memcg's may hang around indefinitely pinned by its pages. This unfortunately may lead to indefinite hang on the next mount attempt involving memcg as the mount logic waits for it to get released. While we can change hierarchy destruction logic such that a hierarchy is only destroyed when it's not mounted anywhere and all its children, live or dead, are gone, this makes whether the hierarchy gets destroyed or not to be determined by factors opaque to userland. Userland may or may not get a new hierarchy on the next mount attempt. Worse, if it explicitly wants to create a new hierarchy with different options or controller compositions involving memcg, it will fail in an essentially arbitrary manner. We want to guarantee that a hierarchy is destroyed once the conditions, unmounted and no visible children, are met. To aid it, this patch introduces a new callback cgroup_subsys->unbind() which is invoked right before the hierarchy a subsystem is bound to starts dying. memcg can implement this callback and initiate draining of remaining refs so that the hierarchy can eventually be released in a finite amount of time. Signed-off-by: Tejun Heo Cc: Li Zefan Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov --- Hello, May be, we should kill the ref counter to the memory controller root in cgroup_kill_sb only if there is no children at all, neither online nor offline. Ah, thanks for the analysis, but I really wanna avoid making hierarchy destruction conditions opaque to userland. This is userland visible behavior. It shouldn't be determined by kernel internals invisible outside. This patch adds ss->unbind() which memcg can hook into to kick off draining of residual refs. If this would work, I'll add this patch to cgroup/for-3.19-fixes, possibly with stable cc'd. How about this ->unbind() for memcg? From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Sun, 11 Jan 2015 10:29:05 -0500 Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during unbind This patch doesn't cleanly apply on 3.19-rc4 for me (hunks in mm/memcontrol.c). I have manually applied it. With these two patches in, I am still getting the failure. Also, the kworker thread is taking up 100% time (unbind_work) and continues to do so even after 6minutes. Is there something I missed ? Thanks Suzuki Cgroup core assumes that any outstanding css references after offlining are temporary in nature, and e.g. mount waits for them to disappear and release the root cgroup. But leftover page cache and swapout records in an offlined memcg are only dropped when the pages get reclaimed under pressure or the swapped out pages get faulted in from other cgroups, and so those cgroup operations can hang forever. Implement the ->unbind() callback to actively get rid of outstanding references when cgroup core wants them gone. Swap out records are deleted, such that the swap-in path will charge those pages to the faulting task. Page cache pages are moved to the root memory cgroup. Signed-off-by: Johannes Weiner --- include/linux/swap_cgroup.h | 6 +++ mm/memcontrol.c | 126 mm/swap_cgroup.c| 38 + 3 files changed, 170 insertions(+) diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h index 145306bdc92f..ffe0866d2997 100644 --- a/include/linux/swap_cgroup.h +++ b/include/linux/swap_cgroup.h @@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, unsigned short old, unsigned short new); extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); +extern unsigned long swap_cgroup_zap_records(unsigned short id); extern int swap_cgroup_swapon(int type, unsigned long max_pages); extern void swap_cgroup_swapoff(int type); @@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) return 0; } +static inline unsigned long swap_cgroup_zap_records(unsigned short id) +{ + return 0; +} + static inline int swap_cgroup_swapon(int type, unsigned long max_pages) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 692e96407627..40c426add613 10064
Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
On Mon, Jan 12, 2015 at 03:59:56PM +0300, Vladimir Davydov wrote: > I haven't dug deep into the cgroup core, but may be we could detach the > old root in cgroup_kill_sb() and leave it dangling until the last > reference to it has gone? The root isn't the problem here. Individual controllers are as there's only one copy of each and we most likely don't want to carry over child csses from one hierarchy to the next as the controller may operate under a different set of rules. > BTW, IIRC the problem always existed for kmem-active memory cgroups, > because we never had kmem reparenting. May be, we could therefore just > document somewhere that kmem accounting is highly discouraged to be used > in the legacy hierarchy and merge these two patches as is to handle page > cache and swap charges? We won't break anything, because it was always > broken :-) If we're going that route, I think it'd be better to declare hierarchy lifetime rules as essentially opaque to userland and destroy hierarchies only when all its children, dead or alive, are gone. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
On Mon, Jan 12, 2015 at 06:28:45AM -0500, Tejun Heo wrote: > On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote: > > Come to think of it, I wonder how many users actually want to mount > > different controllers subset after unmount. Because we could allow > > It wouldn't be a common use case but, on the face of it, we still > support it. If we collecctively decide that once a sub cgroup is > created for any controller no further hierarchy configuration for that > controller is allowed, that'd work too, but one way or the other, the > behavior, I believe, should be well-defined. As it currently stands, > the conditions and failure mode are opaque to userland, which is never > a good thing. > > > mounting the same subset perfectly well, even if it includes memcg. BTW, > > AFAIU in the unified hierarchy we won't have this problem at all, > > because by definition it mounts all controllers IIRC, so do we need to > > bother fixing this in such a complicated manner at all for the setup > > that's going to be deprecated anyway? > > There will likely be a quite long transition period and if and when > the old things can be removed, this added cleanup logic can go away > with it. It depends on how complex the implementation would get but > as long as it isn't too much and stays mostly isolated from the saner > paths, I think it's probably the right thing to do. We can't just move kmem objects from a per-memcg kmem_cache to the global one fixing page counters, because in contrast to page cache and swap we don't even track all kmem allocations. So we have to keep all per-memcg kmem_cache's somewhere after unmount until they can finally be destroyed, but the whole logic behind per-memcg kmem_cache's destruction is currently tightly interwoven with that of css's (we destroy kmem_cache's from css_free), and there won't be any css's after unmount. That said, it isn't possible to add a couple of isolated functions, which will live their own lives and can be easily removed once we've switched to the unified hierarchy. Quite the contrary, implementing of kmem reparenting would make me rethink and complicate kmemcg code all over the place. That's why I'm rather reluctant to do it. I haven't dug deep into the cgroup core, but may be we could detach the old root in cgroup_kill_sb() and leave it dangling until the last reference to it has gone? BTW, IIRC the problem always existed for kmem-active memory cgroups, because we never had kmem reparenting. May be, we could therefore just document somewhere that kmem accounting is highly discouraged to be used in the legacy hierarchy and merge these two patches as is to handle page cache and swap charges? We won't break anything, because it was always broken :-) Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
Hello, Vladimir. On Mon, Jan 12, 2015 at 11:01:14AM +0300, Vladimir Davydov wrote: > Come to think of it, I wonder how many users actually want to mount > different controllers subset after unmount. Because we could allow It wouldn't be a common use case but, on the face of it, we still support it. If we collecctively decide that once a sub cgroup is created for any controller no further hierarchy configuration for that controller is allowed, that'd work too, but one way or the other, the behavior, I believe, should be well-defined. As it currently stands, the conditions and failure mode are opaque to userland, which is never a good thing. > mounting the same subset perfectly well, even if it includes memcg. BTW, > AFAIU in the unified hierarchy we won't have this problem at all, > because by definition it mounts all controllers IIRC, so do we need to > bother fixing this in such a complicated manner at all for the setup > that's going to be deprecated anyway? There will likely be a quite long transition period and if and when the old things can be removed, this added cleanup logic can go away with it. It depends on how complex the implementation would get but as long as it isn't too much and stays mostly isolated from the saner paths, I think it's probably the right thing to do. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
On Sun, Jan 11, 2015 at 03:55:43PM -0500, Johannes Weiner wrote: > On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: > > > May be, we should kill the ref counter to the memory controller root in > > > cgroup_kill_sb only if there is no children at all, neither online nor > > > offline. > > > > Ah, thanks for the analysis, but I really wanna avoid making hierarchy > > destruction conditions opaque to userland. This is userland visible > > behavior. It shouldn't be determined by kernel internals invisible > > outside. This patch adds ss->unbind() which memcg can hook into to > > kick off draining of residual refs. If this would work, I'll add this > > patch to cgroup/for-3.19-fixes, possibly with stable cc'd. > > How about this ->unbind() for memcg? > > From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Sun, 11 Jan 2015 10:29:05 -0500 > Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during > unbind > > Cgroup core assumes that any outstanding css references after > offlining are temporary in nature, and e.g. mount waits for them to > disappear and release the root cgroup. But leftover page cache and > swapout records in an offlined memcg are only dropped when the pages > get reclaimed under pressure or the swapped out pages get faulted in > from other cgroups, and so those cgroup operations can hang forever. > > Implement the ->unbind() callback to actively get rid of outstanding > references when cgroup core wants them gone. Swap out records are > deleted, such that the swap-in path will charge those pages to the > faulting task. Page cache pages are moved to the root memory cgroup. ... and kmem pages are ignored. I reckon we could reparent them (I submitted the patch set some time ago), but that's going to be tricky and will complicate regular kmem charge/uncharge paths, as well as list_lru_add/del. I don't think we can put up with it, provided we only want reparenting on unmount, do we not? Come to think of it, I wonder how many users actually want to mount different controllers subset after unmount. Because we could allow mounting the same subset perfectly well, even if it includes memcg. BTW, AFAIU in the unified hierarchy we won't have this problem at all, because by definition it mounts all controllers IIRC, so do we need to bother fixing this in such a complicated manner at all for the setup that's going to be deprecated anyway? Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
On Sat, Jan 10, 2015 at 04:43:16PM -0500, Tejun Heo wrote: > Currently, if a hierarchy doesn't have any live children when it's > unmounted, the hierarchy starts dying by killing its refcnt. The > expectation is that even if there are lingering dead children which > are lingering due to remaining references, they'll be put in a finite > amount of time. When the children are finally released, the hierarchy > is destroyed and all controllers bound to it also are released. > > However, for memcg, the premise that the lingering refs will be put in > a finite amount time is not true. In the absense of memory pressure, > dead memcg's may hang around indefinitely pinned by its pages. This > unfortunately may lead to indefinite hang on the next mount attempt > involving memcg as the mount logic waits for it to get released. > > While we can change hierarchy destruction logic such that a hierarchy > is only destroyed when it's not mounted anywhere and all its children, > live or dead, are gone, this makes whether the hierarchy gets > destroyed or not to be determined by factors opaque to userland. > Userland may or may not get a new hierarchy on the next mount attempt. > Worse, if it explicitly wants to create a new hierarchy with different > options or controller compositions involving memcg, it will fail in an > essentially arbitrary manner. > > We want to guarantee that a hierarchy is destroyed once the > conditions, unmounted and no visible children, are met. To aid it, > this patch introduces a new callback cgroup_subsys->unbind() which is > invoked right before the hierarchy a subsystem is bound to starts > dying. memcg can implement this callback and initiate draining of > remaining refs so that the hierarchy can eventually be released in a > finite amount of time. > > Signed-off-by: Tejun Heo > Cc: Li Zefan > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vladimir Davydov > --- > Hello, > > > May be, we should kill the ref counter to the memory controller root in > > cgroup_kill_sb only if there is no children at all, neither online nor > > offline. > > Ah, thanks for the analysis, but I really wanna avoid making hierarchy > destruction conditions opaque to userland. This is userland visible > behavior. It shouldn't be determined by kernel internals invisible > outside. This patch adds ss->unbind() which memcg can hook into to > kick off draining of residual refs. If this would work, I'll add this > patch to cgroup/for-3.19-fixes, possibly with stable cc'd. How about this ->unbind() for memcg? >From d527ba1dbfdb58e1f7c7c4ee12b32ef2e5461990 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Sun, 11 Jan 2015 10:29:05 -0500 Subject: [patch] mm: memcontrol: zap outstanding cache/swap references during unbind Cgroup core assumes that any outstanding css references after offlining are temporary in nature, and e.g. mount waits for them to disappear and release the root cgroup. But leftover page cache and swapout records in an offlined memcg are only dropped when the pages get reclaimed under pressure or the swapped out pages get faulted in from other cgroups, and so those cgroup operations can hang forever. Implement the ->unbind() callback to actively get rid of outstanding references when cgroup core wants them gone. Swap out records are deleted, such that the swap-in path will charge those pages to the faulting task. Page cache pages are moved to the root memory cgroup. Signed-off-by: Johannes Weiner --- include/linux/swap_cgroup.h | 6 +++ mm/memcontrol.c | 126 mm/swap_cgroup.c| 38 + 3 files changed, 170 insertions(+) diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h index 145306bdc92f..ffe0866d2997 100644 --- a/include/linux/swap_cgroup.h +++ b/include/linux/swap_cgroup.h @@ -9,6 +9,7 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, unsigned short old, unsigned short new); extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); +extern unsigned long swap_cgroup_zap_records(unsigned short id); extern int swap_cgroup_swapon(int type, unsigned long max_pages); extern void swap_cgroup_swapoff(int type); @@ -26,6 +27,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) return 0; } +static inline unsigned long swap_cgroup_zap_records(unsigned short id) +{ + return 0; +} + static inline int swap_cgroup_swapon(int type, unsigned long max_pages) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 692e96407627..40c426add613 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5197,6 +5197,131 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) mem_cgroup_from_css(root_css)->use_hierarchy = true; } +static void unbind_lru_list(struct mem_cgroup *memcg, +
[PATCH cgroup/for-3.19-fixes] cgroup: implement cgroup_subsys->unbind() callback
Currently, if a hierarchy doesn't have any live children when it's unmounted, the hierarchy starts dying by killing its refcnt. The expectation is that even if there are lingering dead children which are lingering due to remaining references, they'll be put in a finite amount of time. When the children are finally released, the hierarchy is destroyed and all controllers bound to it also are released. However, for memcg, the premise that the lingering refs will be put in a finite amount time is not true. In the absense of memory pressure, dead memcg's may hang around indefinitely pinned by its pages. This unfortunately may lead to indefinite hang on the next mount attempt involving memcg as the mount logic waits for it to get released. While we can change hierarchy destruction logic such that a hierarchy is only destroyed when it's not mounted anywhere and all its children, live or dead, are gone, this makes whether the hierarchy gets destroyed or not to be determined by factors opaque to userland. Userland may or may not get a new hierarchy on the next mount attempt. Worse, if it explicitly wants to create a new hierarchy with different options or controller compositions involving memcg, it will fail in an essentially arbitrary manner. We want to guarantee that a hierarchy is destroyed once the conditions, unmounted and no visible children, are met. To aid it, this patch introduces a new callback cgroup_subsys->unbind() which is invoked right before the hierarchy a subsystem is bound to starts dying. memcg can implement this callback and initiate draining of remaining refs so that the hierarchy can eventually be released in a finite amount of time. Signed-off-by: Tejun Heo Cc: Li Zefan Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov --- Hello, > May be, we should kill the ref counter to the memory controller root in > cgroup_kill_sb only if there is no children at all, neither online nor > offline. Ah, thanks for the analysis, but I really wanna avoid making hierarchy destruction conditions opaque to userland. This is userland visible behavior. It shouldn't be determined by kernel internals invisible outside. This patch adds ss->unbind() which memcg can hook into to kick off draining of residual refs. If this would work, I'll add this patch to cgroup/for-3.19-fixes, possibly with stable cc'd. Thanks. Documentation/cgroups/cgroups.txt | 12 +++- include/linux/cgroup.h|1 + kernel/cgroup.c | 14 -- 3 files changed, 24 insertions(+), 3 deletions(-) --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -637,7 +637,7 @@ void exit(struct task_struct *task) Called during task exit. -void bind(struct cgroup *root) +void bind(struct cgroup_subsys_state *root_css) (cgroup_mutex held by caller) Called when a cgroup subsystem is rebound to a different hierarchy @@ -645,6 +645,16 @@ and root cgroup. Currently this will onl the default hierarchy (which never has sub-cgroups) and a hierarchy that is being created/destroyed (and hence has no sub-cgroups). +void unbind(struct cgroup_subsys_state *root_css) +(cgroup_mutex held by caller) + +Called when a cgroup subsys is unbound from its current hierarchy. The +subsystem must guarantee that all offline cgroups are going to be +released in a finite amount of time after this function is called. +Such draining can be asynchronous. The following binding of the +subsystem to a hierarchy will be delayed till the draining is +complete. + 4. Extended attribute usage === --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -654,6 +654,7 @@ struct cgroup_subsys { struct cgroup_subsys_state *old_css, struct task_struct *task); void (*bind)(struct cgroup_subsys_state *root_css); + void (*unbind)(struct cgroup_subsys_state *root_css); int disabled; int early_init; --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1910,10 +1910,20 @@ static void cgroup_kill_sb(struct super_ * And don't kill the default root. */ if (css_has_online_children(&root->cgrp.self) || - root == &cgrp_dfl_root) + root == &cgrp_dfl_root) { cgroup_put(&root->cgrp); - else + } else { + struct cgroup_subsys *ss; + int i; + + mutex_lock(&cgroup_mutex); + for_each_subsys(ss, i) + if ((root->subsys_mask & (1UL << i)) && ss->unbind) + ss->unbind(cgroup_css(&root->cgrp, ss)); + mutex_unlock(&cgroup_mutex); + percpu_ref_kill(&root->cgrp.self.refcnt); + } kernfs_kill_sb(sb); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-in