Re: [PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks
Hello Vivek. On Fri, Mar 27, 2015 at 05:06:13PM -0400, Vivek Goyal wrote: > I was curious to know that why do we need this "struct page *page" when > trying to attach a inode to a bdi_writeback. Is using current's cgroup > always not sufficient? So, memcg's page ownership is first-use based and it never gets updated once set till the page is released which means that there can be corner cases where an inode is mostly faulted in by one cgroup and then constantly dirtied by another. Because the ownership belongs to the initial cgroup which instantiated those pages, cgroup writeback ends up considering the pages as belonging to that initial cgroup and the foreign detection will trigger if it's being written by a different cgroup. H... this isn't a huge problem as once the foreign detection triggers, the problem will be corrected but still when the page is availalbe, I think it makes sense to attach to the page as that's what actually defines the ownership. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks
On Mon, Mar 23, 2015 at 12:54:32AM -0400, Tejun Heo wrote: [..] > +/** > + * inode_attach_wb - associate an inode with its wb > + * @inode: inode of interest > + * @page: page being dirtied (may be NULL) > + * > + * If @inode doesn't have its wb, associate it with the wb matching the > + * memcg of @page or, if @page is NULL, %current. May be called w/ or w/o > + * @inode->i_lock. > + */ > +static inline void inode_attach_wb(struct inode *inode, struct page *page) > +{ > + if (!inode->i_wb) > + __inode_attach_wb(inode, page); > +} Hi Tejun, I was curious to know that why do we need this "struct page *page" when trying to attach a inode to a bdi_writeback. Is using current's cgroup always not sufficient? Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks
Hello Vivek. On Fri, Mar 27, 2015 at 05:06:13PM -0400, Vivek Goyal wrote: I was curious to know that why do we need this struct page *page when trying to attach a inode to a bdi_writeback. Is using current's cgroup always not sufficient? So, memcg's page ownership is first-use based and it never gets updated once set till the page is released which means that there can be corner cases where an inode is mostly faulted in by one cgroup and then constantly dirtied by another. Because the ownership belongs to the initial cgroup which instantiated those pages, cgroup writeback ends up considering the pages as belonging to that initial cgroup and the foreign detection will trigger if it's being written by a different cgroup. H... this isn't a huge problem as once the foreign detection triggers, the problem will be corrected but still when the page is availalbe, I think it makes sense to attach to the page as that's what actually defines the ownership. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks
On Mon, Mar 23, 2015 at 12:54:32AM -0400, Tejun Heo wrote: [..] +/** + * inode_attach_wb - associate an inode with its wb + * @inode: inode of interest + * @page: page being dirtied (may be NULL) + * + * If @inode doesn't have its wb, associate it with the wb matching the + * memcg of @page or, if @page is NULL, %current. May be called w/ or w/o + * @inode-i_lock. + */ +static inline void inode_attach_wb(struct inode *inode, struct page *page) +{ + if (!inode-i_wb) + __inode_attach_wb(inode, page); +} Hi Tejun, I was curious to know that why do we need this struct page *page when trying to attach a inode to a bdi_writeback. Is using current's cgroup always not sufficient? Thanks Vivek -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks
For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- block/blk-cgroup.c | 7 +- fs/fs-writeback.c| 8 +- fs/inode.c | 1 + include/linux/backing-dev-defs.h | 59 +- include/linux/backing-dev.h | 195 +++ include/linux/blk-cgroup.h | 4 + include/linux/fs.h | 4 + include/linux/memcontrol.h | 4 + mm/backing-dev.c | 398 +++ mm/memcontrol.c | 19 +- mm/page-writeback.c | 11 +- 11 files changed, 699 insertions(+), 11 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 9e0fe38..d2b1cbf 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -811,6 +812,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css) } spin_unlock_irq(>lock); + + wb_blkcg_offline(blkcg); } static void blkcg_css_free(struct cgroup_subsys_state *css) @@ -841,7 +844,9 @@ done: spin_lock_init(>lock); INIT_RADIX_TREE(>blkg_tree, GFP_ATOMIC); INIT_HLIST_HEAD(>blkg_list); - +#ifdef CONFIG_CGROUP_WRITEBACK + INIT_LIST_HEAD(>cgwb_list); +#endif return >css; } diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 4fd264d..48db5e6 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -173,11 +173,11 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi) */ void inode_wb_list_del(struct inode *inode) { - struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback *wb = inode_to_wb(inode); - spin_lock(>wb.list_lock); + spin_lock(>list_lock); list_del_init(>i_wb_list); - spin_unlock(>wb.list_lock); + spin_unlock(>list_lock); } /* @@ -1200,6 +1200,8 @@ void __mark_inode_dirty(struct inode *inode, int flags) if ((inode->i_state & flags) != flags) { const int was_dirty = inode->i_state & I_DIRTY; + inode_attach_wb(inode, NULL); + if (flags & I_DIRTY_INODE) inode->i_state &= ~I_DIRTY_TIME; inode->i_state |= flags; diff --git a/fs/inode.c b/fs/inode.c index f00b16f..55cedf8 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -223,6 +223,7 @@ EXPORT_SYMBOL(free_inode_nonrcu); void __destroy_inode(struct inode *inode) { BUG_ON(inode_has_buffers(inode)); + inode_detach_wb(inode); security_inode_free(inode); fsnotify_inode_delete(inode); locks_free_lock_context(inode->i_flctx); diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 9e9eafa..a1e9c40 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -2,8 +2,11 @@ #define __LINUX_BACKING_DEV_DEFS_H #include +#include +#include #include #include +#include #include #include #include @@ -37,10 +40,43 @@ enum wb_stat_item { #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids))) +/* + * For cgroup writeback, multiple wb's may map to the same blkcg. Those + * wb's can operate mostly independently but should share the congested + * state. To facilitate such sharing, the congested state is tracked using + * the following struct which is created on demand, indexed by blkcg ID on + * its bdi, and refcounted. + */ struct bdi_writeback_congested { unsigned long state;/* WB_[a]sync_congested flags */ + +#ifdef CONFIG_CGROUP_WRITEBACK + struct backing_dev_info *bdi; /* the associated bdi */ +
[PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks
For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi-wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi-cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi-wb. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo t...@kernel.org Cc: Jens Axboe ax...@kernel.dk Cc: Jan Kara j...@suse.cz --- block/blk-cgroup.c | 7 +- fs/fs-writeback.c| 8 +- fs/inode.c | 1 + include/linux/backing-dev-defs.h | 59 +- include/linux/backing-dev.h | 195 +++ include/linux/blk-cgroup.h | 4 + include/linux/fs.h | 4 + include/linux/memcontrol.h | 4 + mm/backing-dev.c | 398 +++ mm/memcontrol.c | 19 +- mm/page-writeback.c | 11 +- 11 files changed, 699 insertions(+), 11 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 9e0fe38..d2b1cbf 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -15,6 +15,7 @@ #include linux/module.h #include linux/err.h #include linux/blkdev.h +#include linux/backing-dev.h #include linux/slab.h #include linux/genhd.h #include linux/delay.h @@ -811,6 +812,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css) } spin_unlock_irq(blkcg-lock); + + wb_blkcg_offline(blkcg); } static void blkcg_css_free(struct cgroup_subsys_state *css) @@ -841,7 +844,9 @@ done: spin_lock_init(blkcg-lock); INIT_RADIX_TREE(blkcg-blkg_tree, GFP_ATOMIC); INIT_HLIST_HEAD(blkcg-blkg_list); - +#ifdef CONFIG_CGROUP_WRITEBACK + INIT_LIST_HEAD(blkcg-cgwb_list); +#endif return blkcg-css; } diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 4fd264d..48db5e6 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -173,11 +173,11 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi) */ void inode_wb_list_del(struct inode *inode) { - struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback *wb = inode_to_wb(inode); - spin_lock(bdi-wb.list_lock); + spin_lock(wb-list_lock); list_del_init(inode-i_wb_list); - spin_unlock(bdi-wb.list_lock); + spin_unlock(wb-list_lock); } /* @@ -1200,6 +1200,8 @@ void __mark_inode_dirty(struct inode *inode, int flags) if ((inode-i_state flags) != flags) { const int was_dirty = inode-i_state I_DIRTY; + inode_attach_wb(inode, NULL); + if (flags I_DIRTY_INODE) inode-i_state = ~I_DIRTY_TIME; inode-i_state |= flags; diff --git a/fs/inode.c b/fs/inode.c index f00b16f..55cedf8 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -223,6 +223,7 @@ EXPORT_SYMBOL(free_inode_nonrcu); void __destroy_inode(struct inode *inode) { BUG_ON(inode_has_buffers(inode)); + inode_detach_wb(inode); security_inode_free(inode); fsnotify_inode_delete(inode); locks_free_lock_context(inode-i_flctx); diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 9e9eafa..a1e9c40 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -2,8 +2,11 @@ #define __LINUX_BACKING_DEV_DEFS_H #include linux/list.h +#include linux/radix-tree.h +#include linux/rbtree.h #include linux/spinlock.h #include linux/percpu_counter.h +#include linux/percpu-refcount.h #include linux/flex_proportions.h #include linux/timer.h #include linux/workqueue.h @@ -37,10 +40,43 @@ enum wb_stat_item { #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids))) +/* + * For cgroup writeback, multiple wb's may map to the same blkcg. Those + * wb's can operate mostly independently but should share the congested + * state. To facilitate such sharing, the congested state is