Re: [PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

2015-03-27 Thread Tejun Heo
Hello Vivek.

On Fri, Mar 27, 2015 at 05:06:13PM -0400, Vivek Goyal wrote:
> I was curious to know that why do we need this "struct page *page" when
> trying to attach a inode to a bdi_writeback. Is using current's cgroup
> always not sufficient?

So, memcg's page ownership is first-use based and it never gets
updated once set till the page is released which means that there can
be corner cases where an inode is mostly faulted in by one cgroup and
then constantly dirtied by another.  Because the ownership belongs to
the initial cgroup which instantiated those pages, cgroup writeback
ends up considering the pages as belonging to that initial cgroup and
the foreign detection will trigger if it's being written by a
different cgroup.  H... this isn't a huge problem as once the
foreign detection triggers, the problem will be corrected but still
when the page is availalbe, I think it makes sense to attach to the
page as that's what actually defines the ownership.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

2015-03-27 Thread Vivek Goyal
On Mon, Mar 23, 2015 at 12:54:32AM -0400, Tejun Heo wrote:

[..]
> +/**
> + * inode_attach_wb - associate an inode with its wb
> + * @inode: inode of interest
> + * @page: page being dirtied (may be NULL)
> + *
> + * If @inode doesn't have its wb, associate it with the wb matching the
> + * memcg of @page or, if @page is NULL, %current.  May be called w/ or w/o
> + * @inode->i_lock.
> + */
> +static inline void inode_attach_wb(struct inode *inode, struct page *page)
> +{
> + if (!inode->i_wb)
> + __inode_attach_wb(inode, page);
> +}

Hi Tejun,

I was curious to know that why do we need this "struct page *page" when
trying to attach a inode to a bdi_writeback. Is using current's cgroup
always not sufficient?

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

2015-03-27 Thread Tejun Heo
Hello Vivek.

On Fri, Mar 27, 2015 at 05:06:13PM -0400, Vivek Goyal wrote:
 I was curious to know that why do we need this struct page *page when
 trying to attach a inode to a bdi_writeback. Is using current's cgroup
 always not sufficient?

So, memcg's page ownership is first-use based and it never gets
updated once set till the page is released which means that there can
be corner cases where an inode is mostly faulted in by one cgroup and
then constantly dirtied by another.  Because the ownership belongs to
the initial cgroup which instantiated those pages, cgroup writeback
ends up considering the pages as belonging to that initial cgroup and
the foreign detection will trigger if it's being written by a
different cgroup.  H... this isn't a huge problem as once the
foreign detection triggers, the problem will be corrected but still
when the page is availalbe, I think it makes sense to attach to the
page as that's what actually defines the ownership.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

2015-03-27 Thread Vivek Goyal
On Mon, Mar 23, 2015 at 12:54:32AM -0400, Tejun Heo wrote:

[..]
 +/**
 + * inode_attach_wb - associate an inode with its wb
 + * @inode: inode of interest
 + * @page: page being dirtied (may be NULL)
 + *
 + * If @inode doesn't have its wb, associate it with the wb matching the
 + * memcg of @page or, if @page is NULL, %current.  May be called w/ or w/o
 + * @inode-i_lock.
 + */
 +static inline void inode_attach_wb(struct inode *inode, struct page *page)
 +{
 + if (!inode-i_wb)
 + __inode_attach_wb(inode, page);
 +}

Hi Tejun,

I was curious to know that why do we need this struct page *page when
trying to attach a inode to a bdi_writeback. Is using current's cgroup
always not sufficient?

Thanks
Vivek
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

2015-03-22 Thread Tejun Heo
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback).  This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg.  This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg.  This means that there may be multiple
wb's of a bdi mapped to the same blkcg.  As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state.  This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
by its memcg id.  Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.

v2: Updated so that wb association is per inode and wb is per memcg
rather than blkcg.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 block/blk-cgroup.c   |   7 +-
 fs/fs-writeback.c|   8 +-
 fs/inode.c   |   1 +
 include/linux/backing-dev-defs.h |  59 +-
 include/linux/backing-dev.h  | 195 +++
 include/linux/blk-cgroup.h   |   4 +
 include/linux/fs.h   |   4 +
 include/linux/memcontrol.h   |   4 +
 mm/backing-dev.c | 398 +++
 mm/memcontrol.c  |  19 +-
 mm/page-writeback.c  |  11 +-
 11 files changed, 699 insertions(+), 11 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9e0fe38..d2b1cbf 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -811,6 +812,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state 
*css)
}
 
spin_unlock_irq(>lock);
+
+   wb_blkcg_offline(blkcg);
 }
 
 static void blkcg_css_free(struct cgroup_subsys_state *css)
@@ -841,7 +844,9 @@ done:
spin_lock_init(>lock);
INIT_RADIX_TREE(>blkg_tree, GFP_ATOMIC);
INIT_HLIST_HEAD(>blkg_list);
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+   INIT_LIST_HEAD(>cgwb_list);
+#endif
return >css;
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4fd264d..48db5e6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -173,11 +173,11 @@ void bdi_start_background_writeback(struct 
backing_dev_info *bdi)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-   struct backing_dev_info *bdi = inode_to_bdi(inode);
+   struct bdi_writeback *wb = inode_to_wb(inode);
 
-   spin_lock(>wb.list_lock);
+   spin_lock(>list_lock);
list_del_init(>i_wb_list);
-   spin_unlock(>wb.list_lock);
+   spin_unlock(>list_lock);
 }
 
 /*
@@ -1200,6 +1200,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
 
+   inode_attach_wb(inode, NULL);
+
if (flags & I_DIRTY_INODE)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
diff --git a/fs/inode.c b/fs/inode.c
index f00b16f..55cedf8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -223,6 +223,7 @@ EXPORT_SYMBOL(free_inode_nonrcu);
 void __destroy_inode(struct inode *inode)
 {
BUG_ON(inode_has_buffers(inode));
+   inode_detach_wb(inode);
security_inode_free(inode);
fsnotify_inode_delete(inode);
locks_free_lock_context(inode->i_flctx);
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 9e9eafa..a1e9c40 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -2,8 +2,11 @@
 #define __LINUX_BACKING_DEV_DEFS_H
 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -37,10 +40,43 @@ enum wb_stat_item {
 
 #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+/*
+ * For cgroup writeback, multiple wb's may map to the same blkcg.  Those
+ * wb's can operate mostly independently but should share the congested
+ * state.  To facilitate such sharing, the congested state is tracked using
+ * the following struct which is created on demand, indexed by blkcg ID on
+ * its bdi, and refcounted.
+ */
 struct bdi_writeback_congested {
unsigned long state;/* WB_[a]sync_congested flags */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+   struct backing_dev_info *bdi;   /* the associated bdi */
+ 

[PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

2015-03-22 Thread Tejun Heo
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback).  This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg.  This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg.  This means that there may be multiple
wb's of a bdi mapped to the same blkcg.  As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state.  This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi-wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task.  Each cgwb is indexed on bdi-cgwb_tree
by its memcg id.  Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi-wb.

v2: Updated so that wb association is per inode and wb is per memcg
rather than blkcg.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Jens Axboe ax...@kernel.dk
Cc: Jan Kara j...@suse.cz
---
 block/blk-cgroup.c   |   7 +-
 fs/fs-writeback.c|   8 +-
 fs/inode.c   |   1 +
 include/linux/backing-dev-defs.h |  59 +-
 include/linux/backing-dev.h  | 195 +++
 include/linux/blk-cgroup.h   |   4 +
 include/linux/fs.h   |   4 +
 include/linux/memcontrol.h   |   4 +
 mm/backing-dev.c | 398 +++
 mm/memcontrol.c  |  19 +-
 mm/page-writeback.c  |  11 +-
 11 files changed, 699 insertions(+), 11 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9e0fe38..d2b1cbf 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
 #include linux/module.h
 #include linux/err.h
 #include linux/blkdev.h
+#include linux/backing-dev.h
 #include linux/slab.h
 #include linux/genhd.h
 #include linux/delay.h
@@ -811,6 +812,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state 
*css)
}
 
spin_unlock_irq(blkcg-lock);
+
+   wb_blkcg_offline(blkcg);
 }
 
 static void blkcg_css_free(struct cgroup_subsys_state *css)
@@ -841,7 +844,9 @@ done:
spin_lock_init(blkcg-lock);
INIT_RADIX_TREE(blkcg-blkg_tree, GFP_ATOMIC);
INIT_HLIST_HEAD(blkcg-blkg_list);
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+   INIT_LIST_HEAD(blkcg-cgwb_list);
+#endif
return blkcg-css;
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4fd264d..48db5e6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -173,11 +173,11 @@ void bdi_start_background_writeback(struct 
backing_dev_info *bdi)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-   struct backing_dev_info *bdi = inode_to_bdi(inode);
+   struct bdi_writeback *wb = inode_to_wb(inode);
 
-   spin_lock(bdi-wb.list_lock);
+   spin_lock(wb-list_lock);
list_del_init(inode-i_wb_list);
-   spin_unlock(bdi-wb.list_lock);
+   spin_unlock(wb-list_lock);
 }
 
 /*
@@ -1200,6 +1200,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if ((inode-i_state  flags) != flags) {
const int was_dirty = inode-i_state  I_DIRTY;
 
+   inode_attach_wb(inode, NULL);
+
if (flags  I_DIRTY_INODE)
inode-i_state = ~I_DIRTY_TIME;
inode-i_state |= flags;
diff --git a/fs/inode.c b/fs/inode.c
index f00b16f..55cedf8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -223,6 +223,7 @@ EXPORT_SYMBOL(free_inode_nonrcu);
 void __destroy_inode(struct inode *inode)
 {
BUG_ON(inode_has_buffers(inode));
+   inode_detach_wb(inode);
security_inode_free(inode);
fsnotify_inode_delete(inode);
locks_free_lock_context(inode-i_flctx);
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 9e9eafa..a1e9c40 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -2,8 +2,11 @@
 #define __LINUX_BACKING_DEV_DEFS_H
 
 #include linux/list.h
+#include linux/radix-tree.h
+#include linux/rbtree.h
 #include linux/spinlock.h
 #include linux/percpu_counter.h
+#include linux/percpu-refcount.h
 #include linux/flex_proportions.h
 #include linux/timer.h
 #include linux/workqueue.h
@@ -37,10 +40,43 @@ enum wb_stat_item {
 
 #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+/*
+ * For cgroup writeback, multiple wb's may map to the same blkcg.  Those
+ * wb's can operate mostly independently but should share the congested
+ * state.  To facilitate such sharing, the congested state is