Re: [PATCH] xfs: use GFP_NOFS argument in radix_tree_preload

2015-03-22 Thread Taesoo Kim
Hi Dave,

Thank you for letting us know. Since we are not experts on XFS (nor
do we want to be), we really wanted to let you know about a potential
bug that you might have missed (we are trying to help!). That's why
Sanidhya asked (rather than sending a patch) in the first place.

I agree that the comment is misleading and incorrect, but encouraging
a student who spent time cleaning up your mistake might be a better
way to influence him than shouting :)

Taesoo

On 03/23/15 at 04:24pm, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 01:06:23AM -0400, Sanidhya Kashyap wrote:
> > From: Byoungyoung Lee 
> > 
> > Following the convention of other file systems, GFP_NOFS
> > should be used as an argument for radix_tree_preload() instead
> > of GFP_KERNEL.
> 
> "convention of other filesystems" is not a reason for changing from
> GFP_KERNEL to GFP_NOFS. There are rules for when GFP_NOFS needs to
> be used, and so we only need to change the code if one of those
> rules is triggered, i.e. inside a transaction, or holding a lock that
> memory reclaim might require to make progress (e.g. ip->i_ilock,
> buffer locks, etc). The context in which the allocation is made will
> tell you whether GFP_KERNEL is safe or not.
> 
> So while the change probably needs to be made, it needs to be made
> for the right reasons. I haven't looked at the code, but I have
> a pretty good idea of the context the allocation is being made
> under. I'd suggest documenting the call path down to
> xfs_mru_cache_insert(), because that will tell you exactly what
> context the allocation is being made in and hence tell everyone else
> the real reason we need to make this change...
> 
> Call me picky, pedantic and/or annoying, but if you are looking at
> validating/correcting allocation flags then you need to understand
> the rules and context in which the allocation is being made...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


[PATCH 4/8] truncate: swap the order of conditionals in cancel_dirty_page()

2015-03-22 Thread Tejun Heo
cancel_dirty_page() currently performs TestClearPageDirty() and then
tests whether the mapping exists and has cap_account_dirty.  This
patch swaps the order so that it performs the mapping tests first.

If the mapping tests fail, the dirty bit is cleared with ClearPageDirty().
The order of the conditionals is swapped but the end result is the
same.  This will help with foreign cgroup inode wb switching.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/truncate.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index fe2d769..9d40cd4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -108,13 +108,13 @@ void do_invalidatepage(struct page *page, unsigned int 
offset,
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
-   struct mem_cgroup *memcg;
+   struct address_space *mapping = page->mapping;
 
-   memcg = mem_cgroup_begin_page_stat(page);
-   if (TestClearPageDirty(page)) {
-   struct address_space *mapping = page->mapping;
+   if (mapping && mapping_cap_account_dirty(mapping)) {
+   struct mem_cgroup *memcg;
 
-   if (mapping && mapping_cap_account_dirty(mapping)) {
+   memcg = mem_cgroup_begin_page_stat(page);
+   if (TestClearPageDirty(page)) {
struct bdi_writeback *wb = inode_to_wb(mapping->host);
 
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
@@ -123,8 +123,10 @@ void cancel_dirty_page(struct page *page, unsigned int 
account_size)
if (account_size)
task_io_account_cancelled_write(account_size);
}
+   mem_cgroup_end_page_stat(memcg);
+   } else {
+   ClearPageDirty(page);
}
-   mem_cgroup_end_page_stat(memcg);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
2.1.0



[PATCHSET 3/3 block/for-4.1/core] writeback: implement foreign cgroup inode bdi_writeback switching

2015-03-22 Thread Tejun Heo
Hello,

The previous two patchsets [2][3] implemented cgroup writeback support
and backpressure propagation through the dirty throttling mechanism;
however, the inode is assigned to the wb (bdi_writeback) matching the
first dirtied page and stays there until released.  This first-use
policy can easily lead to gross misbehaviors - a single stray dirty
page can cause gigabytes to be written by the wrong cgroup.  Also,
while concurrent write sharing of an inode is extremely rare and
unsupported, inodes jumping cgroups over time are more common.

This patchset implements foreign cgroup inode detection and wb
switching.  Each writeback run tracks the majority wb being written
using a simple but fairly robust algorithm and when an inode
persistently writes out more foreign cgroup pages than local ones, the
inode is transferred to the majority winner.

This patchset adds 8 bytes to inode making the total per-inode space
overhead of cgroup writeback support 16 bytes on 64bit systems.  The
computational overhead should be negligible.  If the writer changes
from one cgroup to another entirely, the mechanism can render the
correct switch verdict in several seconds of IO time in most cases and
it can converge on the correct answer in a reasonable amount of time
even in more ambiguous cases.
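
To make the verdict mechanism above concrete, here is a deliberately
simplified sketch of the idea (the names and the fixed-slot bookkeeping
are illustrative assumptions, not the kernel implementation, which also
weighs rounds by elapsed IO time):

#include <stdbool.h>
#include <stdint.h>

#define HIST_SLOTS      16                      /* recent writeback rounds tracked */
#define HIST_THR_SLOTS  (HIST_SLOTS / 2)        /* switch once foreign rounds win */

struct frn_state {
        uint16_t history;       /* bit set: that round was foreign-dominated */
};

/* Record one writeback round and return true when the majority of the
 * tracked rounds wrote more foreign-cgroup pages than local ones. */
static bool frn_record_round(struct frn_state *s, bool foreign_dominated)
{
        s->history = (s->history << 1) | foreign_dominated;
        return __builtin_popcount(s->history) >= HIST_THR_SLOTS;
}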

This patchset contains the following 8 patches.

 0001-writeback-relocate-wb-_try-_get-wb_put-inode_-attach.patch
 0002-writeback-make-writeback_control-track-the-inode-bei.patch
 0003-writeback-implement-foreign-cgroup-inode-detection.patch
 0004-truncate-swap-the-order-of-conditionals-in-cancel_di.patch
 0005-writeback-implement-locked_-inode_to_wb_and_lock_lis.patch
 0006-writeback-implement-I_WB_SWITCH-and-bdi_writeback-st.patch
 0007-writeback-add-lockdep-annotation-to-inode_to_wb.patch
 0008-writeback-implement-foreign-cgroup-inode-bdi_writeba.patch

This patchset is on top of

  block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
+ [1] [PATCH] writeback: fix possible underflow in write bandwidth calculation
+ [2] [PATCHSET 1/3 v2 block/for-4.1/core] writeback: cgroup writeback support
+ [3] [PATCHSET 2/3 block/for-4.1/core] writeback: cgroup writeback backpressure propagation

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-switch-20150322

diffstat follows.  Thanks.

 fs/buffer.c  |   26 +-
 fs/fs-writeback.c|  499 ++-
 fs/mpage.c   |3 
 include/linux/backing-dev-defs.h |   50 +++
 include/linux/backing-dev.h  |  136 --
 include/linux/fs.h   |   11 
 include/linux/writeback.h|  123 +
 mm/backing-dev.c |   30 --
 mm/filemap.c |2 
 mm/page-writeback.c  |   16 +
 mm/truncate.c|   21 +
 11 files changed, 773 insertions(+), 144 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/1420579582-8516-1-git-send-email...@kernel.org
[1] http://lkml.kernel.org/g/20150323041848.ga8...@htj.duckdns.org
[2] http://lkml.kernel.org/g/1427086499-15657-1-git-send-email...@kernel.org
[3] http://lkml.kernel.org/g/1427087267-16592-1-git-send-email...@kernel.org


[PATCH 5/8] writeback: implement [locked_]inode_to_wb_and_lock_list()

2015-03-22 Thread Tejun Heo
cgroup writeback currently assumes that inode to wb association
doesn't change; however, with the planned foreign inode wb switching
mechanism, the association will change dynamically.

When an inode needs to be put on one of the IO lists of its wb, the
current code simply calls inode_to_wb() and locks the returned wb;
however, with the planned wb switching, the association may change
before locking the wb and may even get released.

This patch implements [locked_]inode_to_wb_and_lock_list() which pins
the associated wb while holding i_lock, releases it, acquires
wb->list_lock and verifies that the association hasn't changed
inbetween.  As the association will be protected by both locks among
other things, this guarantees that the wb is the inode's associated wb
until the list_lock is released.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/fs-writeback.c | 80 +++
 1 file changed, 75 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 478d26e..e888c4b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -246,6 +246,56 @@ void __inode_attach_wb(struct inode *inode, struct page 
*page)
 }
 
 /**
+ * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
+ * @inode: inode of interest with i_lock held
+ *
+ * Returns @inode's wb with its list_lock held.  @inode->i_lock must be
+ * held on entry and is released on return.  The returned wb is guaranteed
+ * to stay @inode's associated wb until its list_lock is released.
+ */
+static struct bdi_writeback *
+locked_inode_to_wb_and_lock_list(struct inode *inode)
+   __releases(&inode->i_lock)
+   __acquires(&wb->list_lock)
+{
+   while (true) {
+   struct bdi_writeback *wb = inode_to_wb(inode);
+
+   /*
+* inode_to_wb() association is protected by both
+* @inode->i_lock and @wb->list_lock but list_lock nests
+* outside i_lock.  Drop i_lock and verify that the
+* association hasn't changed after acquiring list_lock.
+*/
+   wb_get(wb);
+   spin_unlock(&inode->i_lock);
+   spin_lock(&wb->list_lock);
+   wb_put(wb); /* not gonna deref it anymore */
+
+   if (likely(wb == inode_to_wb(inode)))
+   return wb;  /* @inode already has ref */
+
+   spin_unlock(&wb->list_lock);
+   cpu_relax();
+   spin_lock(&inode->i_lock);
+   }
+}
+
+/**
+ * inode_to_wb_and_lock_list - determine an inode's wb and lock it
+ * @inode: inode of interest
+ *
+ * Same as locked_inode_to_wb_and_lock_list() but @inode->i_lock isn't held
+ * on entry.
+ */
+static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode)
+   __acquires(&wb->list_lock)
+{
+   spin_lock(&inode->i_lock);
+   return locked_inode_to_wb_and_lock_list(inode);
+}
+
+/**
  * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
  * @wbc: writeback_control of interest
  * @inode: target inode
@@ -591,6 +641,27 @@ restart:
 
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
+static struct bdi_writeback *
+locked_inode_to_wb_and_lock_list(struct inode *inode)
+   __releases(&inode->i_lock)
+   __acquires(&wb->list_lock)
+{
+   struct bdi_writeback *wb = inode_to_wb(inode);
+
+   spin_unlock(&inode->i_lock);
+   spin_lock(&wb->list_lock);
+   return wb;
+}
+
+static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode)
+   __acquires(&wb->list_lock)
+{
+   struct bdi_writeback *wb = inode_to_wb(inode);
+
+   spin_lock(&wb->list_lock);
+   return wb;
+}
+
 static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
 {
return nr_pages;
@@ -666,9 +737,9 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-   struct bdi_writeback *wb = inode_to_wb(inode);
+   struct bdi_writeback *wb;
 
-   spin_lock(&wb->list_lock);
+   wb = inode_to_wb_and_lock_list(inode);
inode_wb_list_del_locked(inode, wb);
spin_unlock(&wb->list_lock);
 }
@@ -1713,11 +1784,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 * reposition it (that would break b_dirty time-ordering).
 */
if (!was_dirty) {
-   struct bdi_writeback *wb = inode_to_wb(inode);
+   struct bdi_writeback *wb;
bool wakeup_bdi = false;
 
-   spin_unlock(&inode->i_lock);
-   spin_lock(&wb->list_lock);
+   wb = locked_inode_to_wb_and_lock_list(inode);
 
WARN(bdi_cap_writeback_dirty(wb->bdi) &&
 !test_bit(WB_registered, &wb->state),
-- 
2.1.0


Re: [PATCH] lguest: simplify lguest_iret

2015-03-22 Thread Rusty Russell
Denys Vlasenko  writes:
> Signed-off-by: Denys Vlasenko 
> CC: lgu...@lists.ozlabs.org
> CC: x...@kernel.org
> CC: linux-kernel@vger.kernel.org

Oh, thanks, applied!

And now it's down to one instruction, we could change
try_deliver_interrupt() to handle this case (rather than ignoring the
interrupt altogether): just jump to the handler and leave the
stack set up.

Here's a pair of inline patches which attempt to do that (untested!).

Thanks,
Rusty.

lguest: suppress interrupts for single insn, not range.

The last patch reduced our interrupt-suppression region to one address,
so simplify the code somewhat.

Also, remove the obsolete undefined instruction ranges and the comment
which refers to lguest_guest.S instead of head_32.S.

Signed-off-by: Rusty Russell 

diff --git a/arch/x86/include/asm/lguest.h b/arch/x86/include/asm/lguest.h
index e2d4a4afa8c3..3bbc07a57a31 100644
--- a/arch/x86/include/asm/lguest.h
+++ b/arch/x86/include/asm/lguest.h
@@ -20,13 +20,10 @@ extern unsigned long switcher_addr;
 /* Found in switcher.S */
 extern unsigned long default_idt_entries[];
 
-/* Declarations for definitions in lguest_guest.S */
-extern char lguest_noirq_start[], lguest_noirq_end[];
+/* Declarations for definitions in arch/x86/lguest/head_32.S */
+extern char lguest_noirq_iret[];
 extern const char lgstart_cli[], lgend_cli[];
-extern const char lgstart_sti[], lgend_sti[];
-extern const char lgstart_popf[], lgend_popf[];
 extern const char lgstart_pushf[], lgend_pushf[];
-extern const char lgstart_iret[], lgend_iret[];
 
 extern void lguest_iret(void);
 extern void lguest_init(void);
diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index ac4453d8520e..4c8cf656f21f 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -87,8 +87,7 @@
 
 struct lguest_data lguest_data = {
.hcall_status = { [0 ... LHCALL_RING_SIZE-1] = 0xFF },
-   .noirq_start = (u32)lguest_noirq_start,
-   .noirq_end = (u32)lguest_noirq_end,
+   .noirq_iret = (u32)lguest_noirq_iret,
.kernel_address = PAGE_OFFSET,
.blocked_interrupts = { 1 }, /* Block timer interrupts */
.syscall_vec = SYSCALL_VECTOR,
diff --git a/arch/x86/lguest/head_32.S b/arch/x86/lguest/head_32.S
index 583732cc5d18..128fe93161b4 100644
--- a/arch/x86/lguest/head_32.S
+++ b/arch/x86/lguest/head_32.S
@@ -133,9 +133,8 @@ ENTRY(lg_restore_fl)
ret
 /*:*/
 
-/* These demark the EIP range where host should never deliver interrupts. */
-.global lguest_noirq_start
-.global lguest_noirq_end
+/* These demark the EIP where host should never deliver interrupts. */
+.global lguest_noirq_iret
 
 /*M:004
  * When the Host reflects a trap or injects an interrupt into the Guest, it
@@ -174,12 +173,11 @@ ENTRY(lg_restore_fl)
  *
  * The second is harder: copying eflags to lguest_data.irq_enabled will turn
  * interrupts on before we're finished, so we could be interrupted before we
- * return to userspace or wherever.  Our solution to this is to surround the
- * code with lguest_noirq_start: and lguest_noirq_end: labels.  We tell the
+ * return to userspace or wherever.  Our solution to this is to tell the
  * Host that it is *never* to interrupt us there, even if interrupts seem to be
  * enabled. (It's not necessary to protect pop instruction, since
- * data gets updated only after it completes, so we end up surrounding
- * just one instruction, iret).
+ * data gets updated only after it completes, so we only need to protect
+ * one instruction, iret).
  */
 ENTRY(lguest_iret)
pushl   2*4(%esp)
@@ -190,6 +188,5 @@ ENTRY(lguest_iret)
 * prefix makes sure we use the stack segment, which is still valid.
 */
popl    %ss:lguest_data+LGUEST_DATA_irq_enabled
-lguest_noirq_start:
+lguest_noirq_iret:
iret
-lguest_noirq_end:
diff --git a/drivers/lguest/hypercalls.c b/drivers/lguest/hypercalls.c
index 1219af493c0f..19a32280731d 100644
--- a/drivers/lguest/hypercalls.c
+++ b/drivers/lguest/hypercalls.c
@@ -211,10 +211,9 @@ static void initialize(struct lg_cpu *cpu)
 
/*
 * The Guest tells us where we're not to deliver interrupts by putting
-* the range of addresses into "struct lguest_data".
+* the instruction address into "struct lguest_data".
 */
-   if (get_user(cpu->lg->noirq_start, &cpu->lg->lguest_data->noirq_start)
-   || get_user(cpu->lg->noirq_end, &cpu->lg->lguest_data->noirq_end))
+   if (get_user(cpu->lg->noirq_iret, &cpu->lg->lguest_data->noirq_iret))
kill_guest(cpu, "bad guest page %p", cpu->lg->lguest_data);
 
/*
diff --git a/drivers/lguest/interrupts_and_traps.c 
b/drivers/lguest/interrupts_and_traps.c
index 70dfcdc29f1f..6d4c072b61e1 100644
--- a/drivers/lguest/interrupts_and_traps.c
+++ b/drivers/lguest/interrupts_and_traps.c
@@ -204,8 +204,7 @@ void try_deliver_interrupt(struct lg_cpu *cpu, unsigned int 
irq, bool more)
 * They may be in the middle of an iret, where they asked us never to
 * 

[PATCH 2/8] writeback: make writeback_control track the inode being written back

2015-03-22 Thread Tejun Heo
Currently, for cgroup writeback, the IO submission paths directly
associate the bio's with the blkcg from inode_to_wb_blkcg_css();
however, it'd be necessary to keep more writeback context to implement
foreign inode writeback detection.  wbc (writeback_control) is the
natural fit for the extra context - it persists throughout the
writeback of each inode and is passed all the way down to IO
submission paths.

This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
wbc_attach_fdatawrite_inode() which are used to associate wbc with the
inode being written back.  IO submission paths now use wbc_init_bio()
instead of directly associating bio's with blkcg themselves.  This
leaves inode_to_wb_blkcg_css() w/o any user.  The function is removed.

wbc currently only tracks the associated wb (bdi_writeback).  Future
patches will add more for foreign inode detection.  The association is
established under i_lock which will be depended upon when migrating
foreign inodes to other wb's.

Since, at this point, the inode to wb association never changes once
established, going through wbc when initializing bios doesn't cause any
behavior changes.
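
For orientation, the intended calling sequence looks roughly like the
sketch below (the helper names are taken from this patch; the
surrounding writepage logic and bio completion handling are assumed):

/* Sketch only: how an IO submission path threads the wbc association. */
static void example_writepage_path(struct inode *inode, struct page *page,
                                   struct writeback_control *wbc)
{
        struct bio *bio;

        spin_lock(&inode->i_lock);
        wbc_attach_and_unlock_inode(wbc, inode);   /* pin inode->i_wb in wbc */

        bio = bio_alloc(GFP_NOIO, 1);
        wbc_init_bio(wbc, bio);            /* bio inherits the blkcg of wbc->wb */
        /* fill in the bio from @page and submit it (not shown) */

        wbc_detach_inode(wbc);             /* drop the association when done */
}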

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/buffer.c | 24 --
 fs/fs-writeback.c   | 37 +--
 fs/mpage.c  |  2 +-
 include/linux/backing-dev.h | 12 -
 include/linux/writeback.h   | 61 +
 mm/filemap.c|  2 ++
 6 files changed, 110 insertions(+), 28 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index f2d594c..cb2c7ec 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -45,9 +45,9 @@
 #include 
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
-static int submit_bh_blkcg(int rw, struct buffer_head *bh,
-  unsigned long bio_flags,
-  struct cgroup_subsys_state *blkcg_css);
+static int submit_bh_wbc(int rw, struct buffer_head *bh,
+unsigned long bio_flags,
+struct writeback_control *wbc);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1709,7 +1709,6 @@ static int __block_write_full_page(struct inode *inode, 
struct page *page,
unsigned int blocksize, bbits;
int nr_underway = 0;
int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
-   struct cgroup_subsys_state *blkcg_css = inode_to_wb_blkcg_css(inode);
 
head = create_page_buffers(page, inode,
(1 << BH_Dirty)|(1 << BH_Uptodate));
@@ -1798,7 +1797,7 @@ static int __block_write_full_page(struct inode *inode, 
struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
-   submit_bh_blkcg(write_op, bh, 0, blkcg_css);
+   submit_bh_wbc(write_op, bh, 0, wbc);
nr_underway++;
}
bh = next;
@@ -1852,7 +1851,7 @@ recover:
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
-   submit_bh_blkcg(write_op, bh, 0, blkcg_css);
+   submit_bh_wbc(write_op, bh, 0, wbc);
nr_underway++;
}
bh = next;
@@ -3021,9 +3020,8 @@ void guard_bio_eod(int rw, struct bio *bio)
}
 }
 
-static int submit_bh_blkcg(int rw, struct buffer_head *bh,
-  unsigned long bio_flags,
-  struct cgroup_subsys_state *blkcg_css)
+static int submit_bh_wbc(int rw, struct buffer_head *bh,
+unsigned long bio_flags, struct writeback_control *wbc)
 {
struct bio *bio;
int ret = 0;
@@ -3046,8 +3044,8 @@ static int submit_bh_blkcg(int rw, struct buffer_head *bh,
 */
bio = bio_alloc(GFP_NOIO, 1);
 
-   if (blkcg_css)
-   bio_associate_blkcg(bio, blkcg_css);
+   if (wbc)
+   wbc_init_bio(wbc, bio);
 
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
@@ -3082,13 +3080,13 @@ static int submit_bh_blkcg(int rw, struct buffer_head 
*bh,
 
 int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 {
-   return submit_bh_blkcg(rw, bh, bio_flags, NULL);
+   return submit_bh_wbc(rw, bh, bio_flags, NULL);
 }
 EXPORT_SYMBOL_GPL(_submit_bh);
 
 int submit_bh(int rw, struct buffer_head *bh)
 {
-   return submit_bh_blkcg(rw, bh, 0, NULL);
+   return submit_bh_wbc(rw, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dfb7bb6..da87463 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -232,6 +232,37 @@ void 

[PATCH 3/8] writeback: implement foreign cgroup inode detection

2015-03-22 Thread Tejun Heo
As concurrent write sharing of an inode is expected to be very rare
and memcg only tracks page ownership on first-use basis severely
confining the usefulness of such sharing, cgroup writeback tracks
ownership per-inode.  While the support for concurrent write sharing
of an inode is deemed unnecessary, an inode being written to by
different cgroups at different points in time is a lot more common,
and, more importantly, charging only by first-use can too readily lead
to grossly incorrect behaviors (single foreign page can lead to
gigabytes of writeback to be incorrectly attributed).

To resolve this issue, cgroup writeback detects the majority dirtier
of an inode and will transfer the ownership to it.  To avoid
unnecessary oscillation, the detection mechanism keeps track of
history and gives out the switch verdict only if the foreign usage
pattern is stable over a certain amount of time and/or writeback
attempts.

The detection mechanism has fairly low space and computation overhead.
It adds 8 bytes to struct inode (one int and two u16's) and minimal
amount of calculation per IO.  The detection mechanism converges to
the correct answer usually in several seconds of IO time when there's
a clear majority dirtier.  Even when there isn't, it can reach an
acceptable answer fairly quickly under most circumstances.

Please see wbc_detach_inode() for more details.
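
For illustration, the time averaging mentioned above (avg = avg * 7/8 +
new * 1/8, per WB_FRN_TIME_AVG_SHIFT in the parameters below) is a
simple exponentially weighted moving average; a sketch with an assumed
helper name:

/* Illustrative only: fold a new per-round write time into the running
 * average using avg <- avg * 7/8 + sample * 1/8 (shift of 3). */
static inline unsigned int frn_update_avg(unsigned int avg, unsigned int sample)
{
        return avg - (avg >> 3) + (sample >> 3);
}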

This patch only implements detection.  Following patches will
implement actual switching.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/buffer.c   |   4 +-
 fs/fs-writeback.c | 168 +-
 fs/mpage.c|   1 +
 include/linux/fs.h|   5 ++
 include/linux/writeback.h |  16 +
 5 files changed, 191 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index cb2c7ec..a404d8e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3044,8 +3044,10 @@ static int submit_bh_wbc(int rw, struct buffer_head *bh,
 */
bio = bio_alloc(GFP_NOIO, 1);
 
-   if (wbc)
+   if (wbc) {
wbc_init_bio(wbc, bio);
+   wbc_account_io(wbc, bh->b_page, bh->b_size);
+   }
 
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index da87463..478d26e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -201,6 +201,20 @@ static void wb_wait_for_completion(struct backing_dev_info 
*bdi,
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
+/* parameters for foreign inode detection, see wb_detach_inode() */
+#define WB_FRN_TIME_SHIFT  13  /* 1s = 2^13, upto 8 secs w/ 16bit */
+#define WB_FRN_TIME_AVG_SHIFT  3   /* avg = avg * 7/8 + new * 1/8 */
+#define WB_FRN_TIME_CUT_DIV2   /* ignore rounds < avg / 2 */
+#define WB_FRN_TIME_PERIOD (2 * (1 << WB_FRN_TIME_SHIFT))  /* 2s */
+
+#define WB_FRN_HIST_SLOTS  16  /* inode->i_wb_frn_history is 16bit */
+#define WB_FRN_HIST_UNIT   (WB_FRN_TIME_PERIOD / WB_FRN_HIST_SLOTS)
+   /* each slot's duration is 2s / 16 */
+#define WB_FRN_HIST_THR_SLOTS  (WB_FRN_HIST_SLOTS / 2)
+   /* if foreign slots >= 8, switch */
+#define WB_FRN_HIST_MAX_SLOTS  (WB_FRN_HIST_THR_SLOTS / 2 + 1)
+   /* one round can affect upto 5 slots */
+
 void __inode_attach_wb(struct inode *inode, struct page *page)
 {
struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -245,24 +259,174 @@ void wbc_attach_and_unlock_inode(struct 
writeback_control *wbc,
 struct inode *inode)
 {
wbc->wb = inode_to_wb(inode);
+   wbc->inode = inode;
+
+   wbc->wb_id = wbc->wb->memcg_css->id;
+   wbc->wb_lcand_id = inode->i_wb_frn_winner;
+   wbc->wb_tcand_id = 0;
+   wbc->wb_bytes = 0;
+   wbc->wb_lcand_bytes = 0;
+   wbc->wb_tcand_bytes = 0;
+
wb_get(wbc->wb);
spin_unlock(&inode->i_lock);
 }
 
 /**
- * wbc_detach_inode - disassociate wbc from its target inode
- * @wbc: writeback_control of interest
+ * wbc_detach_inode - disassociate wbc from inode and perform foreign detection
+ * @wbc: writeback_control of the just finished writeback
  *
  * To be called after a writeback attempt of an inode finishes and undoes
  * wbc_attach_and_unlock_inode().  Can be called under any context.
+ *
+ * As concurrent write sharing of an inode is expected to be very rare and
+ * memcg only tracks page ownership on first-use basis severely confining
+ * the usefulness of such sharing, cgroup writeback tracks ownership
+ * per-inode.  While the support for concurrent write sharing of an inode
+ * is deemed unnecessary, an inode being written to by different cgroups at
+ * different points in time is a lot more common, and, more importantly,
+ * charging only by first-use can too readily lead to grossly incorrect

[PATCH 1/8] writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()

2015-03-22 Thread Tejun Heo
Currently, the majority of cgroup writeback support, including all of
the above functions, is implemented in include/linux/backing-dev.h and
mm/backing-dev.c; however, the portion closely related to the writeback
logic implemented in include/linux/writeback.h and mm/page-writeback.c
will expand to support foreign writeback detection and correction.

This patch moves wb[_try]_get() and wb_put() to
include/linux/backing-dev-defs.h so that they can be used from
writeback.h and inode_{attach|detach}_wb() to writeback.h and
page-writeback.c.

This is pure reorganization and doesn't introduce any functional
changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/fs-writeback.c| 31 +++
 include/linux/backing-dev-defs.h | 50 
 include/linux/backing-dev.h  | 82 
 include/linux/writeback.h| 46 ++
 mm/backing-dev.c | 30 ---
 5 files changed, 127 insertions(+), 112 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 683bd92..dfb7bb6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 /*
@@ -200,6 +201,36 @@ static void wb_wait_for_completion(struct backing_dev_info 
*bdi,
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
+void __inode_attach_wb(struct inode *inode, struct page *page)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+   struct bdi_writeback *wb = NULL;
+
+   if (inode_cgwb_enabled(inode)) {
+   struct cgroup_subsys_state *memcg_css;
+
+   if (page) {
+   memcg_css = mem_cgroup_css_from_page(page);
+   wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+   } else {
+   /* must pin memcg_css, see wb_get_create() */
+   memcg_css = task_get_css(current, memory_cgrp_id);
+   wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+   css_put(memcg_css);
+   }
+   }
+
+   if (!wb)
+   wb = &bdi->wb;
+
+   /*
+* There may be multiple instances of this function racing to
+* update the same inode.  Use cmpxchg() to tell the winner.
+*/
+   if (unlikely(cmpxchg(&inode->i_wb, NULL, wb)))
+   wb_put(wb);
+}
+
 /**
  * mapping_congested - test whether a mapping is congested for a task
  * @mapping: address space to test for congestion
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8d470b7..e047b49 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -186,4 +186,54 @@ static inline void set_bdi_congested(struct 
backing_dev_info *bdi, int sync)
set_wb_congested(bdi->wb.congested, sync);
 }
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * wb_tryget - try to increment a wb's refcount
+ * @wb: bdi_writeback to get
+ */
+static inline bool wb_tryget(struct bdi_writeback *wb)
+{
+   if (wb != &wb->bdi->wb)
+   return percpu_ref_tryget(&wb->refcnt);
+   return true;
+}
+
+/**
+ * wb_get - increment a wb's refcount
+ * @wb: bdi_writeback to get
+ */
+static inline void wb_get(struct bdi_writeback *wb)
+{
+   if (wb != &wb->bdi->wb)
+   percpu_ref_get(&wb->refcnt);
+}
+
+/**
+ * wb_put - decrement a wb's refcount
+ * @wb: bdi_writeback to put
+ */
+static inline void wb_put(struct bdi_writeback *wb)
+{
+   if (wb != &wb->bdi->wb)
+   percpu_ref_put(&wb->refcnt);
+}
+
+#else  /* CONFIG_CGROUP_WRITEBACK */
+
+static inline bool wb_tryget(struct bdi_writeback *wb)
+{
+   return true;
+}
+
+static inline void wb_get(struct bdi_writeback *wb)
+{
+}
+
+static inline void wb_put(struct bdi_writeback *wb)
+{
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
 #endif /* __LINUX_BACKING_DEV_DEFS_H */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index a9a843c..119f0af 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -243,7 +243,6 @@ void wb_congested_put(struct bdi_writeback_congested 
*congested);
 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
struct cgroup_subsys_state *memcg_css,
gfp_t gfp);
-void __inode_attach_wb(struct inode *inode, struct page *page);
 void wb_memcg_offline(struct mem_cgroup *memcg);
 void wb_blkcg_offline(struct blkcg *blkcg);
 int mapping_congested(struct address_space *mapping, struct task_struct *task,
@@ -266,37 +265,6 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
 }
 
 /**
- * wb_tryget - try to increment a wb's refcount
- * @wb: bdi_writeback to get
- */
-static inline bool wb_tryget(struct bdi_writeback *wb)
-{
-   if (wb != >bdi->wb)
-   return percpu_ref_tryget(>refcnt);
-   return true;
-}
-
-/**
- * 

Re: [PATCH 18/18] mm: vmscan: remove memcg stalling on writeback pages during direct reclaim

2015-03-22 Thread Tejun Heo
On Mon, Mar 23, 2015 at 01:07:47AM -0400, Tejun Heo wrote:
> Because writeback wasn't cgroup aware before, the usual dirty
> throttling mechanism in balance_dirty_pages() didn't work for
> processes under memcg limit.  The writeback path didn't know how much
> memory is available or how fast the dirty pages are being written out
> for a given memcg and balance_dirty_pages() didn't have any measure of
> IO back pressure for the memcg.
> 
> To work around the issue, memcg implemented an ad-hoc dirty throttling
> mechanism in the direct reclaim path by stalling on pages under
> writeback which are encountered during direct reclaim scan.  This is
> rather ugly and crude - none of the configurability, fairness, or
> bandwidth-proportional distribution of the normal path.
> 
> The previous patches implemented proper memcg aware dirty throttling
> and the ad-hoc mechanism is no longer necessary.  Remove it.

Oops, just realized that this can't be removed, at least yet.
!unified path still depends on it.  I'll update the patch to disable
these checks only on the unified hierarchy.

Thanks.

-- 
tejun


[PATCH 7/8] writeback: add lockdep annotation to inode_to_wb()

2015-03-22 Thread Tejun Heo
With the previous two patches, all operations which acquire wb from
inode are either under one of inode->i_lock, mapping->tree_lock or
wb->list_lock, or protected by stat transaction, which will be
depended upon by foreign inode wb switching.

This patch adds a lockdep assertion to inode_to_wb() so that usages
outside the above list locks can be caught easily.
inode_wb_stat_unlocked_begin() is an exception as it's usually
protected by combination of !I_WB_SWITCH and rcu_read_lock().  It's
updated to dereference inode->i_wb directly.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 include/linux/backing-dev.h | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 040be1a..b9937e5 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -326,10 +326,18 @@ wb_get_create_current(struct backing_dev_info *bdi, gfp_t 
gfp)
  * inode_to_wb - determine the wb of an inode
  * @inode: inode of interest
  *
- * Returns the wb @inode is currently associated with.
+ * Returns the wb @inode is currently associated with.  The caller must be
+ * holding either @inode->i_lock, @inode->i_mapping->tree_lock, or the
+ * associated wb's list_lock.
  */
 static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 {
+#ifdef CONFIG_LOCKDEP
+   WARN_ON_ONCE(debug_locks &&
+(!lockdep_is_held(&inode->i_lock) &&
+ !lockdep_is_held(&inode->i_mapping->tree_lock) &&
+ !lockdep_is_held(&inode->i_wb->list_lock)));
+#endif
return inode->i_wb;
 }
 
@@ -360,7 +368,12 @@ inode_wb_stat_unlocked_begin(struct inode *inode, bool 
*lockedp)
 
if (unlikely(*lockedp))
spin_lock_irq(&inode->i_mapping->tree_lock);
-   return inode_to_wb(inode);
+
+   /*
+* Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
+* inode_to_wb() will bark.  Deref directly.
+*/
+   return inode->i_wb;
 }
 
 /**
-- 
2.1.0



[PATCH 6/8] writeback: implement I_WB_SWITCH and bdi_writeback stat update transaction

2015-03-22 Thread Tejun Heo
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place.  This patch builds the
framework for the actual switching.

This patch adds a new inode flag, I_WB_SWITCH, which has two
functions.  First, the easy one, it ensures that there's only one
switching in progress for a given inode.  Second, it's used as a
mechanism to synchronize wb stat updates.

The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively.  As such, when an inode is moved from one wb to another,
the inode's portion of those stats has to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some
places.

This patch solves the problem in a similar way as memcg.  Each such
lockless stat update is wrapped in a transaction surrounded by
inode_wb_stat_unlocked_begin/end().  During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCH is asserted,
mapping->tree_lock is grabbed across the transaction.

In turn, the switching path sets I_WB_SWITCH and waits for an RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against
mapping->tree_lock.
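
Concretely, a stat update site is expected to look roughly like the
sketch below (the begin() helper and its signature appear in this
series; the end() signature used here is assumed):

/* Sketch of one wb stat update wrapped in the unlocked transaction. */
static void example_account_page_writeback(struct inode *inode)
{
        struct bdi_writeback *wb;
        bool locked;

        /* rcu_read_lock() normally; mapping->tree_lock if I_WB_SWITCH is set */
        wb = inode_wb_stat_unlocked_begin(inode, &locked);
        __inc_wb_stat(wb, WB_WRITEBACK);
        inode_wb_stat_unlocked_end(inode, locked);
}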

Combined with the previous patch, this ensures that all wb list and
stat operations are protected by one of inode->i_lock, wb->list_lock,
or mapping->tree_lock while wb switching is in progress.

This patch still doesn't implement the actual switching.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/fs-writeback.c   | 117 +---
 include/linux/backing-dev.h |  53 
 include/linux/fs.h  |   6 +++
 mm/page-writeback.c |  16 --
 mm/truncate.c   |   7 ++-
 5 files changed, 188 insertions(+), 11 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e888c4b..7a1ab24 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -295,6 +295,115 @@ static struct bdi_writeback 
*inode_to_wb_and_lock_list(struct inode *inode)
return locked_inode_to_wb_and_lock_list(inode);
 }
 
+struct inode_switch_wb_context {
+   struct inode*inode;
+   struct bdi_writeback*new_wb;
+
+   struct rcu_head rcu_head;
+   struct work_struct  work;
+};
+
+static void inode_switch_wb_work_fn(struct work_struct *work)
+{
+   struct inode_switch_wb_context *isw =
+   container_of(work, struct inode_switch_wb_context, work);
+   struct inode *inode = isw->inode;
+   struct bdi_writeback *new_wb = isw->new_wb;
+
+   /*
+* By the time control reaches here, RCU grace period has passed
+* since I_WB_SWITCH assertion and all wb stat update transactions
+* between inode_wb_stat_unlocked_begin/end() are guaranteed to be
+* synchronizing against mapping->tree_lock.
+*/
+   spin_lock(&inode->i_lock);
+
+   inode->i_wb_frn_winner = 0;
+   inode->i_wb_frn_avg_time = 0;
+   inode->i_wb_frn_history = 0;
+
+   /*
+* Paired with load_acquire in inode_wb_stat_unlocked_begin() and
+* ensures that the new wb is visible if they see !I_WB_SWITCH.
+*/
+   smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH);
+
+   spin_unlock(&inode->i_lock);
+
+   iput(inode);
+   wb_put(new_wb);
+   kfree(isw);
+}
+
+static void inode_switch_wb_rcu_fn(struct rcu_head *rcu_head)
+{
+   struct inode_switch_wb_context *isw =
+   container_of(rcu_head, struct inode_switch_wb_context, 
rcu_head);
+
+   /* needs to grab bh-unsafe locks, bounce to work item */
+   INIT_WORK(&isw->work, inode_switch_wb_work_fn);
+   schedule_work(&isw->work);
+}
+
+/**
+ * inode_switch_wb - change the wb association of an inode
+ * @inode: target inode
+ * @new_wb_id: ID of the new wb
+ *
+ * Switch @inode's wb association to the wb identified by @new_wb_id.  The
+ * switching is performed asynchronously and may fail silently.
+ */
+static void inode_switch_wb(struct inode *inode, int new_wb_id)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+   struct cgroup_subsys_state *memcg_css;
+   struct inode_switch_wb_context *isw;
+
+   /* noop if seems to be already in progress */
+   if (inode->i_state & I_WB_SWITCH)
+   return;
+
+   isw = kzalloc(sizeof(*isw), GFP_ATOMIC);
+   if (!isw)
+   return;
+
+   /* find and pin the new wb */
+   rcu_read_lock();
+   memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys);
+   if (memcg_css)
+   isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+   rcu_read_unlock();
+   if (!isw->new_wb)
+   goto out_free;
+
+   /* while holding I_WB_SWITCH, no one else can 

[PATCH 8/8] writeback: implement foreign cgroup inode bdi_writeback switching

2015-03-22 Thread Tejun Heo
As concurrent write sharing of an inode is expected to be very rare
and memcg only tracks page ownership on first-use basis severely
confining the usefulness of such sharing, cgroup writeback tracks
ownership per-inode.  While the support for concurrent write sharing
of an inode is deemed unnecessary, an inode being written to by
different cgroups at different points in time is a lot more common,
and, more importantly, charging only by first-use can too readily lead
to grossly incorrect behaviors (single foreign page can lead to
gigabytes of writeback to be incorrectly attributed).

To resolve this issue, cgroup writeback detects the majority dirtier
of an inode and transfers the ownership to it.  The previous patches
implemented the foreign condition detection mechanism and laid the
groundwork.  This patch implements the actual switching.

With the previously implemented [locked_]inode_to_wb_and_lock_list()
and wb stat transaction, grabbing wb->list_lock, inode->i_lock and
mapping->tree_lock gives us full exclusion against all wb operations
on the target inode.  inode_switch_wb_work_fn() grabs all the locks
and transfers the inode atomically along with its RECLAIMABLE and
WRITEBACK stats.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/fs-writeback.c | 86 +--
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7a1ab24..5fc7828 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -308,30 +308,112 @@ static void inode_switch_wb_work_fn(struct work_struct 
*work)
struct inode_switch_wb_context *isw =
container_of(work, struct inode_switch_wb_context, work);
struct inode *inode = isw->inode;
+   struct address_space *mapping = inode->i_mapping;
+   struct bdi_writeback *old_wb = inode->i_wb;
struct bdi_writeback *new_wb = isw->new_wb;
+   struct radix_tree_iter iter;
+   bool switched = false;
+   void **slot;
 
/*
 * By the time control reaches here, RCU grace period has passed
 * since I_WB_SWITCH assertion and all wb stat update transactions
 * between inode_wb_stat_unlocked_begin/end() are guaranteed to be
 * synchronizing against mapping->tree_lock.
+*
+* Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock
+* gives us exclusion against all wb related operations on @inode
+* including IO list manipulations and stat updates.
 */
+   if (old_wb < new_wb) {
+   spin_lock(&old_wb->list_lock);
+   spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING);
+   } else {
+   spin_lock(&new_wb->list_lock);
+   spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING);
+   }
spin_lock(&inode->i_lock);
+   spin_lock_irq(&mapping->tree_lock);
+
+   /*
+* Once I_FREEING is visible under i_lock, the eviction path owns
+* the inode and we shouldn't modify ->i_wb_list.
+*/
+   if (unlikely(inode->i_state & I_FREEING))
+   goto skip_switch;
 
+   /*
+* Count and transfer stats.  Note that PAGECACHE_TAG_DIRTY points
+* to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to
+* pages actually under writeback.
+*/
+   radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+  PAGECACHE_TAG_DIRTY) {
+   struct page *page = radix_tree_deref_slot_protected(slot,
+   &mapping->tree_lock);
+   if (likely(page) && PageDirty(page)) {
+   __dec_wb_stat(old_wb, WB_RECLAIMABLE);
+   __inc_wb_stat(new_wb, WB_RECLAIMABLE);
+   }
+   }
+
+   radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+  PAGECACHE_TAG_WRITEBACK) {
+   struct page *page = radix_tree_deref_slot_protected(slot,
+   &mapping->tree_lock);
+   if (likely(page)) {
+   WARN_ON_ONCE(!PageWriteback(page));
+   __dec_wb_stat(old_wb, WB_WRITEBACK);
+   __inc_wb_stat(new_wb, WB_WRITEBACK);
+   }
+   }
+
+   wb_get(new_wb);
+
+   /*
+* Transfer to @new_wb's IO list if necessary.  The specific list
+* @inode was on is ignored and the inode is put on ->b_dirty which
+* is always correct including from ->b_dirty_time.  The transfer
+* preserves @inode->dirtied_when ordering.
+*/
+   if (!list_empty(&inode->i_wb_list)) {
+   struct inode *pos;
+
+   inode_wb_list_del_locked(inode, old_wb);
+   inode->i_wb = new_wb;
+   list_for_each_entry(pos, &new_wb->b_dirty, i_wb_list)
+   if 

Re: [PATCH] lguest: rename i386_head.S in the comments

2015-03-22 Thread Rusty Russell
Alexander Kuleshov  writes:
> i386_head.S was renamed to head_32.S, so let's update the references
> in the comments too.

Thanks, applied!

Cheers,
Rusty.


Re: [PATCH 5/7 linux-next] ASoC: ak4554: constify of_device_id array

2015-03-22 Thread Mark Brown
On Wed, Mar 18, 2015 at 05:49:00PM +0100, Fabian Frederick wrote:
> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.




Re: [PATCH] xfs: use GFP_NOFS argument in radix_tree_preload

2015-03-22 Thread Dave Chinner
On Mon, Mar 23, 2015 at 01:06:23AM -0400, Sanidhya Kashyap wrote:
> From: Byoungyoung Lee 
> 
> Following the convention of other file systems, GFP_NOFS
> should be used as an argument for radix_tree_preload() instead
> of GFP_KERNEL.

"convention of other filesystems" is not a reason for changing from
GFP_KERNEL to GFP_NOFS. There are rules for when GFP_NOFS needs to
be used, and so we only need to change the code if one of those
rules is triggered, i.e. inside a transaction, or holding a lock that
memory reclaim might require to make progress (e.g. ip->i_ilock,
buffer locks, etc). The context in which the allocation is made will
tell you whether GFP_KERNEL is safe or not.
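
[For context, the generic pattern under discussion looks roughly like
the sketch below.  It is illustrative only, not the xfs_mru_cache code;
whether GFP_NOFS is actually required depends on the calling context
described above.]

/* Sketch of the radix tree preload pattern (not the XFS code).  GFP_NOFS
 * is needed when this can run inside a transaction or while holding locks
 * that memory reclaim may need; otherwise GFP_KERNEL is fine. */
static int example_radix_insert(struct radix_tree_root *root,
                                unsigned long index, void *item)
{
        int error;

        error = radix_tree_preload(GFP_NOFS);   /* preload outside the lock */
        if (error)
                return error;

        /* take the lock protecting @root here (not shown) */
        error = radix_tree_insert(root, index, item);
        radix_tree_preload_end();
        return error;
}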

So while the change probably needs to be made, it needs to be made
for the right reasons. I haven't looked at the code, but I have
a pretty good idea of the context the allocation is being made
under. I'd suggest documenting the call path down to
xfs_mru_cache_insert(), because that will tell you exactly what
context the allocation is being made in and hence tell everyone else
the real reason we need to make this change...

Call me picky, pedantic and/or annoying, but if you are looking at
validating/correcting allocation flags then you need to understand
the rules and context in which the allocation is being made...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH 01/48] memcg: add per cgroup dirty page accounting

2015-03-22 Thread Tejun Heo
From: Greg Thelen 

When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
per memcg memory.stat cgroupfs file.  The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback.  It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter.  The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.
Typical update operation:
memcg = mem_cgroup_begin_page_stat(page)
if (TestSetPageDirty()) {
[...]
mem_cgroup_update_page_stat(memcg)
}
mem_cgroup_end_page_stat(memcg)

Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
  rcu_read_lock()
- With CONFIG_MEMCG and inter memcg  task movement, it's
  rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat() which returns the memcg later
needed for mem_cgroup_update_page_stat().

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
  __mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
  __delete_from_page_cache(), replace_page_cache_page(),
  invalidate_complete_page2(), and __remove_mapping().

    text    data     bss      dec    hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
+192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
+773 text bytes

Performance tests run on v4.0-rc1-36-g4f671fe2f952.  Lower is better for
all metrics, they're all wall clock or cycle counts.  The read and write
fault benchmarks just measure fault time, they do not include I/O time.

* CONFIG_MEMCG not set:
                          baseline                         patched
  kbuild                  1m25.03(+-0.088% 3 samples)      1m25.426667(+-0.120% 3 samples)
  dd write 100 MiB        0.859211561 +-15.10%             0.874162885 +-15.03%
  dd write 200 MiB        1.670653105 +-17.87%             1.669384764 +-11.99%
  dd write 1000 MiB       8.434691190 +-14.15%             8.474733215 +-14.77%
  read fault cycles       254.0(+-0.000% 10 samples)       253.0(+-0.000% 10 samples)
  write fault cycles      2021.2(+-3.070% 10 samples)      1984.5(+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
                          baseline                         patched
  kbuild                  1m25.716667(+-0.105% 3 samples)  1m25.686667(+-0.153% 3 samples)
  dd write 100 MiB        0.855650830 +-14.90%             0.887557919 +-14.90%
  dd write 200 MiB        1.688322953 +-12.72%             1.667682724 +-13.33%
  dd write 1000 MiB       8.418601605 +-14.30%             8.673532299 +-15.00%
  read fault cycles       266.0(+-0.000% 10 samples)       266.0(+-0.000% 10 samples)
  write fault cycles      2051.7(+-1.349% 10 samples)      2049.6(+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
                          baseline                         patched
  kbuild                  1m26.12(+-0.273% 3 samples)      1m25.76(+-0.127% 3 samples)
  dd write 100 MiB        0.861723964 +-15.25%             0.818129350 +-14.82%
  dd write 200 MiB        1.669887569 +-13.30%             1.698645885 +-13.27%
  dd write 1000 MiB       8.383191730 +-14.65%             8.351742280 +-14.52%
  read fault cycles       265.7(+-0.172% 10 samples)       267.0(+-0.000% 10 samples)
  write fault cycles      2070.6(+-1.512% 10 samples)      2084.4(+-2.148% 10 samples)

As expected anon page faults are not affected by this patch.

Signed-off-by: Sha Zhengju 
Signed-off-by: Greg Thelen 
---
 Documentation/cgroups/memory.txt |  1 +
 fs/buffer.c  | 34 +++---
 fs/xfs/xfs_aops.c| 12 ++--
 include/linux/memcontrol.h   |  1 +
 include/linux/mm.h   |  3 ++-
 include/linux/pagemap.h  |  3 ++-
 mm/filemap.c 

[PATCH 08/48] blkcg: implement bio_associate_blkcg()

2015-03-22 Thread Tejun Heo
Currently, a bio can only be associated with the io_context and blkcg
of %current using bio_associate_current().  This is too restrictive
for cgroup writeback support.  Implement bio_associate_blkcg() which
associates a bio with the specified blkcg.

bio_associate_blkcg() leaves the io_context unassociated.
bio_associate_current() is updated so that it considers a bio as
already associated if it has a blkcg_css, instead of an io_context,
associated with it.
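
A hypothetical caller (not part of this patch) would use it roughly
like this:

/* Illustrative only: charge a bio to a known blkcg css before submission. */
static void example_submit_on_behalf(int rw, struct bio *bio,
                                     struct cgroup_subsys_state *blkcg_css)
{
        /* takes an extra css ref which is dropped when the bio is released */
        if (bio_associate_blkcg(bio, blkcg_css))
                pr_warn("bio already associated with a blkcg\n");

        submit_bio(rw, bio);
}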

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Vivek Goyal 
---
 block/bio.c | 24 +++-
 include/linux/bio.h |  3 +++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 968683e..ab7517d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1971,6 +1971,28 @@ struct bio_set *bioset_create_nobvec(unsigned int 
pool_size, unsigned int front_
 EXPORT_SYMBOL(bioset_create_nobvec);
 
 #ifdef CONFIG_BLK_CGROUP
+
+/**
+ * bio_associate_blkcg - associate a bio with the specified blkcg
+ * @bio: target bio
+ * @blkcg_css: css of the blkcg to associate
+ *
+ * Associate @bio with the blkcg specified by @blkcg_css.  Block layer will
+ * treat @bio as if it were issued by a task which belongs to the blkcg.
+ *
+ * This function takes an extra reference of @blkcg_css which will be put
+ * when @bio is released.  The caller must own @bio and is responsible for
+ * synchronizing calls to this function.
+ */
+int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
+{
+   if (unlikely(bio->bi_css))
+   return -EBUSY;
+   css_get(blkcg_css);
+   bio->bi_css = blkcg_css;
+   return 0;
+}
+
 /**
  * bio_associate_current - associate a bio with %current
  * @bio: target bio
@@ -1988,7 +2010,7 @@ int bio_associate_current(struct bio *bio)
 {
struct io_context *ioc;
 
-   if (bio->bi_ioc)
+   if (bio->bi_css)
return -EBUSY;
 
ioc = current->io_context;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127..cbc5d1d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -469,9 +469,12 @@ extern void bvec_free(mempool_t *, struct bio_vec *, 
unsigned int);
 extern unsigned int bvec_nr_vecs(unsigned short idx);
 
 #ifdef CONFIG_BLK_CGROUP
+int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state 
*blkcg_css);
 int bio_associate_current(struct bio *bio);
 void bio_disassociate_task(struct bio *bio);
 #else  /* CONFIG_BLK_CGROUP */
+static inline int bio_associate_blkcg(struct bio *bio,
+   struct cgroup_subsys_state *blkcg_css) { return 0; }
 static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
 static inline void bio_disassociate_task(struct bio *bio) { }
 #endif /* CONFIG_BLK_CGROUP */
-- 
2.1.0



[PATCH 06/48] cgroup, block: implement task_get_css() and use it in bio_associate_current()

2015-03-22 Thread Tejun Heo
bio_associate_current() currently open codes task_css() and
css_tryget_online() to find and pin $current's blkcg css.  Abstract it
into task_get_css() which is implemented from cgroup side.  As a task
is always associated with an online css for every subsystem except
while the css_set update is propagating, task_get_css() retries till
css_tryget_online() succeeds.

This is a cleanup and shouldn't lead to noticeable behavior changes.

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
Cc: Jens Axboe 
Cc: Vivek Goyal 
---
 block/bio.c| 11 +--
 include/linux/cgroup.h | 25 +
 2 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f66a4ea..968683e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1987,7 +1987,6 @@ EXPORT_SYMBOL(bioset_create_nobvec);
 int bio_associate_current(struct bio *bio)
 {
struct io_context *ioc;
-   struct cgroup_subsys_state *css;
 
if (bio->bi_ioc)
return -EBUSY;
@@ -1996,17 +1995,9 @@ int bio_associate_current(struct bio *bio)
if (!ioc)
return -ENOENT;
 
-   /* acquire active ref on @ioc and associate */
get_io_context_active(ioc);
bio->bi_ioc = ioc;
-
-   /* associate blkcg if exists */
-   rcu_read_lock();
-   css = task_css(current, blkio_cgrp_id);
-   if (css && css_tryget_online(css))
-   bio->bi_css = css;
-   rcu_read_unlock();
-
+   bio->bi_css = task_get_css(current, blkio_cgrp_id);
return 0;
 }
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..e7da0aa 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -774,6 +774,31 @@ static inline struct cgroup_subsys_state *task_css(struct 
task_struct *task,
 }
 
 /**
+ * task_get_css - find and get the css for (task, subsys)
+ * @task: the target task
+ * @subsys_id: the target subsystem ID
+ *
+ * Find the css for the (@task, @subsys_id) combination, increment a
+ * reference on and return it.  This function is guaranteed to return a
+ * valid css.
+ */
+static inline struct cgroup_subsys_state *
+task_get_css(struct task_struct *task, int subsys_id)
+{
+   struct cgroup_subsys_state *css;
+
+   rcu_read_lock();
+   while (true) {
+   css = task_css(task, subsys_id);
+   if (likely(css_tryget_online(css)))
+   break;
+   cpu_relax();
+   }
+   rcu_read_unlock();
+   return css;
+}
+
+/**
  * task_css_is_root - test whether a task belongs to the root css
  * @task: the target task
  * @subsys_id: the target subsystem ID
-- 
2.1.0



[PATCH 02/48] blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h

2015-03-22 Thread Tejun Heo
cgroup aware writeback support will require exposing some of the blkcg
details.  In preparation, move block/blk-cgroup.h to
include/linux/blk-cgroup.h.  This patch is a pure file move.

Signed-off-by: Tejun Heo 
Cc: Vivek Goyal 
---
 block/blk-cgroup.c |   2 +-
 block/blk-cgroup.h | 603 -
 block/blk-core.c   |   2 +-
 block/blk-sysfs.c  |   2 +-
 block/blk-throttle.c   |   2 +-
 block/cfq-iosched.c|   2 +-
 block/elevator.c   |   2 +-
 include/linux/blk-cgroup.h | 603 +
 8 files changed, 609 insertions(+), 609 deletions(-)
 delete mode 100644 block/blk-cgroup.h
 create mode 100644 include/linux/blk-cgroup.h

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0ac817b..c3226ce 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 #include 
-#include "blk-cgroup.h"
+#include 
 #include "blk.h"
 
 #define MAX_KEY_LEN 100
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
deleted file mode 100644
index c567865..000
--- a/block/blk-cgroup.h
+++ /dev/null
@@ -1,603 +0,0 @@
-#ifndef _BLK_CGROUP_H
-#define _BLK_CGROUP_H
-/*
- * Common Block IO controller cgroup interface
- *
- * Based on ideas and code from CFQ, CFS and BFQ:
- * Copyright (C) 2003 Jens Axboe 
- *
- * Copyright (C) 2008 Fabio Checconi 
- *   Paolo Valente 
- *
- * Copyright (C) 2009 Vivek Goyal 
- *   Nauman Rafique 
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/* Max limits for throttle policy */
-#define THROTL_IOPS_MAXUINT_MAX
-
-/* CFQ specific, out here for blkcg->cfq_weight */
-#define CFQ_WEIGHT_MIN 10
-#define CFQ_WEIGHT_MAX 1000
-#define CFQ_WEIGHT_DEFAULT 500
-
-#ifdef CONFIG_BLK_CGROUP
-
-enum blkg_rwstat_type {
-   BLKG_RWSTAT_READ,
-   BLKG_RWSTAT_WRITE,
-   BLKG_RWSTAT_SYNC,
-   BLKG_RWSTAT_ASYNC,
-
-   BLKG_RWSTAT_NR,
-   BLKG_RWSTAT_TOTAL = BLKG_RWSTAT_NR,
-};
-
-struct blkcg_gq;
-
-struct blkcg {
-   struct cgroup_subsys_state  css;
-   spinlock_t  lock;
-
-   struct radix_tree_root  blkg_tree;
-   struct blkcg_gq *blkg_hint;
-   struct hlist_head   blkg_list;
-
-   /* TODO: per-policy storage in blkcg */
-   unsigned intcfq_weight; /* belongs to cfq */
-   unsigned intcfq_leaf_weight;
-};
-
-struct blkg_stat {
-   struct u64_stats_sync   syncp;
-   uint64_tcnt;
-};
-
-struct blkg_rwstat {
-   struct u64_stats_sync   syncp;
-   uint64_tcnt[BLKG_RWSTAT_NR];
-};
-
-/*
- * A blkcg_gq (blkg) is association between a block cgroup (blkcg) and a
- * request_queue (q).  This is used by blkcg policies which need to track
- * information per blkcg - q pair.
- *
- * There can be multiple active blkcg policies and each has its private
- * data on each blkg, the size of which is determined by
- * blkcg_policy->pd_size.  blkcg core allocates and frees such areas
- * together with blkg and invokes pd_init/exit_fn() methods.
- *
- * Such private data must embed struct blkg_policy_data (pd) at the
- * beginning and pd_size can't be smaller than pd.
- */
-struct blkg_policy_data {
-   /* the blkg and policy id this per-policy data belongs to */
-   struct blkcg_gq *blkg;
-   int plid;
-
-   /* used during policy activation */
-   struct list_headalloc_node;
-};
-
-/* association between a blk cgroup and a request queue */
-struct blkcg_gq {
-   /* Pointer to the associated request_queue */
-   struct request_queue*q;
-   struct list_headq_node;
-   struct hlist_node   blkcg_node;
-   struct blkcg*blkcg;
-
-   /* all non-root blkcg_gq's are guaranteed to have access to parent */
-   struct blkcg_gq *parent;
-
-   /* request allocation list for this blkcg-q pair */
-   struct request_list rl;
-
-   /* reference count */
-   atomic_trefcnt;
-
-   /* is this blkg online? protected by both blkcg and q locks */
-   boolonline;
-
-   struct blkg_policy_data *pd[BLKCG_MAX_POLS];
-
-   struct rcu_head rcu_head;
-};
-
-typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_online_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_offline_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_exit_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg);
-
-struct blkcg_policy {
-   int plid;
-   /* policy specific private data 

[PATCH 03/48] update !CONFIG_BLK_CGROUP dummies in include/linux/blk-cgroup.h

2015-03-22 Thread Tejun Heo
The header file will be used more widely with the pending cgroup
writeback support and the current set of dummy declarations aren't
enough to handle different config combinations.  Update as follows.

* Drop the struct cgroup declaration.  None of the dummy defs need it.

* Define blkcg as an empty struct instead of just declaring it.

* Wrap dummy function defs in CONFIG_BLOCK.  Some functions use block
  data types and none of them are to be used w/o block enabled.

Signed-off-by: Tejun Heo 
---
 include/linux/blk-cgroup.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index c567865..51f95b3 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -558,8 +558,8 @@ static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
 
 #else  /* CONFIG_BLK_CGROUP */
 
-struct cgroup;
-struct blkcg;
+struct blkcg {
+};
 
 struct blkg_policy_data {
 };
@@ -570,6 +570,8 @@ struct blkcg_gq {
 struct blkcg_policy {
 };
 
+#ifdef CONFIG_BLOCK
+
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { 
return NULL; }
 static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
 static inline void blkcg_drain_queue(struct request_queue *q) { }
@@ -599,5 +601,6 @@ static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q->root_rl; }
 #define blk_queue_for_each_rl(rl, q)   \
for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
 
+#endif /* CONFIG_BLOCK */
 #endif /* CONFIG_BLK_CGROUP */
 #endif /* _BLK_CGROUP_H */
-- 
2.1.0



[PATCH 05/48] blkcg: add blkcg_root_css

2015-03-22 Thread Tejun Heo
Add global constant blkcg_root_css which points to &blkcg_root.css.
This will be used by cgroup writeback support.  If blkcg is disabled,
it's defined as ERR_PTR(-EINVAL).

v2: The declarations moved to include/linux/blk-cgroup.h as suggested
by Vivek.
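
As a sketch of the intended calling convention (illustrative, not part
of this patch), code that can also be built with blkcg disabled would
treat the constant as an error pointer before dereferencing it:

	/* illustrative only: ERR_PTR(-EINVAL) when CONFIG_BLK_CGROUP=n */
	struct cgroup_subsys_state *css = blkcg_root_css;

	if (!IS_ERR(css))
		use_root_blkcg_css(css);	/* hypothetical consumer */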

Signed-off-by: Tejun Heo 
Cc: Vivek Goyal 
Cc: Jens Axboe 
---
 block/blk-cgroup.c | 2 ++
 include/linux/blk-cgroup.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c3226ce..9e0fe38 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,6 +30,8 @@ struct blkcg blkcg_root = { .cfq_weight = 2 * 
CFQ_WEIGHT_DEFAULT,
.cfq_leaf_weight = 2 * CFQ_WEIGHT_DEFAULT, };
 EXPORT_SYMBOL_GPL(blkcg_root);
 
+struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
+
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
 static bool blkcg_policy_enabled(struct request_queue *q,
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 51f95b3..65f0c17 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -134,6 +134,7 @@ struct blkcg_policy {
 };
 
 extern struct blkcg blkcg_root;
+extern struct cgroup_subsys_state * const blkcg_root_css;
 
 struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
@@ -570,6 +571,8 @@ struct blkcg_gq {
 struct blkcg_policy {
 };
 
+#define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
+
 #ifdef CONFIG_BLOCK
 
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { 
return NULL; }
-- 
2.1.0



[PATCH 10/48] writeback: move backing_dev_info->state into bdi_writeback

2015-03-22 Thread Tejun Heo
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
  changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeroing of
  wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: drbd-...@lists.linbit.com
Cc: Neil Brown 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
---
 block/blk-core.c   |  1 -
 drivers/block/drbd/drbd_main.c | 10 +-
 drivers/md/dm.c|  2 +-
 drivers/md/raid1.c |  4 ++--
 drivers/md/raid10.c|  2 +-
 fs/fs-writeback.c  | 14 +++---
 include/linux/backing-dev.h| 24 
 mm/backing-dev.c   | 20 ++--
 8 files changed, 38 insertions(+), 39 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 56da125..fa1314e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -606,7 +606,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 
q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-   q->backing_dev_info.state = 0;
q->backing_dev_info.capabilities = 0;
q->backing_dev_info.name = "block";
q->node = node_id;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 1fc8342..61b00aa 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2360,7 +2360,7 @@ static void drbd_cleanup(void)
  * @congested_data:User data
  * @bdi_bits:  Bits the BDI flusher thread is currently interested in
  *
- * Returns 1flags)) {
-   r |= (1 << BDI_async_congested);
+   r |= (1 << WB_async_congested);
/* Without good local data, we would need to read from remote,
 * and that would need the worker thread as well, which is
 * currently blocked waiting for that usermode helper to
 * finish.
 */
if (!get_ldev_if_state(device, D_UP_TO_DATE))
-   r |= (1 << BDI_sync_congested);
+   r |= (1 << WB_sync_congested);
else
put_ldev(device);
r &= bdi_bits;
@@ -2400,9 +2400,9 @@ static int drbd_congested(void *congested_data, int 
bdi_bits)
reason = 'b';
}
 
-   if (bdi_bits & (1 << BDI_async_congested) &&
+   if (bdi_bits & (1 << WB_async_congested) &&
test_bit(NET_CONGESTED, 
_peer_device(device)->connection->flags)) {
-   r |= (1 << BDI_async_congested);
+   r |= (1 << WB_async_congested);
reason = reason == 'b' ? 'a' : 'n';
}
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 73f2880..c076982 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2039,7 +2039,7 @@ static int dm_any_congested(void *congested_data, int 
bdi_bits)
 * the query about congestion status of request_queue
 */
if (dm_request_based(md))
-   r = md->queue->backing_dev_info.state &
+   r = md->queue->backing_dev_info.wb.state &
bdi_bits;
else
r = dm_table_any_congested(map, bdi_bits);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238..2fca392 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -739,7 +739,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
struct r1conf *conf = mddev->private;
int i, ret = 0;
 
-   if ((bits & (1 << BDI_async_congested)) &&
+   if ((bits & (1 << WB_async_congested)) &&
conf->pending_count >= max_queued_requests)
return 1;
 
@@ -754,7 +754,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
/* Note the '|| 1' - when read_balance prefers
 * non-congested targets, it can be removed
 */
-   if ((bits & (1backing_dev_info, 
bits);
diff --git a/drivers/md/raid10.c 

[PATCH 12/48] writeback: move bandwidth related fields from backing_dev_info into bdi_writeback

2015-03-22 Thread Tejun Heo
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.

* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
  write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
  balanced_dirty_ratelimit, completions and dirty_exceeded.

* writeback_chunk_size() and over_bground_thresh() now take @wb instead
  of @bdi.

* bdi_writeout_fraction(bdi, ...)	-> wb_writeout_fraction(wb, ...)
  bdi_dirty_limit(bdi, ...)		-> wb_dirty_limit(wb, ...)
  bdi_position_ratio(bdi, ...)		-> wb_position_ratio(wb, ...)
  bdi_update_write_bandwidth(bdi, ...)	-> wb_update_write_bandwidth(wb, ...)
  [__]bdi_update_bandwidth(bdi, ...)	-> [__]wb_update_bandwidth(wb, ...)
  bdi_{max|min}_pause(bdi, ...)		-> wb_{max|min}_pause(wb, ...)
  bdi_dirty_limits(bdi, ...)		-> wb_dirty_limits(wb, ...)

* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
  respectively.  Note that explicit zeroing is dropped in the process
  as wb's are cleared in entirety anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
  introducing no behavior changes.
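
Schematically, a call site changes as in the sketch below (most
arguments elided; the wb is still the single one embedded in the bdi):

	/* before: bandwidth state hung off the bdi */
	bdi_update_bandwidth(bdi, start_time);

	/* after: the same state lives in bdi->wb */
	wb_update_bandwidth(&bdi->wb, start_time);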

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Jaegeuk Kim 
Cc: Steven Whitehouse 
---
 fs/f2fs/node.c   |   4 +-
 fs/f2fs/segment.h|   2 +-
 fs/fs-writeback.c|  17 ++-
 fs/gfs2/super.c  |   2 +-
 include/linux/backing-dev.h  |  20 +--
 include/linux/writeback.h|  19 ++-
 include/trace/events/writeback.h |   8 +-
 mm/backing-dev.c |  45 +++
 mm/page-writeback.c  | 262 ---
 9 files changed, 187 insertions(+), 192 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 97bd9d3..a97da4e 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -51,7 +51,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
PAGE_CACHE_SHIFT;
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2);
} else if (type == DIRTY_DENTS) {
-   if (sbi->sb->s_bdi->dirty_exceeded)
+   if (sbi->sb->s_bdi->wb.dirty_exceeded)
return false;
mem_size = get_pages(sbi, F2FS_DIRTY_DENTS);
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
@@ -63,7 +63,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
sizeof(struct ino_entry)) >> PAGE_CACHE_SHIFT;
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
} else {
-   if (sbi->sb->s_bdi->dirty_exceeded)
+   if (sbi->sb->s_bdi->wb.dirty_exceeded)
return false;
}
return res;
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
index 7fd3511..3a5bfcf 100644
--- a/fs/f2fs/segment.h
+++ b/fs/f2fs/segment.h
@@ -712,7 +712,7 @@ static inline unsigned int max_hw_blocks(struct 
f2fs_sb_info *sbi)
  */
 static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type)
 {
-   if (sbi->sb->s_bdi->dirty_exceeded)
+   if (sbi->sb->s_bdi->wb.dirty_exceeded)
return 0;
 
if (type == DATA)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 992a065..4fcf2385 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -606,7 +606,7 @@ out:
return ret;
 }
 
-static long writeback_chunk_size(struct backing_dev_info *bdi,
+static long writeback_chunk_size(struct bdi_writeback *wb,
 struct wb_writeback_work *work)
 {
long pages;
@@ -627,7 +627,7 @@ static long writeback_chunk_size(struct backing_dev_info 
*bdi,
if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
pages = LONG_MAX;
else {
-   pages = min(bdi->avg_write_bandwidth / 2,
+   pages = min(wb->avg_write_bandwidth / 2,
global_dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
pages = round_down(pages + MIN_WRITEBACK_PAGES,
@@ -725,7 +725,7 @@ static long writeback_sb_inodes(struct super_block *sb,
inode->i_state |= I_SYNC;
spin_unlock(&inode->i_lock);
 
-   write_chunk = writeback_chunk_size(wb->bdi, work);
+   write_chunk = writeback_chunk_size(wb, work);
wbc.nr_to_write = write_chunk;

[PATCH 09/48] memcg: implement mem_cgroup_css_from_page()

2015-03-22 Thread Tejun Heo
Implement mem_cgroup_css_from_page() which returns the
cgroup_subsys_state of the memcg associated with a given page.  This
will be used by cgroup writeback support.
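
A minimal sketch of how a writeback path might consume this
(illustrative; the real consumers arrive with the later patches):

	/* illustrative only: attribute a dirtying page to its memcg */
	static void example_note_page_owner(struct page *page)
	{
		struct cgroup_subsys_state *memcg_css;

		memcg_css = mem_cgroup_css_from_page(page);	/* never NULL */
		/* memcg_css->id can then key the lookup of a per-memcg wb */
	}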

Signed-off-by: Tejun Heo 
Cc: Johannes Weiner 
Cc: Michal Hocko 
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c| 14 ++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 294498f..637ef62 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -115,6 +115,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 }
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
 
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
   struct mem_cgroup *,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fda7025..74241b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -591,6 +591,20 @@ struct cgroup_subsys_state *mem_cgroup_css(struct 
mem_cgroup *memcg)
return &memcg->css;
 }
 
+/**
+ * mem_cgroup_css_from_page - css of the memcg associated with a page
+ * @page: page of interest
+ *
+ * This function is guaranteed to return a valid cgroup_subsys_state and
+ * the returned css remains accessible until @page is released.
+ */
+struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
+{
+   if (page->mem_cgroup)
+   return &page->mem_cgroup->css;
+   return &root_mem_cgroup->css;
+}
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 {
-- 
2.1.0



[PATCH 13/48] writeback: s/bdi/wb/ in mm/page-writeback.c

2015-03-22 Thread Tejun Heo
Writeback operations will now be per wb (bdi_writeback) instead of
bdi.  Replace the relevant bdi references in symbol names and comments
with wb.  This patch is purely cosmetic and doesn't make any
functional changes.

Signed-off-by: Tejun Heo 
Cc: Wu Fengguang 
Cc: Jan Kara 
Cc: Jens Axboe 
---
 mm/page-writeback.c | 270 ++--
 1 file changed, 134 insertions(+), 136 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 29fb4f3..c615a15 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -595,7 +595,7 @@ static long long pos_ratio_polynom(unsigned long setpoint,
  *
  * (o) global/bdi setpoints
  *
- * We want the dirty pages be balanced around the global/bdi setpoints.
+ * We want the dirty pages be balanced around the global/wb setpoints.
  * When the number of dirty pages is higher/lower than the setpoint, the
  * dirty position control ratio (and hence task dirty ratelimit) will be
  * decreased/increased to bring the dirty pages back to the setpoint.
@@ -605,8 +605,8 @@ static long long pos_ratio_polynom(unsigned long setpoint,
  * if (dirty < setpoint) scale up   pos_ratio
  * if (dirty > setpoint) scale down pos_ratio
  *
- * if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
- * if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ * if (wb_dirty < wb_setpoint) scale up   pos_ratio
+ * if (wb_dirty > wb_setpoint) scale down pos_ratio
  *
  * task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
  *
@@ -631,7 +631,7 @@ static long long pos_ratio_polynom(unsigned long setpoint,
  *   0 +.--.--*->
  *   freerun^  setpoint^ limit^   dirty pages
  *
- * (o) bdi control line
+ * (o) wb control line
  *
  * ^ pos_ratio
  * |
@@ -657,27 +657,27 @@ static long long pos_ratio_polynom(unsigned long setpoint,
  * |  .   .
  * |  . .
  *   0 +--.---.->
- *bdi_setpoint^x_intercept^
+ *wb_setpoint^x_intercept^
  *
- * The bdi control line won't drop below pos_ratio=1/4, so that bdi_dirty can
+ * The wb control line won't drop below pos_ratio=1/4, so that wb_dirty can
  * be smoothly throttled down to normal if it starts high in situations like
  * - start writing to a slow SD card and a fast disk at the same time. The SD
- *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
- * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ *   card's wb_dirty may rush to many times higher than wb_setpoint.
+ * - the wb dirty thresh drops quickly due to change of JBOD workload
  */
 static unsigned long wb_position_ratio(struct bdi_writeback *wb,
   unsigned long thresh,
   unsigned long bg_thresh,
   unsigned long dirty,
-  unsigned long bdi_thresh,
-  unsigned long bdi_dirty)
+  unsigned long wb_thresh,
+  unsigned long wb_dirty)
 {
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
unsigned long x_intercept;
unsigned long setpoint; /* dirty pages' target balance point */
-   unsigned long bdi_setpoint;
+   unsigned long wb_setpoint;
unsigned long span;
long long pos_ratio;/* for scaling up/down the rate limit */
long x;
@@ -696,146 +696,145 @@ static unsigned long wb_position_ratio(struct 
bdi_writeback *wb,
/*
 * The strictlimit feature is a tool preventing mistrusted filesystems
 * from growing a large number of dirty pages before throttling. For
-* such filesystems balance_dirty_pages always checks bdi counters
-* against bdi limits. Even if global "nr_dirty" is under "freerun".
+* such filesystems balance_dirty_pages always checks wb counters
+* against wb limits. Even if global "nr_dirty" is under "freerun".
 * This is especially important for fuse which sets bdi->max_ratio to
 * 1% by default. Without strictlimit feature, fuse writeback may
 * consume arbitrary amount of RAM because it is accounted in
 * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
 *
 * Here, in wb_position_ratio(), we calculate pos_ratio based on
-* two values: bdi_dirty and bdi_thresh. Let's consider an example:
+* two values: wb_dirty and wb_thresh. Let's consider an example:
  

[PATCH 16/48] writeback: separate out include/linux/backing-dev-defs.h

2015-03-22 Thread Tejun Heo
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.
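
The resulting include layering, sketched for clarity (not taken
verbatim from the tree):

	/* include/linux/blkdev.h: only the lightweight definitions */
	#include <linux/backing-dev-defs.h>

	/* a .c file that pokes at bdi internals includes the full header
	 * directly instead of getting it through blkdev.h */
	#include <linux/backing-dev.h>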

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
---
 block/blk-integrity.c|   1 +
 block/blk-sysfs.c|   1 +
 block/bounce.c   |   1 +
 block/genhd.c|   1 +
 drivers/block/drbd/drbd_int.h|   1 +
 drivers/block/pktcdvd.c  |   1 +
 drivers/char/raw.c   |   1 +
 drivers/md/bcache/request.c  |   1 +
 drivers/md/dm.h  |   1 +
 drivers/md/md.h  |   1 +
 drivers/mtd/devices/block2mtd.c  |   1 +
 fs/block_dev.c   |   1 +
 fs/ext4/extents.c|   1 +
 fs/ext4/mballoc.c|   1 +
 fs/ext4/super.c  |   1 +
 fs/f2fs/segment.h|   1 +
 fs/hfs/super.c   |   1 +
 fs/hfsplus/super.c   |   1 +
 fs/nfs/filelayout/filelayout.c   |   1 +
 fs/ocfs2/file.c  |   1 +
 fs/reiserfs/super.c  |   1 +
 fs/ufs/super.c   |   1 +
 fs/xfs/xfs_file.c|   1 +
 include/linux/backing-dev-defs.h | 106 +++
 include/linux/backing-dev.h  | 102 +
 include/linux/blkdev.h   |   2 +-
 mm/madvise.c |   1 +
 27 files changed, 132 insertions(+), 102 deletions(-)
 create mode 100644 include/linux/backing-dev-defs.h

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 79ffb48..f548b64 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -21,6 +21,7 @@
  */
 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5677eb7..1b60941 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/block/bounce.c b/block/bounce.c
index ab21ba2..c616a60 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/block/genhd.c b/block/genhd.c
index 0a536dc..d46ba56 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index b905e98..efd19c2 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 09e628da..4c20c22 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -61,6 +61,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index 6e29bf2..ee47e59 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ab43fad..9c083b9 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 
 #include 
 
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 59f53e7..ae4a3ca 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 318ca8f..641abb5 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -16,6 +16,7 @@
 #define _MD_MD_H
 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 66f0405..e22e40f 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..e4f5f71 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include 
 #include 
 #include 
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index bed4308..21a7bcb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include <linux/backing-dev.h>
 #include "ext4_jbd2.h"
 #include "ext4_extents.h"
 #include "xattr.h"
diff --git 

[PATCH 15/48] writeback: reorganize mm/backing-dev.c

2015-03-22 Thread Tejun Heo
Move wb_shutdown(), bdi_register(), bdi_register_dev(),
bdi_prune_sb(), bdi_remove_from_list() and bdi_unregister() so that
init / exit functions are grouped together.  This will make updating
init / exit paths for cgroup writeback support easier.

This is pure source file reorganization.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
---
 mm/backing-dev.c | 174 +++
 1 file changed, 87 insertions(+), 87 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 597f0ce..ff85ecb 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -286,93 +286,6 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
 }
 
 /*
- * Remove bdi from bdi_list, and ensure that it is no longer visible
- */
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
-{
-   spin_lock_bh(&bdi_lock);
-   list_del_rcu(&bdi->bdi_list);
-   spin_unlock_bh(&bdi_lock);
-
-   synchronize_rcu_expedited();
-}
-
-int bdi_register(struct backing_dev_info *bdi, struct device *parent,
-   const char *fmt, ...)
-{
-   va_list args;
-   struct device *dev;
-
-   if (bdi->dev)   /* The driver needs to use separate queues per device */
-   return 0;
-
-   va_start(args, fmt);
-   dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, 
args);
-   va_end(args);
-   if (IS_ERR(dev))
-   return PTR_ERR(dev);
-
-   bdi->dev = dev;
-
-   bdi_debug_register(bdi, dev_name(dev));
-   set_bit(WB_registered, &bdi->wb.state);
-
-   spin_lock_bh(&bdi_lock);
-   list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
-   spin_unlock_bh(&bdi_lock);
-
-   trace_writeback_bdi_register(bdi);
-   return 0;
-}
-EXPORT_SYMBOL(bdi_register);
-
-int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
-{
-   return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev));
-}
-EXPORT_SYMBOL(bdi_register_dev);
-
-/*
- * Remove bdi from the global list and shutdown any threads we have running
- */
-static void wb_shutdown(struct bdi_writeback *wb)
-{
-   /* Make sure nobody queues further work */
-   spin_lock_bh(&wb->work_lock);
-   if (!test_and_clear_bit(WB_registered, &wb->state)) {
-   spin_unlock_bh(&wb->work_lock);
-   return;
-   }
-   spin_unlock_bh(&wb->work_lock);
-
-   /*
-* Drain work list and shutdown the delayed_work.  !WB_registered
-* tells wb_workfn() that @wb is dying and its work_list needs to
-* be drained no matter what.
-*/
-   mod_delayed_work(bdi_wq, &wb->dwork, 0);
-   flush_delayed_work(&wb->dwork);
-   WARN_ON(!list_empty(&wb->work_list));
-}
-
-/*
- * Called when the device behind @bdi has been removed or ejected.
- *
- * We can't really do much here except for reducing the dirty ratio at
- * the moment.  In the future we should be able to set a flag so that
- * the filesystem can handle errors at mark_inode_dirty time instead
- * of only at writeback time.
- */
-void bdi_unregister(struct backing_dev_info *bdi)
-{
-   if (WARN_ON_ONCE(!bdi->dev))
-   return;
-
-   bdi_set_min_ratio(bdi, 0);
-}
-EXPORT_SYMBOL(bdi_unregister);
-
-/*
  * Initial write bandwidth: 100 MB/s
  */
 #define INIT_BW(100 << (20 - PAGE_SHIFT))
@@ -418,6 +331,29 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi)
return 0;
 }
 
+/*
+ * Remove bdi from the global list and shutdown any threads we have running
+ */
+static void wb_shutdown(struct bdi_writeback *wb)
+{
+   /* Make sure nobody queues further work */
+   spin_lock_bh(&wb->work_lock);
+   if (!test_and_clear_bit(WB_registered, &wb->state)) {
+   spin_unlock_bh(&wb->work_lock);
+   return;
+   }
+   spin_unlock_bh(&wb->work_lock);
+
+   /*
+* Drain work list and shutdown the delayed_work.  !WB_registered
+* tells wb_workfn() that @wb is dying and its work_list needs to
+* be drained no matter what.
+*/
+   mod_delayed_work(bdi_wq, &wb->dwork, 0);
+   flush_delayed_work(&wb->dwork);
+   WARN_ON(!list_empty(&wb->work_list));
+}
+
 static void wb_exit(struct bdi_writeback *wb)
 {
int i;
@@ -449,6 +385,70 @@ int bdi_init(struct backing_dev_info *bdi)
 }
 EXPORT_SYMBOL(bdi_init);
 
+int bdi_register(struct backing_dev_info *bdi, struct device *parent,
+   const char *fmt, ...)
+{
+   va_list args;
+   struct device *dev;
+
+   if (bdi->dev)   /* The driver needs to use separate queues per device */
+   return 0;
+
+   va_start(args, fmt);
+   dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, 
args);
+   va_end(args);
+   if (IS_ERR(dev))
+   return PTR_ERR(dev);
+
+   bdi->dev = dev;
+
+   bdi_debug_register(bdi, dev_name(dev));
+   set_bit(WB_registered, &bdi->wb.state);
+
+   spin_lock_bh(&bdi_lock);
+   list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+ 

[PATCH 14/48] writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback

2015-03-22 Thread Tejun Heo
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->wb_lock and ->worklist into wb.

* The lock protects bdi->worklist and bdi->wb.dwork scheduling.  While
  moving, rename it to wb->work_lock as wb->wb_lock is confusing.
  Also, move wb->dwork downwards so that it's colocated with the new
  ->work_lock and ->work_list fields.

* bdi_writeback_workfn()-> wb_workfn()
  bdi_wakeup_thread_delayed(bdi)-> wb_wakeup_delayed(wb)
  bdi_wakeup_thread(bdi)-> wb_wakeup(wb)
  bdi_queue_work(bdi, ...)  -> wb_queue_work(wb, ...)
  __bdi_start_writeback(bdi, ...)   -> __wb_start_writeback(wb, ...)
  get_next_work_item(bdi)   -> get_next_work_item(wb)

* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
  The function contained parts which belong to the containing bdi
  rather than the wb itself - testing cap_writeback_dirty and
  bdi_remove_from_list() invocation.  Those are moved to
  bdi_unregister().

* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
  Initializations of the moved bdi->wb_lock and ->work_list are
  relocated from bdi_init() to wb_init().

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
---
 fs/fs-writeback.c   | 84 +
 include/linux/backing-dev.h | 12 +++
 mm/backing-dev.c| 59 +++
 3 files changed, 74 insertions(+), 81 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4fcf2385..7c2f0bd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -97,34 +97,33 @@ static inline struct inode *wb_inode(struct list_head *head)
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
 
-static void bdi_wakeup_thread(struct backing_dev_info *bdi)
+static void wb_wakeup(struct bdi_writeback *wb)
 {
-   spin_lock_bh(&bdi->wb_lock);
-   if (test_bit(WB_registered, &bdi->wb.state))
-   mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
-   spin_unlock_bh(&bdi->wb_lock);
+   spin_lock_bh(&wb->work_lock);
+   if (test_bit(WB_registered, &wb->state))
+   mod_delayed_work(bdi_wq, &wb->dwork, 0);
+   spin_unlock_bh(&wb->work_lock);
 }
 
-static void bdi_queue_work(struct backing_dev_info *bdi,
-  struct wb_writeback_work *work)
+static void wb_queue_work(struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
 {
-   trace_writeback_queue(bdi, work);
+   trace_writeback_queue(wb->bdi, work);
 
-   spin_lock_bh(&bdi->wb_lock);
-   if (!test_bit(WB_registered, &bdi->wb.state)) {
+   spin_lock_bh(&wb->work_lock);
+   if (!test_bit(WB_registered, &wb->state)) {
    if (work->done)
    complete(work->done);
    goto out_unlock;
    }
-   list_add_tail(&work->list, &bdi->work_list);
-   mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+   list_add_tail(&work->list, &wb->work_list);
+   mod_delayed_work(bdi_wq, &wb->dwork, 0);
 out_unlock:
-   spin_unlock_bh(&bdi->wb_lock);
+   spin_unlock_bh(&wb->work_lock);
 }
 
-static void
-__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
- bool range_cyclic, enum wb_reason reason)
+static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+bool range_cyclic, enum wb_reason reason)
 {
struct wb_writeback_work *work;
 
@@ -134,8 +133,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long 
nr_pages,
 */
work = kzalloc(sizeof(*work), GFP_ATOMIC);
if (!work) {
-   trace_writeback_nowork(bdi);
-   bdi_wakeup_thread(bdi);
+   trace_writeback_nowork(wb->bdi);
+   wb_wakeup(wb);
return;
}
 
@@ -144,7 +143,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long 
nr_pages,
work->range_cyclic = range_cyclic;
work->reason= reason;
 
-   bdi_queue_work(bdi, work);
+   wb_queue_work(wb, work);
 }
 
 /**
@@ -162,7 +161,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long 
nr_pages,
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason)
 {
-   __bdi_start_writeback(bdi, nr_pages, true, reason);
+   __wb_start_writeback(&bdi->wb, nr_pages, true, reason);
 }
 
 /**
@@ -182,7 +181,7 @@ void bdi_start_background_writeback(struct backing_dev_info 
*bdi)
 * writeback as soon as there is 

[PATCH 07/48] blkcg: implement task_get_blkcg_css()

2015-03-22 Thread Tejun Heo
Implement a wrapper around task_get_css() to acquire the blkcg css for
a given task.  The wrapper is necessary for cgroup writeback support
as there will be places outside blkcg proper trying to acquire
blkcg_css and blkio_cgrp_id will be undefined when !CONFIG_BLK_CGROUP.
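
A sketch of the kind of caller this enables (illustrative), which
builds the same way whether or not CONFIG_BLK_CGROUP is set:

	/* illustrative only: generic code never mentions blkio_cgrp_id */
	static struct cgroup_subsys_state *example_pin_blkcg(struct task_struct *task)
	{
		return task_get_blkcg_css(task);  /* NULL when blkcg is disabled */
	}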

Signed-off-by: Tejun Heo 
---
 include/linux/blk-cgroup.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 65f0c17..4dc643f 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -195,6 +195,12 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
return task_blkcg(current);
 }
 
+static inline struct cgroup_subsys_state *
+task_get_blkcg_css(struct task_struct *task)
+{
+   return task_get_css(task, blkio_cgrp_id);
+}
+
 /**
  * blkcg_parent - get the parent of a blkcg
  * @blkcg: blkcg of interest
@@ -573,6 +579,12 @@ struct blkcg_policy {
 
 #define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
 
+static inline struct cgroup_subsys_state *
+task_get_blkcg_css(struct task_struct *task)
+{
+   return NULL;
+}
+
 #ifdef CONFIG_BLOCK
 
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { 
return NULL; }
-- 
2.1.0



[PATCH 20/48] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK

2015-03-22 Thread Tejun Heo
cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

inode_cgwb_enabled(), which determines whether both the bdi and the
filesystem of a given inode support cgroup writeback, is added.
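
A hypothetical filesystem opt-in would look like the sketch below; no
filesystem sets the flag in this patch:

	/* illustrative only: a block based filesystem advertising support */
	static struct file_system_type example_fs_type = {
		.owner		= THIS_MODULE,
		.name		= "examplefs",
		.mount		= example_mount,	/* hypothetical */
		.kill_sb	= kill_block_super,
		.fs_flags	= FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
	};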

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 block/blk-core.c|  2 +-
 include/linux/backing-dev.h | 32 +++-
 include/linux/fs.h  |  1 +
 init/Kconfig|  5 +
 4 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fa1314e..c44018a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -606,7 +606,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 
q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-   q->backing_dev_info.capabilities = 0;
+   q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->backing_dev_info.name = "block";
q->node = node_id;
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bfdaa18..6bb3123 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -134,12 +134,15 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, 
unsigned int max_ratio);
  * BDI_CAP_NO_WRITEBACK:   Don't write pages back
  * BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages
  * BDI_CAP_STRICTLIMIT:Keep number of dirty pages below bdi threshold.
+ *
+ * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
  */
 #define BDI_CAP_NO_ACCT_DIRTY  0x0001
 #define BDI_CAP_NO_WRITEBACK   0x0002
 #define BDI_CAP_NO_ACCT_WB 0x0004
 #define BDI_CAP_STABLE_WRITES  0x0008
 #define BDI_CAP_STRICTLIMIT0x0010
+#define BDI_CAP_CGROUP_WRITEBACK 0x0020
 
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -229,4 +232,31 @@ static inline int bdi_sched_wait(void *word)
return 0;
 }
 
-#endif /* _LINUX_BACKING_DEV_H */
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
+ * @inode: inode of interest
+ *
+ * cgroup writeback requires support from both the bdi and filesystem.
+ * Test whether @inode has both.
+ */
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+   return bdi_cap_account_dirty(bdi) &&
+   (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
+   (inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
+}
+
+#else  /* CONFIG_CGROUP_WRITEBACK */
+
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+   return false;
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
+#endif /* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ccf4b64..bc72737 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1862,6 +1862,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16	/* A userns mount does not imply MNT_NODEV */
+#define FS_CGROUP_WRITEBACK	32	/* Supports cgroup-aware writeback */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..9f17798 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1132,6 +1132,11 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.
 
+config CGROUP_WRITEBACK
+   bool
+   depends on MEMCG && BLK_CGROUP
+   default y
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
-- 
2.1.0



[PATCH 17/48] bdi: make inode_to_bdi() inline

2015-03-22 Thread Tejun Heo
Now that bdi definitions are moved to backing-dev-defs.h,
backing-dev.h can include blkdev.h and inline inode_to_bdi() without
worrying about introducing circular include dependency.  The function
gets called from hot paths and is fairly trivial.

This patch makes inode_to_bdi(), and the sb_is_blkdev_sb() it calls,
inline.  blockdev_superblock and noop_backing_dev_info
are EXPORT_GPL'd to allow the inline functions to be used from
modules.

While at it, make sb_is_blkdev_sb() return bool instead of int.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
---
 fs/block_dev.c  |  8 ++--
 fs/fs-writeback.c   | 16 
 include/linux/backing-dev.h | 18 --
 include/linux/fs.h  |  8 +++-
 mm/backing-dev.c|  1 +
 5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e4f5f71..875d41a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -549,7 +549,8 @@ static struct file_system_type bd_type = {
.kill_sb= kill_anon_super,
 };
 
-static struct super_block *blockdev_superblock __read_mostly;
+struct super_block *blockdev_superblock __read_mostly;
+EXPORT_SYMBOL_GPL(blockdev_superblock);
 
 void __init bdev_cache_init(void)
 {
@@ -690,11 +691,6 @@ static struct block_device *bd_acquire(struct inode *inode)
return bdev;
 }
 
-int sb_is_blkdev_sb(struct super_block *sb)
-{
-   return sb == blockdev_superblock;
-}
-
 /* Call when you free inode */
 
 void bd_forget(struct inode *inode)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7c2f0bd..4fd264d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -66,22 +66,6 @@ int writeback_in_progress(struct backing_dev_info *bdi)
 }
 EXPORT_SYMBOL(writeback_in_progress);
 
-struct backing_dev_info *inode_to_bdi(struct inode *inode)
-{
-   struct super_block *sb;
-
-   if (!inode)
-   return &noop_backing_dev_info;
-
-   sb = inode->i_sb;
-#ifdef CONFIG_BLOCK
-   if (sb_is_blkdev_sb(sb))
-   return blk_get_backing_dev_info(I_BDEV(inode));
-#endif
-   return sb->s_bdi;
-}
-EXPORT_SYMBOL_GPL(inode_to_bdi);
-
 static inline struct inode *wb_inode(struct list_head *head)
 {
return list_entry(head, struct inode, i_wb_list);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5e39f7a..7857820 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -11,11 +11,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
-struct backing_dev_info *inode_to_bdi(struct inode *inode);
-
 int __must_check bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
@@ -149,6 +148,21 @@ extern struct backing_dev_info noop_backing_dev_info;
 
 int writeback_in_progress(struct backing_dev_info *bdi);
 
+static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+{
+   struct super_block *sb;
+
+   if (!inode)
+   return &noop_backing_dev_info;
+
+   sb = inode->i_sb;
+#ifdef CONFIG_BLOCK
+   if (sb_is_blkdev_sb(sb))
+   return blk_get_backing_dev_info(I_BDEV(inode));
+#endif
+   return sb->s_bdi;
+}
+
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
if (bdi->congested_fn)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..ccf4b64 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2205,7 +2205,13 @@ extern struct super_block *freeze_bdev(struct 
block_device *);
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
-extern int sb_is_blkdev_sb(struct super_block *sb);
+
+extern struct super_block *blockdev_superblock;
+
+static inline bool sb_is_blkdev_sb(struct super_block *sb)
+{
+   return sb == blockdev_superblock;
+}
 #else
 static inline void bd_forget(struct inode *inode) {}
 static inline int sync_blockdev(struct block_device *bdev) { return 0; }
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ff85ecb..b0707d1 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -18,6 +18,7 @@ struct backing_dev_info noop_backing_dev_info = {
.name   = "noop",
.capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
+EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 
 static struct class *bdi_class;
 
-- 
2.1.0



[PATCH 19/48] bdi: separate out congested state into a separate struct

2015-03-22 Thread Tejun Heo
Currently, a wb's (bdi_writeback) congestion state is carried in its
->state field; however, cgroup writeback support will require multiple
wb's sharing the same congestion state.  This patch separates out
congestion state into its own struct - struct bdi_writeback_congested.
A new field wb field, wb_congested, points to its associated congested
struct.  The default wb, bdi->wb, always points to bdi->wb_congested.

While this patch adds a layer of indirection, it doesn't introduce any
behavior changes.
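
Schematically (sketch only, using the names introduced by this patch):

	/* every wb dereferences its shared congested state */
	static bool example_wb_sync_congested(struct bdi_writeback *wb)
	{
		/* multiple wb's of a bdi may point at the same struct */
		return test_bit(WB_sync_congested, &wb->congested->state);
	}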

Signed-off-by: Tejun Heo 
---
 include/linux/backing-dev-defs.h | 14 --
 include/linux/backing-dev.h  |  2 +-
 mm/backing-dev.c |  7 +--
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index aa18c4b..9e9eafa 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -16,12 +16,15 @@ struct dentry;
  * Bits in bdi_writeback.state
  */
 enum wb_state {
-   WB_async_congested, /* The async (write) queue is getting full */
-   WB_sync_congested,  /* The sync queue is getting full */
WB_registered,  /* bdi_register() was done */
WB_writeback_running,   /* Writeback is in progress */
 };
 
+enum wb_congested_state {
+   WB_async_congested, /* The async (write) queue is getting full */
+   WB_sync_congested,  /* The sync queue is getting full */
+};
+
 typedef int (congested_fn)(void *, int);
 
 enum wb_stat_item {
@@ -34,6 +37,10 @@ enum wb_stat_item {
 
 #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+struct bdi_writeback_congested {
+   unsigned long state;/* WB_[a]sync_congested flags */
+};
+
 struct bdi_writeback {
struct backing_dev_info *bdi;   /* our parent bdi */
 
@@ -48,6 +55,8 @@ struct bdi_writeback {
 
struct percpu_counter stat[NR_WB_STAT_ITEMS];
 
+   struct bdi_writeback_congested *congested;
+
unsigned long bw_time_stamp;/* last time write bw is updated */
unsigned long dirtied_stamp;
unsigned long written_stamp;/* pages written at bw_time_stamp */
@@ -84,6 +93,7 @@ struct backing_dev_info {
unsigned int max_ratio, max_prop_frac;
 
struct bdi_writeback wb;  /* default writeback info for this bdi */
+   struct bdi_writeback_congested wb_congested;
 
struct device *dev;
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 7857820..bfdaa18 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,7 +167,7 @@ static inline int bdi_congested(struct backing_dev_info 
*bdi, int bdi_bits)
 {
if (bdi->congested_fn)
return bdi->congested_fn(bdi->congested_data, bdi_bits);
-   return (bdi->wb.state & bdi_bits);
+   return (bdi->wb.congested->state & bdi_bits);
 }
 
 static inline int bdi_read_congested(struct backing_dev_info *bdi)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 805b287..5ec7658 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -383,6 +383,9 @@ int bdi_init(struct backing_dev_info *bdi)
if (err)
return err;
 
+   bdi->wb_congested.state = 0;
+   bdi->wb.congested = &bdi->wb_congested;
+
return 0;
 }
 EXPORT_SYMBOL(bdi_init);
@@ -504,7 +507,7 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int 
sync)
wait_queue_head_t *wqh = &congestion_wqh[sync];
 
bit = sync ? WB_sync_congested : WB_async_congested;
-   if (test_and_clear_bit(bit, &bdi->wb.state))
+   if (test_and_clear_bit(bit, &bdi->wb.congested->state))
    atomic_dec(&nr_bdi_congested[sync]);
smp_mb__after_atomic();
if (waitqueue_active(wqh))
@@ -517,7 +520,7 @@ void set_bdi_congested(struct backing_dev_info *bdi, int 
sync)
enum wb_state bit;
 
bit = sync ? WB_sync_congested : WB_async_congested;
-   if (!test_and_set_bit(bit, &bdi->wb.state))
+   if (!test_and_set_bit(bit, &bdi->wb.congested->state))
    atomic_inc(&nr_bdi_congested[sync]);
 }
 EXPORT_SYMBOL(set_bdi_congested);
-- 
2.1.0



[PATCHSET 2/3 block/for-4.1/core] writeback: cgroup writeback backpressure propagation

2015-03-22 Thread Tejun Heo
Hello,

While the previous patchset[2] implemented cgroup writeback support,
the IO back pressure propagation mechanism implemented in
balance_dirty_pages() and its subroutines isn't yet aware of cgroup
writeback.

Processes belonging to a memcg may have access to only a subset of the
total memory available in the system.  Not factoring this into dirty
throttling rendered it completely ineffective for processes under
memcg limits, and memcg ended up building a separate ad-hoc degenerate
mechanism directly into vmscan code to limit page dirtying.

This patchset refactors the dirty throttling logic implemented in
balance_dirty_pages() and its subroutines so that it can handle both
global and memcg memory domains.  The dirty throttling mechanism is
applied against both the global and memcg constraints and the more
restrictive of the two is used for actual throttling.
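
In rough pseudocode (a sketch of the idea, not the actual
implementation), the throttle applied to a dirtier becomes:

	/* sketch: the lower (more restrictive) position ratio wins */
	static unsigned long effective_pos_ratio(unsigned long global_pos_ratio,
						 unsigned long memcg_pos_ratio,
						 bool in_memcg_domain)
	{
		if (!in_memcg_domain)
			return global_pos_ratio;
		return min(global_pos_ratio, memcg_pos_ratio);
	}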

This makes the dirty throttling mechanism operational for memcg
domains including writeback-bandwidth-proportional dirty page
distribution inside them.

This patchset contains the following 18 patches.

 0001-memcg-make-mem_cgroup_read_-stat-event-iterate-possi.patch
 0002-writeback-reorganize-__-wb_update_bandwidth.patch
 0003-writeback-implement-wb_domain.patch
 0004-writeback-move-global_dirty_limit-into-wb_domain.patch
 0005-writeback-consolidate-dirty-throttle-parameters-into.patch
 0006-writeback-add-dirty_throttle_control-wb_bg_thresh.patch
 0007-writeback-make-__wb_dirty_limit-take-dirty_throttle_.patch
 0008-writeback-add-dirty_throttle_control-pos_ratio.patch
 0009-writeback-add-dirty_throttle_control-wb_completions.patch
 0010-writeback-add-dirty_throttle_control-dom.patch
 0011-writeback-make-__wb_writeout_inc-and-hard_dirty_limi.patch
 0012-writeback-separate-out-domain_dirty_limits.patch
 0013-writeback-move-over_bground_thresh-to-mm-page-writeb.patch
 0014-writeback-update-wb_over_bg_thresh-to-use-wb_domain-.patch
 0015-writeback-implement-memcg-wb_domain.patch
 0016-writeback-reset-wb_domain-dirty_limit-_tstmp-when-me.patch
 0017-writeback-implement-memcg-writeback-domain-based-thr.patch
 0018-mm-vmscan-remove-memcg-stalling-on-writeback-pages-d.patch

0001-0002 are prep patches.

0003-0014 refactors dirty throttling logic so that it operates on
wb_domain.

0015-0018 implement memcg wb_domain.

This patchset is on top of

  block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() 
if __GFP_WAIT isn't set")
+ [1] [PATCH] writeback: fix possible underflow in write bandwidth calculation
+ [2] [PATCHSET 1/3 v2 block/for-4.1/core] writeback: cgroup writeback support

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 
review-cgroup-writeback-backpressure-20150322

diffstat follows.  Thanks.

 fs/fs-writeback.c|   32 -
 include/linux/backing-dev-defs.h |1 
 include/linux/memcontrol.h   |   21 +
 include/linux/writeback.h|   82 +++-
 include/trace/events/writeback.h |7 
 mm/backing-dev.c |9 
 mm/memcontrol.c  |  145 +--
 mm/page-writeback.c  |  722 +--
 mm/vmscan.c  |  109 +
 9 files changed, 716 insertions(+), 412 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/1420579582-8516-1-git-send-email...@kernel.org
[1] http://lkml.kernel.org/g/20150323041848.ga8...@htj.duckdns.org
[2] http://lkml.kernel.org/g/1427086499-15657-1-git-send-email...@kernel.org


[PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

2015-03-22 Thread Tejun Heo
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback).  This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg.  This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg.  This means that there may be multiple
wb's of a bdi mapped to the same blkcg.  As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state.  This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
by its memcg id.  Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.

v2: Updated so that wb association is per inode and wb is per memcg
rather than blkcg.
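
A rough sketch of the association path described above (using the
helpers this patch introduces; error handling and locking elided):

	/* sketch only: first dirtying attaches the inode to a cgwb */
	static void example_mark_dirty(struct inode *inode)
	{
		/* pick (or create) the wb matching current's memcg on this
		 * bdi and remember it; a NULL page means "use current" */
		inode_attach_wb(inode, NULL);

		/* later writeback for the inode is issued through that wb */
		/* struct bdi_writeback *wb = inode_to_wb(inode); */
	}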

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 block/blk-cgroup.c   |   7 +-
 fs/fs-writeback.c|   8 +-
 fs/inode.c   |   1 +
 include/linux/backing-dev-defs.h |  59 +-
 include/linux/backing-dev.h  | 195 +++
 include/linux/blk-cgroup.h   |   4 +
 include/linux/fs.h   |   4 +
 include/linux/memcontrol.h   |   4 +
 mm/backing-dev.c | 398 +++
 mm/memcontrol.c  |  19 +-
 mm/page-writeback.c  |  11 +-
 11 files changed, 699 insertions(+), 11 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9e0fe38..d2b1cbf 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -811,6 +812,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state 
*css)
}
 
spin_unlock_irq(&blkcg->lock);
+
+   wb_blkcg_offline(blkcg);
 }
 
 static void blkcg_css_free(struct cgroup_subsys_state *css)
@@ -841,7 +844,9 @@ done:
spin_lock_init(&blkcg->lock);
INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
INIT_HLIST_HEAD(&blkcg->blkg_list);
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+   INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
return &blkcg->css;
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4fd264d..48db5e6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -173,11 +173,11 @@ void bdi_start_background_writeback(struct 
backing_dev_info *bdi)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-   struct backing_dev_info *bdi = inode_to_bdi(inode);
+   struct bdi_writeback *wb = inode_to_wb(inode);
 
-   spin_lock(&bdi->wb.list_lock);
+   spin_lock(&wb->list_lock);
    list_del_init(&inode->i_wb_list);
-   spin_unlock(&bdi->wb.list_lock);
+   spin_unlock(&wb->list_lock);
 }
 
 /*
@@ -1200,6 +1200,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
 
+   inode_attach_wb(inode, NULL);
+
if (flags & I_DIRTY_INODE)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
diff --git a/fs/inode.c b/fs/inode.c
index f00b16f..55cedf8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -223,6 +223,7 @@ EXPORT_SYMBOL(free_inode_nonrcu);
 void __destroy_inode(struct inode *inode)
 {
BUG_ON(inode_has_buffers(inode));
+   inode_detach_wb(inode);
security_inode_free(inode);
fsnotify_inode_delete(inode);
locks_free_lock_context(inode->i_flctx);
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 9e9eafa..a1e9c40 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -2,8 +2,11 @@
 #define __LINUX_BACKING_DEV_DEFS_H
 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -37,10 +40,43 @@ enum wb_stat_item {
 
 #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+/*
+ * For cgroup writeback, multiple wb's may map to the same blkcg.  Those
+ * wb's can operate mostly independently but should share the congested
+ * state.  To facilitate such sharing, the congested state is tracked using
+ * the following struct which is created on demand, indexed by blkcg ID on
+ * its bdi, and refcounted.
+ */
 struct bdi_writeback_congested {
unsigned long state;/* WB_[a]sync_congested flags */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+   struct backing_dev_info *bdi;   /* the associated bdi */
+ 

[PATCH 03/18] writeback: implement wb_domain

2015-03-22 Thread Tejun Heo
Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportionately sized portion of the memory in the
whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg.  IOW, a wb will belong to two writeback domains - the global
and memcg domains.

Currently, what constitutes the global writeback domain are scattered
across a number of global states.  This patch starts collecting them
into struct wb_domain.

* fprop_global which serves as the basis for proportional bandwidth
  measurement and its period timer are moved into struct wb_domain.

* global_wb_domain hosts the states for the global domain.

* While at it, flatten wb_writeout_fraction() into its callers.  This
  thin wrapper doesn't provide any actual benefit and only gets in
  the way.

This is pure reorganization and doesn't introduce any behavioral
changes.
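
For reference, a consumer of the new interface would do roughly the
following (a sketch, not part of this patch; the extra domain is
hypothetical -- per-memcg domains only appear later in the series):

  static struct wb_domain example_domain;

  static int example_domain_setup(void)
  {
          /* sets up the fprop state and the completion aging timer */
          return wb_domain_init(&example_domain, GFP_KERNEL);
  }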

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 include/linux/writeback.h | 32 +
 mm/page-writeback.c   | 72 ++-
 2 files changed, 59 insertions(+), 45 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 82e0e39..5af0a57e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DECLARE_PER_CPU(int, dirty_throttle_leaks);
 
@@ -87,6 +88,36 @@ struct writeback_control {
 };
 
 /*
+ * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
+ * and are measured against each other in.  There always is one global
+ * domain, global_wb_domain, that every wb in the system is a member of.
+ * This allows measuring the relative bandwidth of each wb to distribute
+ * dirtyable memory accordingly.
+ */
+struct wb_domain {
+   /*
+* Scale the writeback cache size proportional to the relative
+* writeout speed.
+*
+* We do this by keeping a floating proportion between BDIs, based
+* on page writeback completions [end_page_writeback()]. Those
+* devices that write out pages fastest will get the larger share,
+* while the slower will get a smaller share.
+*
+* We use page writeout completions because we are interested in
+* getting rid of dirty pages. Having them written out is the
+* primary goal.
+*
+* We introduce a concept of time, a period over which we measure
+* these events, because demand can/will vary over time. The length
+* of this period itself is measured in page writeback completions.
+*/
+   struct fprop_global completions;
+   struct timer_list period_timer; /* timer for aging of completions */
+   unsigned long period_time;
+};
+
+/*
  * fs/fs-writeback.c
  */
 struct bdi_writeback;
@@ -120,6 +151,7 @@ static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
+int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
 
 extern unsigned long global_dirty_limit;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d9ebabe..3c6ccc7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,29 +124,7 @@ EXPORT_SYMBOL(laptop_mode);
 
 unsigned long global_dirty_limit;
 
-/*
- * Scale the writeback cache size proportional to the relative writeout speeds.
- *
- * We do this by keeping a floating proportion between BDIs, based on page
- * writeback completions [end_page_writeback()]. Those devices that write out
- * pages fastest will get the larger share, while the slower will get a smaller
- * share.
- *
- * We use page writeout completions because we are interested in getting rid of
- * dirty pages. Having them written out is the primary goal.
- *
- * We introduce a concept of time, a period over which we measure these events,
- * because demand can/will vary over time. The length of this period itself is
- * measured in page writeback completions.
- *
- */
-static struct fprop_global writeout_completions;
-
-static void writeout_period(unsigned long t);
-/* Timer for aging of writeout_completions */
-static struct timer_list writeout_period_timer =
-   TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0);
-static unsigned long writeout_period_time = 0;
+static struct wb_domain global_wb_domain;
 
 /*
  * Length of period for aging writeout fractions of bdis. This is an
@@ -433,24 +411,26 @@ static unsigned long wp_next_time(unsigned long cur_time)
 }
 
 /*
- * Increment the BDI's writeout completion count and the global writeout
+ * Increment the wb's writeout completion count and the global writeout
  * completion count. Called from 

[PATCH 01/18] memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online

2015-03-22 Thread Tejun Heo
cpu_possible_mask represents the CPUs which are actually possible
during that boot instance.  For systems which don't support CPU
hotplug, this will match cpu_online_mask exactly in most cases.  Even
for systems which support CPU hotplug, the number of possible CPU
slots is highly unlikely to diverge greatly from the number of online
CPUs.  The only cases where the difference between possible and online
caused problems were when the boot code failed to initialize the
possible mask and left it fully set at NR_CPUS - 1.

As such, most per-cpu constructs allocate for all possible CPUs and
often iterate over the possibles, which also has the benefit of
avoiding the blocking CPU hotplug synchronization.

memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
mem_cgroup_read_events(), which iterates over online CPUs and handles
CPU hotplug operations explicitly.  This complexity doesn't actually
buy anything.  Switch to iterating over the possibles and drop the
explicit CPU hotplug handling.

Eventually, we want to convert memcg to use percpu_counter instead of
its own custom implementation, which also benefits from quick access
without summing for cases where a larger error margin is acceptable.

This will allow mem_cgroup_read_stat() to be called from non-sleepable
contexts which will be used by cgroup writeback.
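
For illustration, the percpu_counter conversion mentioned above would
look roughly like this (a sketch of the eventual direction, not part of
this patch):

  static struct percpu_counter example_stat;

  static int example_stat_init(void)
  {
          return percpu_counter_init(&example_stat, 0, GFP_KERNEL);
  }

  static void example_stat_account(s64 delta)
  {
          percpu_counter_add(&example_stat, delta);  /* cheap per-cpu add */
  }

  static s64 example_stat_read(void)
  {
          /* approximate sum; fine where a larger error margin is acceptable */
          return percpu_counter_read_positive(&example_stat);
  }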

Signed-off-by: Tejun Heo 
Cc: Johannes Weiner 
Cc: Michal Hocko 
---
 mm/memcontrol.c | 51 ++-
 1 file changed, 2 insertions(+), 49 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6fa6fe..ab483e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -323,11 +323,6 @@ struct mem_cgroup {
 * percpu counter.
 */
struct mem_cgroup_stat_cpu __percpu *stat;
-   /*
-* used when a cpu is offlined or other synchronizations
-* See mem_cgroup_read_stat().
-*/
-   struct mem_cgroup_stat_cpu nocpu_base;
spinlock_t pcp_counter_lock;
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@@ -808,15 +803,8 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
long val = 0;
int cpu;
 
-   get_online_cpus();
-   for_each_online_cpu(cpu)
+   for_each_possible_cpu(cpu)
val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-   spin_lock(&memcg->pcp_counter_lock);
-   val += memcg->nocpu_base.count[idx];
-   spin_unlock(&memcg->pcp_counter_lock);
-#endif
-   put_online_cpus();
return val;
 }
 
@@ -826,15 +814,8 @@ static unsigned long mem_cgroup_read_events(struct 
mem_cgroup *memcg,
unsigned long val = 0;
int cpu;
 
-   get_online_cpus();
-   for_each_online_cpu(cpu)
+   for_each_possible_cpu(cpu)
val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-   spin_lock(&memcg->pcp_counter_lock);
-   val += memcg->nocpu_base.events[idx];
-   spin_unlock(&memcg->pcp_counter_lock);
-#endif
-   put_online_cpus();
return val;
 }
 
@@ -2182,37 +2163,12 @@ static void drain_all_stock(struct mem_cgroup 
*root_memcg)
mutex_unlock(&percpu_charge_mutex);
 }
 
-/*
- * This function drains percpu counter value from DEAD cpu and
- * move it to local cpu. Note that this function can be preempted.
- */
-static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
-{
-   int i;
-
-   spin_lock(&memcg->pcp_counter_lock);
-   for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-   long x = per_cpu(memcg->stat->count[i], cpu);
-
-   per_cpu(memcg->stat->count[i], cpu) = 0;
-   memcg->nocpu_base.count[i] += x;
-   }
-   for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
-   unsigned long x = per_cpu(memcg->stat->events[i], cpu);
-
-   per_cpu(memcg->stat->events[i], cpu) = 0;
-   memcg->nocpu_base.events[i] += x;
-   }
-   spin_unlock(&memcg->pcp_counter_lock);
-}
-
 static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
unsigned long action,
void *hcpu)
 {
int cpu = (unsigned long)hcpu;
struct memcg_stock_pcp *stock;
-   struct mem_cgroup *iter;
 
if (action == CPU_ONLINE)
return NOTIFY_OK;
@@ -2220,9 +2176,6 @@ static int memcg_cpu_hotplug_callback(struct 
notifier_block *nb,
if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
return NOTIFY_OK;
 
-   for_each_mem_cgroup(iter)
-   mem_cgroup_drain_pcp_counter(iter, cpu);
-
stock = &per_cpu(memcg_stock, cpu);
drain_stock(stock);
return NOTIFY_OK;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 05/18] writeback: consolidate dirty throttle parameters into dirty_throttle_control

2015-03-22 Thread Tejun Heo
Dirty throttling implemented in balance_dirty_pages() and its
subroutines makes use of a number of parameters which are passed
around individually.  This renders these functions somewhat unwieldy
and makes it difficult to add or change the involved parameters.  Also
some functions use different or conflicting naming schemes for the
same parameters, making the code confusing to follow.

This patch consolidates the main parameters into struct
dirty_throttle_control so that they can be passed around easily and
adding new parameters isn't painful.  This also unifies how a given
parameter is named and accessed.  The drawback of using this type of
control structure rather than explicit parameters is that it isn't
immediately obvious which function accesses and modifies what;
however, it's fairly clear that the benefits outweigh the drawbacks here.

GDTC_INIT() macro is provided to ease initializing
dirty_throttle_control for the global_wb_domain and
balance_dirty_pages() uses a separate pointer to point to its global
dirty_throttle_control.  This is to make it uniform with memcg domain
handling which will be added later.

This patch doesn't introduce any behavioral changes.
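
As a sketch of the resulting calling convention (mirroring what
balance_dirty_pages() ends up doing once the later patches land; names
and details are simplified):

  static void example_throttle(struct bdi_writeback *wb)
  {
          struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
          struct dirty_throttle_control * const gdtc = &gdtc_stor;

          gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
                        global_page_state(NR_UNSTABLE_NFS) +
                        global_page_state(NR_WRITEBACK);
          global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh);

          /* subroutines now take the single control struct */
          wb_dirty_limits(gdtc);  /* single-argument form from the next patch */
  }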

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 212 +---
 1 file changed, 101 insertions(+), 111 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 06c5d3a..b8e95a4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,6 +124,20 @@ EXPORT_SYMBOL(laptop_mode);
 
 struct wb_domain global_wb_domain;
 
+/* consolidated parameters for balance_dirty_pages() and its subroutines */
+struct dirty_throttle_control {
+   struct bdi_writeback*wb;
+
+   unsigned long   dirty;  /* file_dirty + write + nfs */
+   unsigned long   thresh; /* dirty threshold */
+   unsigned long   bg_thresh;  /* dirty background threshold */
+
+   unsigned long   wb_dirty;   /* per-wb counterparts */
+   unsigned long   wb_thresh;
+};
+
+#define GDTC_INIT(__wb).wb = (__wb)
+
 /*
  * Length of period for aging writeout fractions of bdis. This is an
  * arbitrarily chosen number. The longer the period, the slower fractions will
@@ -695,16 +709,13 @@ static long long pos_ratio_polynom(unsigned long setpoint,
  *   card's wb_dirty may rush to many times higher than wb_setpoint.
  * - the wb dirty thresh drops quickly due to change of JBOD workload
  */
-static unsigned long wb_position_ratio(struct bdi_writeback *wb,
-  unsigned long thresh,
-  unsigned long bg_thresh,
-  unsigned long dirty,
-  unsigned long wb_thresh,
-  unsigned long wb_dirty)
+static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
 {
+   struct bdi_writeback *wb = dtc->wb;
unsigned long write_bw = wb->avg_write_bandwidth;
-   unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
-   unsigned long limit = hard_dirty_limit(thresh);
+   unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, 
dtc->bg_thresh);
+   unsigned long limit = hard_dirty_limit(dtc->thresh);
+   unsigned long wb_thresh = dtc->wb_thresh;
unsigned long x_intercept;
unsigned long setpoint; /* dirty pages' target balance point */
unsigned long wb_setpoint;
@@ -712,7 +723,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback 
*wb,
long long pos_ratio;/* for scaling up/down the rate limit */
long x;
 
-   if (unlikely(dirty >= limit))
+   if (unlikely(dtc->dirty >= limit))
return 0;
 
/*
@@ -721,7 +732,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback 
*wb,
 * See comment for pos_ratio_polynom().
 */
setpoint = (freerun + limit) / 2;
-   pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
+   pos_ratio = pos_ratio_polynom(setpoint, dtc->dirty, limit);
 
/*
 * The strictlimit feature is a tool preventing mistrusted filesystems
@@ -752,20 +763,21 @@ static unsigned long wb_position_ratio(struct 
bdi_writeback *wb,
long long wb_pos_ratio;
unsigned long wb_bg_thresh;
 
-   if (wb_dirty < 8)
+   if (dtc->wb_dirty < 8)
return min_t(long long, pos_ratio * 2,
 2 << RATELIMIT_CALC_SHIFT);
 
-   if (wb_dirty >= wb_thresh)
+   if (dtc->wb_dirty >= wb_thresh)
return 0;
 
-   wb_bg_thresh = div_u64((u64)wb_thresh * bg_thresh, thresh);
+   wb_bg_thresh = div_u64((u64)wb_thresh * dtc->bg_thresh,
+ 

[PATCH 06/18] writeback: add dirty_throttle_control->wb_bg_thresh

2015-03-22 Thread Tejun Heo
wb_bg_thresh is currently treated as a second-class citizen.  It's
only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages()
doesn't calculate it unless the cap is set.  When the cap is set, the
calculated value is not passed around but instead recalculated
whenever it's used.

wb_position_ratio() calculates it by scaling wb_thresh proportional to
bg_thresh / thresh.  wb_update_dirty_ratelimit() uses wb_dirty_limit()
on bg_thresh, which should generally lead to a similar result as the
proportional scaling but can also be way off in the presence of
max/min_ratio settings.

Avoiding wb_bg_thresh calculation saves us one u64 multiplication and
division when BDI_CAP_STRICTLIMIT is not set.  Given that
balance_dirty_pages() is already ratelimited, this doesn't justify the
incurred extra complexity.

This patch adds wb_bg_thresh to dirty_throttle_control and makes
wb_dirty_limits() always calculate it and updates the users to use the
pre-calculated value.
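
The scaling itself is just wb_thresh taken in the same proportion as
bg_thresh is to thresh, e.g. (sketch with made-up numbers):

  /* thresh = 1000, bg_thresh = 500, wb_thresh = 200 pages
     => wb_bg_thresh = 200 * 500 / 1000 = 100 pages */
  static unsigned long example_wb_bg_thresh(const struct dirty_throttle_control *dtc)
  {
          if (!dtc->thresh)
                  return 0;
          return div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh);
  }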

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 27 +++
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b8e95a4..00218e9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -134,6 +134,7 @@ struct dirty_throttle_control {
 
unsigned long   wb_dirty;   /* per-wb counterparts */
unsigned long   wb_thresh;
+   unsigned long   wb_bg_thresh;
 };
 
 #define GDTC_INIT(__wb).wb = (__wb)
@@ -761,7 +762,6 @@ static unsigned long wb_position_ratio(struct 
dirty_throttle_control *dtc)
 */
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
long long wb_pos_ratio;
-   unsigned long wb_bg_thresh;
 
if (dtc->wb_dirty < 8)
return min_t(long long, pos_ratio * 2,
@@ -770,9 +770,8 @@ static unsigned long wb_position_ratio(struct 
dirty_throttle_control *dtc)
if (dtc->wb_dirty >= wb_thresh)
return 0;
 
-   wb_bg_thresh = div_u64((u64)wb_thresh * dtc->bg_thresh,
-  dtc->thresh);
-   wb_setpoint = dirty_freerun_ceiling(wb_thresh, wb_bg_thresh);
+   wb_setpoint = dirty_freerun_ceiling(wb_thresh,
+   dtc->wb_bg_thresh);
 
if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
return 0;
@@ -1104,15 +1103,14 @@ static void wb_update_dirty_ratelimit(struct 
dirty_throttle_control *dtc,
 *
 * We rampup dirty_ratelimit forcibly if wb_dirty is low because
 * it's possible that wb_thresh is close to zero due to inactivity
-* of backing device (see the implementation of wb_dirty_limit()).
+* of backing device.
 */
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
dirty = dtc->wb_dirty;
if (dtc->wb_dirty < 8)
setpoint = dtc->wb_dirty + 1;
else
-   setpoint = (dtc->wb_thresh +
-   wb_dirty_limit(wb, dtc->bg_thresh)) / 2;
+   setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
}
 
if (dirty < setpoint) {
@@ -1307,8 +1305,7 @@ static long wb_min_pause(struct bdi_writeback *wb,
return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
 }
 
-static inline void wb_dirty_limits(struct dirty_throttle_control *dtc,
-  unsigned long *wb_bg_thresh)
+static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 {
struct bdi_writeback *wb = dtc->wb;
unsigned long wb_reclaimable;
@@ -1327,11 +1324,8 @@ static inline void wb_dirty_limits(struct 
dirty_throttle_control *dtc,
 *   at some rate <= (write_bw / 2) for bringing down wb_dirty.
 */
dtc->wb_thresh = wb_dirty_limit(dtc->wb, dtc->thresh);
-
-   if (wb_bg_thresh)
-   *wb_bg_thresh = dtc->thresh ? div_u64((u64)dtc->wb_thresh *
- dtc->bg_thresh,
- dtc->thresh) : 0;
+   dtc->wb_bg_thresh = dtc->thresh ?
+   div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
 
/*
 * In order to avoid the stacked BDI deadlock we need
@@ -1396,10 +1390,11 @@ static void balance_dirty_pages(struct address_space 
*mapping,
global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh);

if (unlikely(strictlimit)) {
-   wb_dirty_limits(gdtc, &bg_thresh);
+   wb_dirty_limits(gdtc);
 
dirty = gdtc->wb_dirty;
thresh = gdtc->wb_thresh;
+   bg_thresh = gdtc->wb_bg_thresh;
   

[PATCH 07/18] writeback: make __wb_dirty_limit() take dirty_throttle_control

2015-03-22 Thread Tejun Heo
wb_dirty_limit() calculates wb_dirty by scaling thresh according to
the wb's portion in the system-wide write bandwidth.  cgroup writeback
support would need to calculate wb_dirty against memcg domain too.
This patch renames wb_dirty_limit() to __wb_dirty_limit() and makes it
take dirty_throttle_control so that the function can later be updated
to calculate against different domains according to
dirty_throttle_control.

wb_dirty_limit() is now a thin wrapper around __wb_dirty_limit().

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 00218e9..a4b6dab 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -557,9 +557,8 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
 }
 
 /**
- * wb_dirty_limit - @wb's share of dirty throttling threshold
- * @wb: bdi_writeback to query
- * @dirty: global dirty limit in pages
+ * __wb_dirty_limit - @wb's share of dirty throttling threshold
+ * @dtc: dirty_throttle_context of interest
  *
  * Returns @wb's dirty limit in pages. The term "dirty" in the context of
  * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
@@ -578,9 +577,10 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
  * The wb's share of dirty limit will be adapting to its throughput and
  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
  */
-unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
+static unsigned long __wb_dirty_limit(struct dirty_throttle_control *dtc)
 {
struct wb_domain *dom = &global_wb_domain;
+   unsigned long dirty = dtc->dirty;
u64 wb_dirty;
long numerator, denominator;
unsigned long wb_min_ratio, wb_max_ratio;
@@ -588,14 +588,14 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, 
unsigned long dirty)
/*
 * Calculate this BDI's share of the dirty ratio.
 */
-   fprop_fraction_percpu(&dom->completions, &wb->completions,
+   fprop_fraction_percpu(&dom->completions, &dtc->wb->completions,
  &numerator, &denominator);
 
wb_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
wb_dirty *= numerator;
do_div(wb_dirty, denominator);
 
-   wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
+   wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);
 
wb_dirty += (dirty * wb_min_ratio) / 100;
if (wb_dirty > (dirty * wb_max_ratio) / 100)
@@ -604,6 +604,13 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, 
unsigned long dirty)
return wb_dirty;
 }
 
+unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
+{
+   struct dirty_throttle_control gdtc = { GDTC_INIT(wb), .dirty = dirty };
+
+   return __wb_dirty_limit(&gdtc);
+}
+
 /*
  *   setpoint - dirty 3
  *f(dirty) := 1.0 + ()
@@ -1323,7 +1330,7 @@ static inline void wb_dirty_limits(struct 
dirty_throttle_control *dtc)
 *   wb_position_ratio() will let the dirtier task progress
 *   at some rate <= (write_bw / 2) for bringing down wb_dirty.
 */
-   dtc->wb_thresh = wb_dirty_limit(dtc->wb, dtc->thresh);
+   dtc->wb_thresh = __wb_dirty_limit(dtc);
dtc->wb_bg_thresh = dtc->thresh ?
div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 09/18] writeback: add dirty_throttle_control->wb_completions

2015-03-22 Thread Tejun Heo
wb->completions measures the wb's proportional write bandwidth in
global_wb_domain and is thus naturally tied to the wb_domain.  This patch
adds dirty_throttle_control->wb_completions which is initialized to
wb->completions by GDTC_INIT() and updates __wb_dirty_limits() to use
it instead of dereferencing wb->completions directly.

This will allow dirty_throttle_control to represent different
wb_domains and the matching wb completions.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ac2d7b1..1f216cf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -127,6 +127,7 @@ struct wb_domain global_wb_domain;
 /* consolidated parameters for balance_dirty_pages() and its subroutines */
 struct dirty_throttle_control {
struct bdi_writeback*wb;
+   struct fprop_local_percpu *wb_completions;
 
unsigned long   dirty;  /* file_dirty + write + nfs */
unsigned long   thresh; /* dirty threshold */
@@ -139,7 +140,8 @@ struct dirty_throttle_control {
unsigned long   pos_ratio;
 };
 
-#define GDTC_INIT(__wb).wb = (__wb)
+#define GDTC_INIT(__wb).wb = (__wb),   
\
+   .wb_completions = &(__wb)->completions
 
 /*
  * Length of period for aging writeout fractions of bdis. This is an
@@ -590,7 +592,7 @@ static unsigned long __wb_dirty_limit(struct 
dirty_throttle_control *dtc)
/*
 * Calculate this BDI's share of the dirty ratio.
 */
-   fprop_fraction_percpu(&dom->completions, &dtc->wb->completions,
+   fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
  &numerator, &denominator);
 
wb_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 04/18] writeback: move global_dirty_limit into wb_domain

2015-03-22 Thread Tejun Heo
This patch is a part of the series to define wb_domain which
represents a domain that wb's (bdi_writeback's) belong to and are
measured against each other in.  This will enable IO backpressure
propagation for cgroup writeback.

global_dirty_limit exists to regulate the global dirty threshold which
is a property of the wb_domain.  This patch moves hard_dirty_limit,
dirty_lock, and update_time into wb_domain.

This is pure reorganization and doesn't introduce any behavioral
changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/fs-writeback.c|  2 +-
 include/linux/writeback.h| 17 ++-
 include/trace/events/writeback.h |  7 +++---
 mm/page-writeback.c  | 46 
 4 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3d9b360..6232ae9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -878,7 +878,7 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
pages = LONG_MAX;
else {
pages = min(wb->avg_write_bandwidth / 2,
-   global_dirty_limit / DIRTY_SCOPE);
+   global_wb_domain.dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
pages = round_down(pages + MIN_WRITEBACK_PAGES,
   MIN_WRITEBACK_PAGES);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5af0a57e..ff627d6 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -95,6 +95,8 @@ struct writeback_control {
  * dirtyable memory accordingly.
  */
 struct wb_domain {
+   spinlock_t lock;
+
/*
 * Scale the writeback cache size proportional to the relative
 * writeout speed.
@@ -115,6 +117,19 @@ struct wb_domain {
struct fprop_global completions;
struct timer_list period_timer; /* timer for aging of completions */
unsigned long period_time;
+
+   /*
+* The dirtyable memory and dirty threshold could be suddenly
+* knocked down by a large amount (eg. on the startup of KVM in a
+* swapless system). This may throw the system into deep dirty
+* exceeded state and throttle heavy/light dirtiers alike. To
+* retain good responsiveness, maintain global_dirty_limit for
+* tracking slowly down to the knocked down dirty threshold.
+*
+* Both fields are protected by ->lock.
+*/
+   unsigned long dirty_limit_tstamp;
+   unsigned long dirty_limit;
 };
 
 /*
@@ -153,7 +168,7 @@ void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
 
-extern unsigned long global_dirty_limit;
+extern struct wb_domain global_wb_domain;
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 5c9a68c..d5ac3dd 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -344,7 +344,7 @@ TRACE_EVENT(global_dirty_state,
__entry->nr_written = global_page_state(NR_WRITTEN);
__entry->background_thresh = background_thresh;
__entry->dirty_thresh   = dirty_thresh;
-   __entry->dirty_limit = global_dirty_limit;
+   __entry->dirty_limit= global_wb_domain.dirty_limit;
),
 
TP_printk("dirty=%lu writeback=%lu unstable=%lu "
@@ -446,8 +446,9 @@ TRACE_EVENT(balance_dirty_pages,
unsigned long freerun = (thresh + bg_thresh) / 2;
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
 
-   __entry->limit  = global_dirty_limit;
-   __entry->setpoint   = (global_dirty_limit + freerun) / 2;
+   __entry->limit  = global_wb_domain.dirty_limit;
+   __entry->setpoint   = (global_wb_domain.dirty_limit +
+   freerun) / 2;
__entry->dirty  = dirty;
__entry->bdi_setpoint   = __entry->setpoint *
bdi_thresh / (thresh + 1);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3c6ccc7..06c5d3a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -122,9 +122,7 @@ EXPORT_SYMBOL(laptop_mode);
 
 /* End of sysctl-exported parameters */
 
-unsigned long global_dirty_limit;
-
-static struct wb_domain global_wb_domain;
+struct wb_domain global_wb_domain;
 
 /*
  * Length of period for aging writeout fractions of bdis. This is an
@@ -470,9 +468,15 @@ static void writeout_period(unsigned long t)
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp)
 {
memset(dom, 0, sizeof(*dom));
+
+   spin_lock_init(&dom->lock);
+

[PATCH] ubifs: return err value than 0

2015-03-22 Thread Sanidhya Kashyap
Currently, ubifs_readpage() returns 0 even if ubifs_bulk_read()
fails.  Like other file systems, the error value should be
propagated to the caller instead of 0.

Another missing check is returning -ENOMEM when the kmalloc() in
ubifs_bulk_read() fails.

Signed-off-by: Sanidhya Kashyap 
---
 fs/ubifs/file.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index e627c0a..28fe892 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -863,8 +863,10 @@ static int ubifs_bulk_read(struct page *page)
bu = &c->bu;
else {
bu = kmalloc(sizeof(struct bu_info), GFP_NOFS | __GFP_NOWARN);
-   if (!bu)
+   if (!bu) {
+   err = -ENOMEM;
goto out_unlock;
+   }
 
bu->buf = NULL;
allocated = 1;
@@ -887,11 +889,14 @@ out_unlock:
 
 static int ubifs_readpage(struct file *file, struct page *page)
 {
-   if (ubifs_bulk_read(page))
-   return 0;
+   int err = 0;
+
+   err = ubifs_bulk_read(page);
+   if (err)
+   return err;
do_readpage(page);
unlock_page(page);
-   return 0;
+   return err;
 }
 
 static int do_writepage(struct page *page, int len)
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 10/18] writeback: add dirty_throttle_control->dom

2015-03-22 Thread Tejun Heo
Currently all dirty throttle operations use global_wb_domain; however,
cgroup writeback support requires considering per-memcg wb_domain too.
This patch adds dirty_throttle_control->dom and updates functions
which are directly using global_wb_domain to use it instead.

As this makes global_update_bandwidth() a misnomer, the function is
renamed to domain_update_bandwidth().

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1f216cf..840b8f2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -126,6 +126,9 @@ struct wb_domain global_wb_domain;
 
 /* consolidated parameters for balance_dirty_pages() and its subroutines */
 struct dirty_throttle_control {
+#ifdef CONFIG_CGROUP_WRITEBACK
+   struct wb_domain*dom;
+#endif
struct bdi_writeback*wb;
struct fprop_local_percpu *wb_completions;
 
@@ -140,7 +143,7 @@ struct dirty_throttle_control {
unsigned long   pos_ratio;
 };
 
-#define GDTC_INIT(__wb).wb = (__wb),   
\
+#define DTC_INIT_COMMON(__wb)  .wb = (__wb),   \
.wb_completions = &(__wb)->completions
 
 /*
@@ -152,6 +155,14 @@ struct dirty_throttle_control {
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
+#define GDTC_INIT(__wb).dom = &global_wb_domain,
\
+   DTC_INIT_COMMON(__wb)
+
+static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
+{
+   return dtc->dom;
+}
+
 static void wb_min_max_ratio(struct bdi_writeback *wb,
 unsigned long *minp, unsigned long *maxp)
 {
@@ -181,6 +192,13 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
+#define GDTC_INIT(__wb)DTC_INIT_COMMON(__wb)
+
+static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
+{
+   return &global_wb_domain;
+}
+
 static void wb_min_max_ratio(struct bdi_writeback *wb,
 unsigned long *minp, unsigned long *maxp)
 {
@@ -583,7 +601,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
  */
 static unsigned long __wb_dirty_limit(struct dirty_throttle_control *dtc)
 {
-   struct wb_domain *dom = &global_wb_domain;
+   struct wb_domain *dom = dtc_dom(dtc);
unsigned long dirty = dtc->dirty;
u64 wb_dirty;
long numerator, denominator;
@@ -952,7 +970,7 @@ out:
 
 static void update_dirty_limit(struct dirty_throttle_control *dtc)
 {
-   struct wb_domain *dom = &global_wb_domain;
+   struct wb_domain *dom = dtc_dom(dtc);
unsigned long thresh = dtc->thresh;
unsigned long limit = dom->dirty_limit;
 
@@ -979,10 +997,10 @@ update:
dom->dirty_limit = limit;
 }
 
-static void global_update_bandwidth(struct dirty_throttle_control *dtc,
+static void domain_update_bandwidth(struct dirty_throttle_control *dtc,
unsigned long now)
 {
-   struct wb_domain *dom = &global_wb_domain;
+   struct wb_domain *dom = dtc_dom(dtc);
 
/*
 * check locklessly first to optimize away locking for the most time
@@ -1190,7 +1208,7 @@ static void __wb_update_bandwidth(struct 
dirty_throttle_control *dtc,
goto snapshot;
 
if (update_ratelimit) {
-   global_update_bandwidth(dtc, now);
+   domain_update_bandwidth(dtc, now);
wb_update_dirty_ratelimit(dtc, dirtied, elapsed);
}
wb_update_write_bandwidth(wb, elapsed, written);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 12/18] writeback: separate out domain_dirty_limits()

2015-03-22 Thread Tejun Heo
global_dirty_limits() calculates thresh and bg_thresh (confusingly
called *pdirty and *pbackground in the function) assuming
global_wb_domain; however, cgroup writeback support requires
considering per-memcg wb_domain too.

This patch separates out domain_dirty_limits() which takes
dirty_throttle_control out of global_dirty_limits().  As thresh and
bg_thresh calculation needs the amount of dirtyable memory in the
domain, dirty_throttle_control->avail is added.  The new function
calculates the two thresholds and store them directly in the
dirty_throttle_control.

Also, memcg domains can't follow vm_dirty_bytes and
dirty_background_bytes settings directly.  If those are set and
domain_dirty_limits() is invoked for a !global domain, the settings
are translated to ratios by scaling them against globally available
memory.  dirty_throttle_control->gdtc is added to enable this when
CONFIG_CGROUP_WRITEBACK.

global_dirty_limits() is now a thin wrapper around
domain_dirty_limits() and balance_dirty_pages() is updated to use the
new function too.

This patch doesn't introduce any behavioral changes.
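
As an illustration of the byte-to-ratio translation described above
(not the patch's exact code; rounding and granularity may differ):

  /* global_avail is the global domain's ->avail in pages */
  static unsigned long example_bytes_to_ratio(unsigned long bytes,
                                              unsigned long global_avail)
  {
          unsigned long pages = DIV_ROUND_UP(bytes, PAGE_SIZE);

          if (!global_avail)
                  return 100;
          /* e.g. vm_dirty_bytes = 1 GiB on a 16 GiB machine => ~6% */
          return min(100UL, pages * 100 / global_avail);
  }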

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 111 
 1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e6c7572..7e9922f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -128,10 +128,12 @@ struct wb_domain global_wb_domain;
 struct dirty_throttle_control {
 #ifdef CONFIG_CGROUP_WRITEBACK
struct wb_domain*dom;
+   struct dirty_throttle_control *gdtc;/* only set in memcg dtc's */
 #endif
struct bdi_writeback*wb;
struct fprop_local_percpu *wb_completions;
 
+   unsigned long   avail;  /* dirtyable */
unsigned long   dirty;  /* file_dirty + write + nfs */
unsigned long   thresh; /* dirty threshold */
unsigned long   bg_thresh;  /* dirty background threshold */
@@ -157,12 +159,18 @@ struct dirty_throttle_control {
 
 #define GDTC_INIT(__wb).dom = &global_wb_domain,
\
DTC_INIT_COMMON(__wb)
+#define GDTC_INIT_NO_WB.dom = &global_wb_domain
 
 static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
 {
return dtc->dom;
 }
 
+static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control 
*mdtc)
+{
+   return mdtc->gdtc;
+}
+
 static void wb_min_max_ratio(struct bdi_writeback *wb,
 unsigned long *minp, unsigned long *maxp)
 {
@@ -193,12 +201,18 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
 #define GDTC_INIT(__wb)DTC_INIT_COMMON(__wb)
+#define GDTC_INIT_NO_WB
 
 static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
 {
return &global_wb_domain;
 }
 
+static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control 
*mdtc)
+{
+   return NULL;
+}
+
 static void wb_min_max_ratio(struct bdi_writeback *wb,
 unsigned long *minp, unsigned long *maxp)
 {
@@ -303,42 +317,88 @@ static unsigned long global_dirtyable_memory(void)
return x + 1;   /* Ensure that we never return 0 */
 }
 
-/*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
+/**
+ * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain
+ * @dtc: dirty_throttle_control of interest
  *
- * Calculate the dirty thresholds based on sysctl parameters
- * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
- * - vm.dirty_ratio or  vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * Calculate @dtc->thresh and ->bg_thresh considering
+ * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}.  The caller
+ * must ensure that @dtc->avail is set before calling this function.  The
+ * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
  * real-time tasks.
  */
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
-{
-   const unsigned long available_memory = global_dirtyable_memory();
-   unsigned long background;
-   unsigned long dirty;
+static void domain_dirty_limits(struct dirty_throttle_control *dtc)
+{
+   const unsigned long available_memory = dtc->avail;
+   struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
+   unsigned long bytes = vm_dirty_bytes;
+   unsigned long bg_bytes = dirty_background_bytes;
+   unsigned long ratio = vm_dirty_ratio;
+   unsigned long bg_ratio = dirty_background_ratio;
+   unsigned long thresh;
+   unsigned long bg_thresh;
struct task_struct *tsk;
 
-   if (vm_dirty_bytes)
-   dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+   /* gdtc is 

[PATCH 08/18] writeback: add dirty_throttle_control->pos_ratio

2015-03-22 Thread Tejun Heo
wb_position_ratio() is used to calculate pos_ratio, which is used for
two purposes.  wb_update_dirty_ratelimit() uses it to adjust
wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to
immediately adjust dirty_ratelimit right before applying it to
determine pause duration.

While wb_update_dirty_ratelimit() is separately rate limited from
balance_dirty_pages(), on the run where the ratelimit is updated, we
end up calculating pos_ratio twice with the same parameters.

This patch adds dirty_throttle_control->pos_ratio.
balance_dirty_pages() calculates it once per run and
wb_update_dirty_ratelimit() uses the value stored in
dirty_throttle_control.

This removes the duplicate calculation and also will help implementing
memcg wb_domain.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 36 +---
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a4b6dab..ac2d7b1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -135,6 +135,8 @@ struct dirty_throttle_control {
unsigned long   wb_dirty;   /* per-wb counterparts */
unsigned long   wb_thresh;
unsigned long   wb_bg_thresh;
+
+   unsigned long   pos_ratio;
 };
 
 #define GDTC_INIT(__wb).wb = (__wb)
@@ -717,7 +719,7 @@ static long long pos_ratio_polynom(unsigned long setpoint,
  *   card's wb_dirty may rush to many times higher than wb_setpoint.
  * - the wb dirty thresh drops quickly due to change of JBOD workload
  */
-static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
+static void wb_position_ratio(struct dirty_throttle_control *dtc)
 {
struct bdi_writeback *wb = dtc->wb;
unsigned long write_bw = wb->avg_write_bandwidth;
@@ -731,8 +733,10 @@ static unsigned long wb_position_ratio(struct 
dirty_throttle_control *dtc)
long long pos_ratio;/* for scaling up/down the rate limit */
long x;
 
+   dtc->pos_ratio = 0;
+
if (unlikely(dtc->dirty >= limit))
-   return 0;
+   return;
 
/*
 * global setpoint
@@ -770,18 +774,20 @@ static unsigned long wb_position_ratio(struct 
dirty_throttle_control *dtc)
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
long long wb_pos_ratio;
 
-   if (dtc->wb_dirty < 8)
-   return min_t(long long, pos_ratio * 2,
-2 << RATELIMIT_CALC_SHIFT);
+   if (dtc->wb_dirty < 8) {
+   dtc->pos_ratio = min_t(long long, pos_ratio * 2,
+  2 << RATELIMIT_CALC_SHIFT);
+   return;
+   }
 
if (dtc->wb_dirty >= wb_thresh)
-   return 0;
+   return;
 
wb_setpoint = dirty_freerun_ceiling(wb_thresh,
dtc->wb_bg_thresh);
 
if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
-   return 0;
+   return;
 
wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty,
 wb_thresh);
@@ -807,7 +813,8 @@ static unsigned long wb_position_ratio(struct 
dirty_throttle_control *dtc)
 * is 2. We might want to tweak this if we observe the control
 * system is too slow to adapt.
 */
-   return min(pos_ratio, wb_pos_ratio);
+   dtc->pos_ratio = min(pos_ratio, wb_pos_ratio);
+   return;
}
 
/*
@@ -888,7 +895,7 @@ static unsigned long wb_position_ratio(struct 
dirty_throttle_control *dtc)
pos_ratio *= 8;
}
 
-   return pos_ratio;
+   dtc->pos_ratio = pos_ratio;
 }
 
 static void wb_update_write_bandwidth(struct bdi_writeback *wb,
@@ -1009,7 +1016,6 @@ static void wb_update_dirty_ratelimit(struct 
dirty_throttle_control *dtc,
unsigned long dirty_rate;
unsigned long task_ratelimit;
unsigned long balanced_dirty_ratelimit;
-   unsigned long pos_ratio;
unsigned long step;
unsigned long x;
 
@@ -1019,12 +1025,11 @@ static void wb_update_dirty_ratelimit(struct 
dirty_throttle_control *dtc,
 */
dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;
 
-   pos_ratio = wb_position_ratio(dtc);
/*
 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
 */
task_ratelimit = (u64)dirty_ratelimit *
-   pos_ratio >> RATELIMIT_CALC_SHIFT;
+   dtc->pos_ratio >> RATELIMIT_CALC_SHIFT;
task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */
 

[PATCH 17/18] writeback: implement memcg writeback domain based throttling

2015-03-22 Thread Tejun Heo
While cgroup writeback support now connects memcg and blkcg so that
writeback IOs are properly attributed and controlled, the IO back
pressure propagation mechanism implemented in balance_dirty_pages()
and its subroutines wasn't aware of cgroup writeback.

Processes belonging to a memcg may have access to only a subset of the total
memory available in the system and not factoring this into dirty
throttling rendered it completely ineffective for processes under
memcg limits and memcg ended up building a separate ad-hoc degenerate
mechanism directly into vmscan code to limit page dirtying.

The previous patches updated balance_dirty_pages() and its subroutines
so that they can deal with multiple wb_domain's (writeback domains)
and defined per-memcg wb_domain.  Processes belonging to a non-root
memcg are bound to two wb_domains, global wb_domain and memcg
wb_domain, and should be throttled according to IO pressures from both
domains.  This patch updates dirty throttling code so that it repeats
similar calculations for the two domains - the differences between the
two are few and minor - and applies the lower of the two sets of
resulting constraints.

wb_over_bg_thresh(), which controls when background writeback
terminates, is also updated to consider both global and memcg
wb_domains.  It returns true if dirty is over bg_thresh for either
domain.

This makes the dirty throttling mechanism operational for memcg
domains including writeback-bandwidth-proportional dirty page
distribution inside them but the ad-hoc memcg throttling mechanism in
vmscan is still in place.  The next patch will rip it out.
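
Schematically (a sketch, not the patch's exact code), after both sets
of limits are computed the throttling path simply follows whichever
domain is more constrained:

  /* pick the dtc whose pos_ratio demands stronger throttling */
  static struct dirty_throttle_control *
  example_pick_stricter(struct dirty_throttle_control *gdtc,
                        struct dirty_throttle_control *mdtc)
  {
          if (mdtc && mdtc->pos_ratio < gdtc->pos_ratio)
                  return mdtc;    /* memcg domain is the bottleneck */
          return gdtc;            /* global domain governs */
  }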

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 include/linux/memcontrol.h |   9 +++
 mm/memcontrol.c|  43 
 mm/page-writeback.c| 158 ++---
 3 files changed, 188 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e3177be..c3eb19e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -392,6 +392,8 @@ enum {
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
+unsigned long *pdirty, unsigned long *pwriteback);
 
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
@@ -400,6 +402,13 @@ static inline struct wb_domain 
*mem_cgroup_wb_domain(struct bdi_writeback *wb)
return NULL;
 }
 
+static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
+  unsigned long *pavail,
+  unsigned long *pdirty,
+  unsigned long *pwriteback)
+{
+}
+
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 108acfc..d76f61c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4111,6 +4111,49 @@ struct wb_domain *mem_cgroup_wb_domain(struct 
bdi_writeback *wb)
return &memcg->cgwb_domain;
 }
 
+/**
+ * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
+ * @wb: bdi_writeback in question
+ * @pavail: out parameter for number of available pages
+ * @pdirty: out parameter for number of dirty pages
+ * @pwriteback: out parameter for number of pages under writeback
+ *
+ * Determine the numbers of available, dirty, and writeback pages in @wb's
+ * memcg.  Dirty and writeback are self-explanatory.  Available is a bit
+ * more involved.
+ *
+ * A memcg's headroom is "min(max, high) - used".  The available memory is
+ * calculated as the lowest headroom of itself and the ancestors plus the
+ * number of pages already being used for file pages.  Note that this
+ * doesn't consider the actual amount of available memory in the system.
+ * The caller should further cap *@pavail accordingly.
+ */
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
+unsigned long *pdirty, unsigned long *pwriteback)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
+   struct mem_cgroup *parent;
+   unsigned long head_room = PAGE_COUNTER_MAX;
+   unsigned long file_pages;
+
+   *pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+
+   /* this should eventually include NR_UNSTABLE_NFS */
+   *pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+
+   file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
+   (1 << LRU_ACTIVE_FILE));
+   while ((parent = parent_mem_cgroup(memcg))) {
+   unsigned long ceiling = min(memcg->memory.limit, memcg->high);
+   unsigned long used = page_counter_read(&memcg->memory);
+
+   head_room = min(head_room, ceiling - min(ceiling, used));
+   memcg = parent;
+   }
+

[PATCH 13/18] writeback: move over_bground_thresh() to mm/page-writeback.c

2015-03-22 Thread Tejun Heo
and rename it to wb_over_bg_thresh().  The function is closely tied to
the dirty throttling mechanism implemented in page-writeback.c.  This
relocation will allow future updates necessary for cgroup writeback
support.

While at it, add a function comment.

This is pure reorganization and doesn't introduce any behavioral
changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/fs-writeback.c | 20 ++--
 include/linux/writeback.h |  1 +
 mm/page-writeback.c   | 23 +++
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6232ae9..683bd92 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1062,22 +1062,6 @@ static long writeback_inodes_wb(struct bdi_writeback 
*wb, long nr_pages,
return nr_pages - work.nr_pages;
 }
 
-static bool over_bground_thresh(struct bdi_writeback *wb)
-{
-   unsigned long background_thresh, dirty_thresh;
-
-   global_dirty_limits(&background_thresh, &dirty_thresh);
-
-   if (global_page_state(NR_FILE_DIRTY) +
-   global_page_state(NR_UNSTABLE_NFS) > background_thresh)
-   return true;
-
-   if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
-   return true;
-
-   return false;
-}
-
 /*
  * Explicit flushing or periodic writeback of "old" data.
  *
@@ -1127,7 +,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 * For background writeout, stop when we are below the
 * background dirty threshold
 */
-   if (work->for_background && !over_bground_thresh(wb))
+   if (work->for_background && !wb_over_bg_thresh(wb))
break;
 
/*
@@ -1218,7 +1202,7 @@ static unsigned long get_nr_dirty_pages(void)
 
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
-   if (over_bground_thresh(wb)) {
+   if (wb_over_bg_thresh(wb)) {
 
struct wb_writeback_work work = {
.nr_pages   = LONG_MAX,
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index ff627d6..fa6c3b4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -204,6 +204,7 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, 
unsigned long dirty);
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
+bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
void *data);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e9922f..99f8d02 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1740,6 +1740,29 @@ void balance_dirty_pages_ratelimited(struct 
address_space *mapping)
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
 
+/**
+ * wb_over_bg_thresh - does @wb need to be written back?
+ * @wb: bdi_writeback of interest
+ *
+ * Determines whether background writeback should keep writing @wb or it's
+ * clean enough.  Returns %true if writeback should continue.
+ */
+bool wb_over_bg_thresh(struct bdi_writeback *wb)
+{
+   unsigned long background_thresh, dirty_thresh;
+
+   global_dirty_limits(&background_thresh, &dirty_thresh);
+
+   if (global_page_state(NR_FILE_DIRTY) +
+   global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+   return true;
+
+   if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
+   return true;
+
+   return false;
+}
+
 void throttle_vm_writeout(gfp_t gfp_mask)
 {
unsigned long background_thresh;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 18/18] mm: vmscan: remove memcg stalling on writeback pages during direct reclaim

2015-03-22 Thread Tejun Heo
Because writeback wasn't cgroup aware before, the usual dirty
throttling mechanism in balance_dirty_pages() didn't work for
processes under a memcg limit.  The writeback path didn't know how much
memory was available or how fast the dirty pages were being written out
for a given memcg, and balance_dirty_pages() didn't have any measure of
IO back pressure for the memcg.

To work around the issue, memcg implemented an ad-hoc dirty throttling
mechanism in the direct reclaim path by stalling on pages under
writeback which are encountered during direct reclaim scan.  This is
rather ugly and crude - none of the configurability, fairness, or
bandwidth-proportional distribution of the normal path.

The previous patches implemented proper memcg aware dirty throttling
and the ad-hoc mechanism is no longer necessary.  Remove it.

Note: I removed the parts which seemed obvious and it behaves fine
  while testing but my understanding of this code path is
  rudimentary and it's quite possible that I got something wrong.
  Please let me know if I got something wrong or if more global_reclaim()
  sites should be updated.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
Cc: Vladimir Davydov 
---
 mm/vmscan.c | 109 ++--
 1 file changed, 33 insertions(+), 76 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f8d3c0..d084c95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -929,53 +929,24 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
nr_congested++;
 
/*
-* If a page at the tail of the LRU is under writeback, there
-* are three cases to consider.
-*
-* 1) If reclaim is encountering an excessive number of pages
-*under writeback and this page is both under writeback and
-*PageReclaim then it indicates that pages are being queued
-*for IO but are being recycled through the LRU before the
-*IO can complete. Waiting on the page itself risks an
-*indefinite stall if it is impossible to writeback the
-*page due to IO error or disconnected storage so instead
-*note that the LRU is being scanned too quickly and the
-*caller can stall after page list has been processed.
-*
-* 2) Global reclaim encounters a page, memcg encounters a
-*page that is not marked for immediate reclaim or
-*the caller does not have __GFP_IO. In this case mark
-*the page for immediate reclaim and continue scanning.
-*
-*__GFP_IO is checked  because a loop driver thread might
-*enter reclaim, and deadlock if it waits on a page for
-*which it is needed to do the write (loop masks off
-*__GFP_IO|__GFP_FS for this reason); but more thought
-*would probably show more reasons.
-*
-*Don't require __GFP_FS, since we're not going into the
-*FS, just waiting on its writeback completion. Worryingly,
-*ext4 gfs2 and xfs allocate pages with
-*grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
-*may_enter_fs here is liable to OOM on them.
-*
-* 3) memcg encounters a page that is not already marked
-*PageReclaim. memcg does not have any dirty pages
-*throttling so we could easily OOM just because too many
-*pages are in writeback and there is nothing else to
-*reclaim. Wait for the writeback to complete.
+* A page at the tail of the LRU is under writeback.  If
+* reclaim is encountering an excessive number of pages
+* under writeback and this page is both under writeback
+* and PageReclaim then it indicates that pages are being
+* queued for IO but are being recycled through the LRU
+* before the IO can complete.  Waiting on the page itself
+* risks an indefinite stall if it is impossible to
+* writeback the page due to IO error or disconnected
+* storage so instead note that the LRU is being scanned
+* too quickly and the caller can stall after page list has
+* been processed.
 */
if (PageWriteback(page)) {
-   /* Case 1 above */
if (current_is_kswapd() &&
PageReclaim(page) &&
test_bit(ZONE_WRITEBACK, &zone->flags)) {

[PATCH 14/18] writeback: update wb_over_bg_thresh() to use wb_domain aware operations

2015-03-22 Thread Tejun Heo
wb_over_bg_thresh() currently uses global_dirty_limits() and
wb_dirty_limit() both of which are wrappers around operations which
take dirty_throttle_control.  For cgroup writeback support, the
function will be updated to also consider memcg wb_domains which
requires the context information carried in dirty_throttle_control.

This patch updates wb_over_bg_thresh() so that it uses the underlying
wb_domain aware operations directly and builds the global
dirty_throttle_control in the process.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 99f8d02..2626e6c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1749,15 +1749,22 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
  */
 bool wb_over_bg_thresh(struct bdi_writeback *wb)
 {
-   unsigned long background_thresh, dirty_thresh;
+   struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
+   struct dirty_throttle_control * const gdtc = &gdtc_stor;
 
-   global_dirty_limits(&background_thresh, &dirty_thresh);
+   /*
+* Similar to balance_dirty_pages() but ignores pages being written
+* as we're trying to decide whether to put more under writeback.
+*/
+   gdtc->avail = global_dirtyable_memory();
+   gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS);
+   domain_dirty_limits(gdtc);
 
-   if (global_page_state(NR_FILE_DIRTY) +
-   global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+   if (gdtc->dirty > gdtc->bg_thresh)
return true;
 
-   if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
+   if (wb_stat(wb, WB_RECLAIMABLE) > __wb_dirty_limit(gdtc))
return true;
 
return false;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 02/18] writeback: reorganize [__]wb_update_bandwidth()

2015-03-22 Thread Tejun Heo
__wb_update_bandwidth() is called from two places -
mm/page-writeback.c::balance_dirty_pages() and
fs/fs-writeback.c::wb_writeback().  The latter updates only the
write bandwidth while the former also deals with the dirty ratelimit.
The two callsites are distinguished by whether @thresh parameter is
zero or not, which is cryptic.  In addition, the two files define
their own different versions of wb_update_bandwidth() on top of
__wb_update_bandwidth(), which is confusing to say the least.  This
patch cleans up [__]wb_update_bandwidth() in the following ways.

* __wb_update_bandwidth() now takes explicit @update_ratelimit
  parameter to gate dirty ratelimit handling.

* mm/page-writeback.c::wb_update_bandwidth() is flattened into its
  caller - balance_dirty_pages().

* fs/fs-writeback.c::wb_update_bandwidth() is moved to
  mm/page-writeback.c and __wb_update_bandwidth() is made static.

* While at it, add a lockdep assertion to __wb_update_bandwidth().

Except for the lockdep addition, this is pure reorganization and
doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 fs/fs-writeback.c | 10 --
 include/linux/writeback.h |  9 +
 mm/page-writeback.c   | 45 ++---
 3 files changed, 23 insertions(+), 41 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 890cff1..3d9b360 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1079,16 +1079,6 @@ static bool over_bground_thresh(struct bdi_writeback *wb)
 }
 
 /*
- * Called under wb->list_lock. If there are multiple wb per bdi,
- * only the flusher working on the first wb should do it.
- */
-static void wb_update_bandwidth(struct bdi_writeback *wb,
-   unsigned long start_time)
-{
-   __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time);
-}
-
-/*
  * Explicit flushing or periodic writeback of "old" data.
  *
  * Define "old": the first time one of an inode's pages is dirtied, we mark the
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 75349bb..82e0e39 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -154,14 +154,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, 
int,
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty);
 
-void __wb_update_bandwidth(struct bdi_writeback *wb,
-  unsigned long thresh,
-  unsigned long bg_thresh,
-  unsigned long dirty,
-  unsigned long bdi_thresh,
-  unsigned long bdi_dirty,
-  unsigned long start_time);
-
+void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fd441ea..d9ebabe 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1160,19 +1160,22 @@ static void wb_update_dirty_ratelimit(struct 
bdi_writeback *wb,
trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
 }
 
-void __wb_update_bandwidth(struct bdi_writeback *wb,
-  unsigned long thresh,
-  unsigned long bg_thresh,
-  unsigned long dirty,
-  unsigned long wb_thresh,
-  unsigned long wb_dirty,
-  unsigned long start_time)
+static void __wb_update_bandwidth(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long wb_thresh,
+ unsigned long wb_dirty,
+ unsigned long start_time,
+ bool update_ratelimit)
 {
unsigned long now = jiffies;
unsigned long elapsed = now - wb->bw_time_stamp;
unsigned long dirtied;
unsigned long written;
 
+   lockdep_assert_held(&wb->list_lock);
+
/*
 * rate-limit, only update once every 200ms.
 */
@@ -1189,7 +1192,7 @@ void __wb_update_bandwidth(struct bdi_writeback *wb,
if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
goto snapshot;
 
-   if (thresh) {
+   if (update_ratelimit) {
global_update_bandwidth(thresh, dirty, now);
wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty,
  wb_thresh, wb_dirty,
@@ -1203,20 +1206,9 @@ snapshot:
wb->bw_time_stamp = now;
 }
 
-static void wb_update_bandwidth(struct bdi_writeback *wb,
- 
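
(The remainder of this hunk is cut off above.  Going by the description,
the single surviving wrapper in mm/page-writeback.c presumably reduces to
a thin shim like the sketch below; it is still called under wb->list_lock,
which the new lockdep assertion enforces.  This is an illustration, not
the exact hunk.)

	/* called under wb->list_lock, as the old fs-writeback.c wrapper was */
	void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time)
	{
		__wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time, false);
	}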

[PATCH 11/18] writeback: make __wb_writeout_inc() and hard_dirty_limit() take wb_domain as a parameter

2015-03-22 Thread Tejun Heo
Currently __wb_writeout_inc() and hard_dirty_limit() assume
global_wb_domain; however, cgroup writeback support requires
considering per-memcg wb_domain too.

This patch separates out domain-specific part of __wb_writeout_inc()
into wb_domain_writeout_inc() which takes wb_domain as a parameter and
adds the parameter to hard_dirty_limit().  This will allow these two
functions to handle per-memcg wb_domains.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 mm/page-writeback.c | 37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 840b8f2..e6c7572 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -445,17 +445,12 @@ static unsigned long wp_next_time(unsigned long cur_time)
return cur_time;
 }
 
-/*
- * Increment the wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
- */
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static void wb_domain_writeout_inc(struct wb_domain *dom,
+  struct fprop_local_percpu *completions,
+  unsigned int max_prop_frac)
 {
-   struct wb_domain *dom = &global_wb_domain;
-
-   __inc_wb_stat(wb, WB_WRITTEN);
-   __fprop_inc_percpu_max(&dom->completions, &wb->completions,
-  wb->bdi->max_prop_frac);
+   __fprop_inc_percpu_max(&dom->completions, completions,
+  max_prop_frac);
/* First event after period switching was turned off? */
if (!unlikely(dom->period_time)) {
/*
@@ -469,6 +464,17 @@ static inline void __wb_writeout_inc(struct bdi_writeback 
*wb)
}
 }
 
+/*
+ * Increment @wb's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ */
+static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+{
+   __inc_wb_stat(wb, WB_WRITTEN);
+   wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
+  wb->bdi->max_prop_frac);
+}
+
 void wb_writeout_inc(struct bdi_writeback *wb)
 {
unsigned long flags;
@@ -571,10 +577,9 @@ static unsigned long dirty_freerun_ceiling(unsigned long 
thresh,
return (thresh + bg_thresh) / 2;
 }
 
-static unsigned long hard_dirty_limit(unsigned long thresh)
+static unsigned long hard_dirty_limit(struct wb_domain *dom,
+ unsigned long thresh)
 {
-   struct wb_domain *dom = &global_wb_domain;
-
return max(thresh, dom->dirty_limit);
 }
 
@@ -744,7 +749,7 @@ static void wb_position_ratio(struct dirty_throttle_control 
*dtc)
struct bdi_writeback *wb = dtc->wb;
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, 
dtc->bg_thresh);
-   unsigned long limit = hard_dirty_limit(dtc->thresh);
+   unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
unsigned long wb_thresh = dtc->wb_thresh;
unsigned long x_intercept;
unsigned long setpoint; /* dirty pages' target balance point */
@@ -1029,7 +1034,7 @@ static void wb_update_dirty_ratelimit(struct 
dirty_throttle_control *dtc,
struct bdi_writeback *wb = dtc->wb;
unsigned long dirty = dtc->dirty;
unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, 
dtc->bg_thresh);
-   unsigned long limit = hard_dirty_limit(dtc->thresh);
+   unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
unsigned long setpoint = (freerun + limit) / 2;
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long dirty_ratelimit = wb->dirty_ratelimit;
@@ -1681,7 +1686,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
 for ( ; ; ) {
        global_dirty_limits(&background_thresh, &dirty_thresh);
-   dirty_thresh = hard_dirty_limit(dirty_thresh);
+   dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
 
 /*
  * Boost the allowable dirty threshold a bit for page
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 16/18] writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes

2015-03-22 Thread Tejun Heo
The amount of available memory to a memcg wb_domain can change as
memcg configuration changes.  A domain's ->dirty_limit exists to
smooth out sudden drops in dirty threshold; however, when a domain's
size actually drops significantly, it hinders the dirty throttling
from adjusting to the new configuration leading to unexpected
behaviors including unnecessary OOM kills.

This patch resolves the issue by adding wb_domain_size_changed() which
resets ->dirty_limit[_tstmp] and making memcg call it on configuration
changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 include/linux/writeback.h | 20 
 mm/memcontrol.c   | 12 
 2 files changed, 32 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e421625..9ae0648 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -132,6 +132,26 @@ struct wb_domain {
unsigned long dirty_limit;
 };
 
+/**
+ * wb_domain_size_changed - memory available to a wb_domain has changed
+ * @dom: wb_domain of interest
+ *
+ * This function should be called when the amount of memory available to
+ * @dom has changed.  It resets @dom's dirty limit parameters to prevent
+ * the past values which don't match the current configuration from skewing
+ * dirty throttling.  Without this, when memory size of a wb_domain is
+ * greatly reduced, the dirty throttling logic may allow too many pages to
+ * be dirtied leading to consecutive unnecessary OOMs and may get stuck in
+ * that situation.
+ */
+static inline void wb_domain_size_changed(struct wb_domain *dom)
+{
+   spin_lock(&dom->lock);
+   dom->dirty_limit_tstamp = jiffies;
+   dom->dirty_limit = 0;
+   spin_unlock(&dom->lock);
+}
+
 /*
  * fs/fs-writeback.c
  */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2a74cf3..108acfc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4096,6 +4096,11 @@ static void memcg_wb_domain_exit(struct mem_cgroup 
*memcg)
        wb_domain_exit(&memcg->cgwb_domain);
 }
 
+static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
+{
+   wb_domain_size_changed(&memcg->cgwb_domain);
+}
+
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 {
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
@@ -4117,6 +4122,10 @@ static void memcg_wb_domain_exit(struct mem_cgroup 
*memcg)
 {
 }
 
+static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
+{
+}
+
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
 /*
@@ -4715,6 +4724,7 @@ static void mem_cgroup_css_reset(struct 
cgroup_subsys_state *css)
memcg->low = 0;
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
+   memcg_wb_domain_size_changed(memcg);
 }
 
 #ifdef CONFIG_MMU
@@ -5347,6 +5357,7 @@ static ssize_t memory_high_write(struct kernfs_open_file 
*of,
 
memcg->high = high;
 
+   memcg_wb_domain_size_changed(memcg);
return nbytes;
 }
 
@@ -5379,6 +5390,7 @@ static ssize_t memory_max_write(struct kernfs_open_file 
*of,
if (err)
return err;
 
+   memcg_wb_domain_size_changed(memcg);
return nbytes;
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 15/18] writeback: implement memcg wb_domain

2015-03-22 Thread Tejun Heo
Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportionately sized portion of the memory in the
whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg.  IOW, a wb will belong to two writeback domains - the global
and memcg domains.

The previous patches laid the groundwork to support the two wb_domains
and this patch implements memcg wb_domain.  memcg->cgwb_domain is
initialized on css online and destroyed on css release,
wb->memcg_completions is added, and __wb_writeout_inc() is updated to
increment completions against both global and memcg wb_domains.

The following patches will update balance_dirty_pages() and its
subroutines to actually consider memcg wb_domain for throttling.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Greg Thelen 
---
 include/linux/backing-dev-defs.h |  1 +
 include/linux/memcontrol.h   | 12 +++-
 include/linux/writeback.h|  3 +++
 mm/backing-dev.c |  9 -
 mm/memcontrol.c  | 39 +++
 mm/page-writeback.c  | 25 +
 6 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 97a92fa..8d470b7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -118,6 +118,7 @@ struct bdi_writeback {
 
 #ifdef CONFIG_CGROUP_WRITEBACK
struct percpu_ref refcnt;   /* used only for !root wb's */
+   struct fprop_local_percpu memcg_completions;
struct cgroup_subsys_state *memcg_css; /* the associated memcg */
struct cgroup_subsys_state *blkcg_css; /* and blkcg */
struct list_head memcg_node;/* anchored at memcg->cgwb_list */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 662a953..e3177be 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -389,8 +389,18 @@ enum {
 };
 
 #ifdef CONFIG_CGROUP_WRITEBACK
+
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
-#endif
+struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
+
+#else  /* CONFIG_CGROUP_WRITEBACK */
+
+static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
+{
+   return NULL;
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fa6c3b4..e421625 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -167,6 +167,9 @@ static inline void laptop_sync_completion(void) { }
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
+#ifdef CONFIG_CGROUP_WRITEBACK
+void wb_domain_exit(struct wb_domain *dom);
+#endif
 
 extern struct wb_domain global_wb_domain;
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 331e4d7..8828edf 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -483,6 +483,7 @@ static void cgwb_release_workfn(struct work_struct *work)
css_put(wb->blkcg_css);
wb_congested_put(wb->congested);
 
+   fprop_local_destroy_percpu(&wb->memcg_completions);
        percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
kfree_rcu(wb, rcu);
@@ -549,9 +550,13 @@ static int cgwb_create(struct backing_dev_info *bdi,
if (ret)
goto err_wb_exit;
 
+   ret = fprop_local_init_percpu(&wb->memcg_completions, gfp);
+   if (ret)
+   goto err_ref_exit;
+
wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp);
if (!wb->congested)
-   goto err_ref_exit;
+   goto err_fprop_exit;
 
wb->memcg_css = memcg_css;
wb->blkcg_css = blkcg_css;
@@ -588,6 +593,8 @@ static int cgwb_create(struct backing_dev_info *bdi,
 
 err_put_congested:
wb_congested_put(wb->congested);
+err_fprop_exit:
+   fprop_local_destroy_percpu(&wb->memcg_completions);
 err_ref_exit:
        percpu_ref_exit(&wb->refcnt);
 err_wb_exit:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab483e9..2a74cf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -344,6 +344,7 @@ struct mem_cgroup {
 
 #ifdef CONFIG_CGROUP_WRITEBACK
struct list_head cgwb_list;
+   struct wb_domain cgwb_domain;
 #endif
 
/* List of events which userspace want to receive */
@@ -4085,6 +4086,37 @@ struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup 
*memcg)
        return &memcg->cgwb_list;
 }
 
+static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
+{
+   return wb_domain_init(&memcg->cgwb_domain, gfp);
+}
+
+static void memcg_wb_domain_exit(struct 
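
(The diff is truncated above before the mm/page-writeback.c hunk.  Per the
description, __wb_writeout_inc() presumably ends up charging both writeback
domains, roughly as sketched below; mem_cgroup_wb_domain() and
wb->memcg_completions are the additions shown earlier in this patch.  This
is an illustration, not the exact hunk.)

	static inline void __wb_writeout_inc(struct bdi_writeback *wb)
	{
		struct wb_domain *cgdom;

		__inc_wb_stat(wb, WB_WRITTEN);
		wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
				       wb->bdi->max_prop_frac);

		cgdom = mem_cgroup_wb_domain(wb);
		if (cgdom)
			wb_domain_writeout_inc(cgdom, &wb->memcg_completions,
					       wb->bdi->max_prop_frac);
	}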

[PATCH 25/48] writeback: make congestion functions per bdi_writeback

2015-03-22 Thread Tejun Heo
Currently, all congestion functions take bdi (backing_dev_info) and
always operate on the root wb (bdi->wb) and the congestion state from
the block layer is propagated only for the root blkcg.  This patch
introduces {set|clear}_wb_congested() and wb_congested() which take a
bdi_writeback_congested and bdi_writeback respectively.  The bdi
counterparts are now wrappers invoking the wb-based functions on
@bdi->wb.

While converting clear_bdi_congested() to clear_wb_congested(), the
local variable declaration order between @wqh and @bit is swapped for
cosmetic reason.

This patch just adds the new wb based functions.  The following
patches will apply them.

v2: Updated for bdi_writeback_congested.
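
As a rough before/after illustration (hypothetical call site, not taken
from this patch):

	/* before: always flips the congestion state of the root wb */
	set_bdi_congested(bdi, BLK_RW_ASYNC);

	/* after: flips the state of the wb actually handling the IO */
	set_wb_congested(wb->congested, BLK_RW_ASYNC);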

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 include/linux/backing-dev-defs.h | 14 +++--
 include/linux/backing-dev.h  | 45 +++-
 mm/backing-dev.c | 22 ++--
 3 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index a1e9c40..eb38676 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -163,7 +163,17 @@ enum {
BLK_RW_SYNC = 1,
 };
 
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
+void clear_wb_congested(struct bdi_writeback_congested *congested, int sync);
+void set_wb_congested(struct bdi_writeback_congested *congested, int sync);
+
+static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+   clear_wb_congested(bdi->wb.congested, sync);
+}
+
+static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+   set_wb_congested(bdi->wb.congested, sync);
+}
 
 #endif /* __LINUX_BACKING_DEV_DEFS_H */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 8ae59df..2c498a2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,27 +167,13 @@ static inline struct backing_dev_info 
*inode_to_bdi(struct inode *inode)
return sb->s_bdi;
 }
 
-static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
+static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 {
-   if (bdi->congested_fn)
-   return bdi->congested_fn(bdi->congested_data, bdi_bits);
-   return (bdi->wb.congested->state & bdi_bits);
-}
-
-static inline int bdi_read_congested(struct backing_dev_info *bdi)
-{
-   return bdi_congested(bdi, 1 << WB_sync_congested);
-}
-
-static inline int bdi_write_congested(struct backing_dev_info *bdi)
-{
-   return bdi_congested(bdi, 1 << WB_async_congested);
-}
+   struct backing_dev_info *bdi = wb->bdi;
 
-static inline int bdi_rw_congested(struct backing_dev_info *bdi)
-{
-   return bdi_congested(bdi, (1 << WB_sync_congested) |
- (1 << WB_async_congested));
+   if (bdi->congested_fn)
+   return bdi->congested_fn(bdi->congested_data, cong_bits);
+   return wb->congested->state & cong_bits;
 }
 
 long congestion_wait(int sync, long timeout);
@@ -454,4 +440,25 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
+static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
+{
+   return wb_congested(&bdi->wb, cong_bits);
+}
+
+static inline int bdi_read_congested(struct backing_dev_info *bdi)
+{
+   return bdi_congested(bdi, 1 << WB_sync_congested);
+}
+
+static inline int bdi_write_congested(struct backing_dev_info *bdi)
+{
+   return bdi_congested(bdi, 1 << WB_async_congested);
+}
+
+static inline int bdi_rw_congested(struct backing_dev_info *bdi)
+{
+   return bdi_congested(bdi, (1 << WB_sync_congested) |
+ (1 << WB_async_congested));
+}
+
 #endif /* _LINUX_BACKING_DEV_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9d5a75e..7721e7a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -897,31 +897,31 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
-static atomic_t nr_bdi_congested[2];
+static atomic_t nr_wb_congested[2];
 
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+void clear_wb_congested(struct bdi_writeback_congested *congested, int sync)
 {
-   enum wb_state bit;
        wait_queue_head_t *wqh = &congestion_wqh[sync];
+   enum wb_state bit;
 
bit = sync ? WB_sync_congested : WB_async_congested;
-   if (test_and_clear_bit(bit, &bdi->wb.congested->state))
-   atomic_dec(&nr_bdi_congested[sync]);
+   if (test_and_clear_bit(bit, &congested->state))
+   atomic_dec(&nr_wb_congested[sync]);
smp_mb__after_atomic();
if (waitqueue_active(wqh))
wake_up(wqh);
 }

[PATCH 24/48] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback

2015-03-22 Thread Tejun Heo
Currently, balance_dirty_pages() always work on bdi->wb.  This patch
updates it to work on the wb (bdi_writeback) matching memcg and blkcg
of the current task as that's what the inode is being dirtied against.

balance_dirty_pages_ratelimited() now pins the current wb and passes
it to balance_dirty_pages().

As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
visible behavior differences.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 mm/page-writeback.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d5635cf..bfbd8d2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1337,6 +1337,7 @@ static inline void wb_dirty_limits(struct bdi_writeback 
*wb,
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
+   struct bdi_writeback *wb,
unsigned long pages_dirtied)
 {
unsigned long nr_reclaimable;   /* = file_dirty + unstable_nfs */
@@ -1352,8 +1353,7 @@ static void balance_dirty_pages(struct address_space 
*mapping,
unsigned long task_ratelimit;
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
-   struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-   struct bdi_writeback *wb = &bdi->wb;
+   struct backing_dev_info *bdi = wb->bdi;
bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
unsigned long start_time = jiffies;
 
@@ -1575,14 +1575,20 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
  */
 void balance_dirty_pages_ratelimited(struct address_space *mapping)
 {
-   struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-   struct bdi_writeback *wb = &bdi->wb;
+   struct inode *inode = mapping->host;
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+   struct bdi_writeback *wb = NULL;
int ratelimit;
int *p;
 
if (!bdi_cap_account_dirty(bdi))
return;
 
+   if (inode_cgwb_enabled(inode))
+   wb = wb_get_create_current(bdi, GFP_KERNEL);
+   if (!wb)
+   wb = &bdi->wb;
+
ratelimit = current->nr_dirtied_pause;
if (wb->dirty_exceeded)
ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
@@ -1616,7 +1622,9 @@ void balance_dirty_pages_ratelimited(struct address_space 
*mapping)
preempt_enable();
 
if (unlikely(current->nr_dirtied >= ratelimit))
-   balance_dirty_pages(mapping, current->nr_dirtied);
+   balance_dirty_pages(mapping, wb, current->nr_dirtied);
+
+   wb_put(wb);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 22/48] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested

2015-03-22 Thread Tejun Heo
A blkg (blkcg_gq) can be congested and decongested independently from
other blkgs on the same request_queue.  Accordingly, for cgroup
writeback support, the congestion status at bdi (backing_dev_info)
should be split and updated separately from matching blkg's.

This patch prepares by adding blkg->wb_congested and associating a
blkg with its matching per-blkcg bdi_writeback_congested on creation.

v2: Updated to associate bdi_writeback_congested instead of
bdi_writeback.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Vivek Goyal 
---
 block/blk-cgroup.c | 17 +++--
 include/linux/blk-cgroup.h |  6 ++
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d2b1cbf..8b6372b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -182,6 +182,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
struct blkcg_gq *new_blkg)
 {
struct blkcg_gq *blkg;
+   struct bdi_writeback_congested *wb_congested;
int i, ret;
 
WARN_ON_ONCE(!rcu_read_lock_held());
@@ -193,22 +194,30 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
goto err_free_blkg;
}
 
+   wb_congested = wb_congested_get_create(&q->backing_dev_info,
+  blkcg->css.id, GFP_ATOMIC);
+   if (!wb_congested) {
+   ret = -ENOMEM;
+   goto err_put_css;
+   }
+
/* allocate */
if (!new_blkg) {
new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
if (unlikely(!new_blkg)) {
ret = -ENOMEM;
-   goto err_put_css;
+   goto err_put_congested;
}
}
blkg = new_blkg;
+   blkg->wb_congested = wb_congested;
 
/* link parent */
if (blkcg_parent(blkcg)) {
blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
if (WARN_ON_ONCE(!blkg->parent)) {
ret = -EINVAL;
-   goto err_put_css;
+   goto err_put_congested;
}
blkg_get(blkg->parent);
}
@@ -250,6 +259,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_put(blkg);
return ERR_PTR(ret);
 
+err_put_congested:
+   wb_congested_put(wb_congested);
 err_put_css:
        css_put(&blkcg->css);
 err_free_blkg:
@@ -405,6 +416,8 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
if (blkg->parent)
blkg_put(blkg->parent);
 
+   wb_congested_put(blkg->wb_congested);
+
blkg_free(blkg);
 }
 EXPORT_SYMBOL_GPL(__blkg_release_rcu);
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 3033eb1..07a32b8 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -99,6 +99,12 @@ struct blkcg_gq {
struct hlist_node   blkcg_node;
struct blkcg*blkcg;
 
+   /*
+* Each blkg gets congested separately and the congestion state is
+* propagated to the matching bdi_writeback_congested.
+*/
+   struct bdi_writeback_congested  *wb_congested;
+
/* all non-root blkcg_gq's are guaranteed to have access to parent */
struct blkcg_gq *parent;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 28/48] writeback: implement and use mapping_congested()

2015-03-22 Thread Tejun Heo
In several places, bdi_congested() and its wrappers are used to
determine whether more IOs should be issued.  With cgroup writeback
support, this question can't be answered solely based on the bdi
(backing_dev_info).  It's dependent on whether the filesystem and bdi
support cgroup writeback and the blkcg the asking task belongs to.

This patch implements mapping_congested() and its wrappers which take
@mapping and @task and determines the congestion state considering
cgroup writeback for the combination.  The new functions replace
bdi_*congested() calls in places where the query is about specific
mapping and task.

There are several filesystem users which also fit this criterion, but
they should be updated when each filesystem implements cgroup
writeback support.
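
As a rough illustration of the intended conversion (hypothetical call
site; bdi_write_congested() and mapping_write_congested() are both defined
in this series):

	/* before: only reflects the congestion state of the root wb */
	if (bdi_write_congested(inode_to_bdi(mapping->host)))
		return;

	/*
	 * after: reflects the wb serving the asking task's blkcg when
	 * cgroup writeback is enabled on the inode
	 */
	if (mapping_write_congested(mapping, current))
		return;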

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Vivek Goyal 
---
 fs/fs-writeback.c   | 39 +++
 include/linux/backing-dev.h | 27 +++
 mm/fadvise.c|  2 +-
 mm/readahead.c  |  2 +-
 mm/vmscan.c | 12 ++--
 5 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 48db5e6..015f359 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -130,6 +130,45 @@ static void __wb_start_writeback(struct bdi_writeback *wb, 
long nr_pages,
wb_queue_work(wb, work);
 }
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * mapping_congested - test whether a mapping is congested for a task
+ * @mapping: address space to test for congestion
+ * @task: task to test congestion for
+ * @cong_bits: mask of WB_[a]sync_congested bits to test
+ *
+ * Tests whether @mapping is congested for @task.  @cong_bits is the mask
+ * of congestion bits to test and the return value is the mask of set bits.
+ *
+ * If cgroup writeback is enabled for @mapping, its congestion state for
+ * @task is determined by whether the cgwb (cgroup bdi_writeback) for the
+ * blkcg of %current on @mapping->backing_dev_info is congested; otherwise,
+ * the root's congestion state is used.
+ */
+int mapping_congested(struct address_space *mapping,
+ struct task_struct *task, int cong_bits)
+{
+   struct inode *inode = mapping->host;
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+   struct bdi_writeback *wb;
+   int ret = 0;
+
+   if (!inode || !inode_cgwb_enabled(inode))
+   return wb_congested(&bdi->wb, cong_bits);
+
+   rcu_read_lock();
+   wb = wb_find_current(bdi);
+   if (wb)
+   ret = wb_congested(wb, cong_bits);
+   rcu_read_unlock();
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(mapping_congested);
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
 /**
  * bdi_start_writeback - start writeback
  * @bdi: the backing device to write from
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2c498a2..cfa23ab 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -230,6 +230,8 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info 
*bdi,
 void __inode_attach_wb(struct inode *inode, struct page *page);
 void wb_memcg_offline(struct mem_cgroup *memcg);
 void wb_blkcg_offline(struct blkcg *blkcg);
+int mapping_congested(struct address_space *mapping, struct task_struct *task,
+ int cong_bits);
 
 /**
  * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
@@ -438,8 +440,33 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 {
 }
 
+static inline int mapping_congested(struct address_space *mapping,
+   struct task_struct *task, int cong_bits)
+{
+   return wb_congested(&inode_to_bdi(mapping->host)->wb, cong_bits);
+}
+
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
+static inline int mapping_read_congested(struct address_space *mapping,
+struct task_struct *task)
+{
+   return mapping_congested(mapping, task, 1 << WB_sync_congested);
+}
+
+static inline int mapping_write_congested(struct address_space *mapping,
+ struct task_struct *task)
+{
+   return mapping_congested(mapping, task, 1 << WB_async_congested);
+}
+
+static inline int mapping_rw_congested(struct address_space *mapping,
+  struct task_struct *task)
+{
+   return mapping_congested(mapping, task, (1 << WB_sync_congested) |
+   (1 << WB_async_congested));
+}
+
 static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
 {
        return wb_congested(&bdi->wb, cong_bits);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4a3907c..174727c 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -115,7 +115,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, 
loff_t, len, int, advice)
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
-   

[PATCH 30/48] writeback: implement backing_dev_info->tot_write_bandwidth

2015-03-22 Thread Tejun Heo
cgroup writeback support needs to keep track of the sum of
avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
distribute write workload.  This patch adds bdi->tot_write_bandwidth
and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
and wb_update_write_bandwidth() to adjust it as wb's gain and lose
dirty inodes and its avg_write_bandwidth gets updated.

As the update events are not synchronized with each other,
bdi->tot_write_bandwidth is an atomic_long_t.
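
For a sense of how the aggregate is meant to be consumed (illustrative
only; the real users arrive in later patches), a wb's share of some
bdi-wide quantity can be scaled by its fraction of the total bandwidth:

	unsigned long this_bw = wb->avg_write_bandwidth;
	unsigned long tot_bw = atomic_long_read(&bdi->tot_write_bandwidth);
	unsigned long long share = bdi_wide_value;	/* hypothetical input */

	if (tot_bw) {
		share *= this_bw;
		do_div(share, tot_bw);			/* this wb's portion */
	}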

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c| 7 ++-
 include/linux/backing-dev-defs.h | 2 ++
 mm/page-writeback.c  | 3 +++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dc4e399..9d85f59 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -87,6 +87,8 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
return false;
} else {
        set_bit(WB_has_dirty_io, &wb->state);
+   atomic_long_add(wb->avg_write_bandwidth,
+   &wb->bdi->tot_write_bandwidth);
return true;
}
 }
@@ -94,8 +96,11 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
 static void wb_io_lists_depopulated(struct bdi_writeback *wb)
 {
        if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
-   list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
+   list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
        clear_bit(WB_has_dirty_io, &wb->state);
+   atomic_long_sub(wb->avg_write_bandwidth,
+   &wb->bdi->tot_write_bandwidth);
+   }
 }
 
 /**
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 7a94b78..d631a61 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -142,6 +142,8 @@ struct backing_dev_info {
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
 
+   atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
+
struct bdi_writeback wb;  /* the root writeback info for this bdi */
struct bdi_writeback_congested wb_congested; /* its congested state */
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bfbd8d2..813e820 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -881,6 +881,9 @@ static void wb_update_write_bandwidth(struct bdi_writeback 
*wb,
avg += (old - avg) >> 3;
 
 out:
+   if (wb_has_dirty_io(wb))
+   atomic_long_add(avg - wb->avg_write_bandwidth,
+   &wb->bdi->tot_write_bandwidth);
wb->write_bandwidth = bw;
wb->avg_write_bandwidth = avg;
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 29/48] writeback: implement WB_has_dirty_io wb_state flag

2015-03-22 Thread Tejun Heo
Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
has any dirty inode by testing all three IO lists on each invocation
without actively keeping track.  For cgroup writeback support, a
single bdi will host multiple wb's each of which will host dirty
inodes separately and we'll need to make bdi_has_dirty_io(), which
currently only represents the root wb, aggregate has_dirty_io from all
member wb's, which requires tracking transitions in has_dirty_io state
on each wb.

This patch introduces inode_wb_list_{move|del}_locked() to consolidate
IO list operations leaving queue_io() the only other function which
directly manipulates IO lists (via move_expired_inodes()).  All three
functions are updated to call wb_io_lists_[de]populated() which keep
track of whether the wb has dirty inodes or not and record it using
the new WB_has_dirty_io flag.  inode_wb_list_move_locked()'s return
value indicates whether the wb had no dirty inodes before.

mark_inode_dirty() is restructured so that the return value of
inode_wb_list_move_locked() can be used for deciding whether to wake
up the wb.
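
A sketch of what that restructured tail looks like (fragment only, not the
exact hunk; the surrounding locking and I_DIRTY handling are unchanged and
the variable names are assumed from context):

	bool wakeup_bdi;
	struct list_head *dirty_list = &bdi->wb.b_dirty;

	if (dirtytime)
		dirty_list = &bdi->wb.b_dirty_time;

	wakeup_bdi = inode_wb_list_move_locked(inode, &bdi->wb, dirty_list);
	spin_unlock(&bdi->wb.list_lock);

	/* only the first dirty inode needs to kick the flusher */
	if (wakeup_bdi)
		wb_wakeup_delayed(&bdi->wb);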

While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
These functions were returning 0 and 1 before.  Also, add a comment
explaining the synchronization of wb_state flags.

v2: Updated to accommodate b_dirty_time.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c| 104 ++-
 include/linux/backing-dev-defs.h |   1 +
 include/linux/backing-dev.h  |   8 ++-
 mm/backing-dev.c |   2 +-
 4 files changed, 86 insertions(+), 29 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 015f359..dc4e399 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -81,6 +81,66 @@ static inline struct inode *wb_inode(struct list_head *head)
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
 
+static bool wb_io_lists_populated(struct bdi_writeback *wb)
+{
+   if (wb_has_dirty_io(wb)) {
+   return false;
+   } else {
+   set_bit(WB_has_dirty_io, &wb->state);
+   return true;
+   }
+}
+
+static void wb_io_lists_depopulated(struct bdi_writeback *wb)
+{
+   if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
+   list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
+   clear_bit(WB_has_dirty_io, &wb->state);
+}
+
+/**
+ * inode_wb_list_move_locked - move an inode onto a bdi_writeback IO list
+ * @inode: inode to be moved
+ * @wb: target bdi_writeback
+ * @head: one of @wb->b_{dirty|io|more_io}
+ *
+ * Move @inode->i_wb_list to @list of @wb and set %WB_has_dirty_io.
+ * Returns %true if @inode is the first occupant of the !dirty_time IO
+ * lists; otherwise, %false.
+ */
+static bool inode_wb_list_move_locked(struct inode *inode,
+ struct bdi_writeback *wb,
+ struct list_head *head)
+{
+   assert_spin_locked(&wb->list_lock);
+
+   list_move(&inode->i_wb_list, head);
+
+   /* dirty_time doesn't count as dirty_io until expiration */
+   if (head != &wb->b_dirty_time)
+   return wb_io_lists_populated(wb);
+
+   wb_io_lists_depopulated(wb);
+   return false;
+}
+
+/**
+ * inode_wb_list_del_locked - remove an inode from its bdi_writeback IO list
+ * @inode: inode to be removed
+ * @wb: bdi_writeback @inode is being removed from
+ *
+ * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and
+ * clear %WB_has_dirty_io if all are empty afterwards.
+ */
+static void inode_wb_list_del_locked(struct inode *inode,
+struct bdi_writeback *wb)
+{
+   assert_spin_locked(&wb->list_lock);
+
+   list_del_init(&inode->i_wb_list);
+   wb_io_lists_depopulated(wb);
+}
+
 static void wb_wakeup(struct bdi_writeback *wb)
 {
        spin_lock_bh(&wb->work_lock);
@@ -215,7 +275,7 @@ void inode_wb_list_del(struct inode *inode)
struct bdi_writeback *wb = inode_to_wb(inode);
 
        spin_lock(&wb->list_lock);
-   list_del_init(&inode->i_wb_list);
+   inode_wb_list_del_locked(inode, wb);
        spin_unlock(&wb->list_lock);
 }
 
@@ -230,7 +290,6 @@ void inode_wb_list_del(struct inode *inode)
  */
 static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
-   assert_spin_locked(&wb->list_lock);
        if (!list_empty(&wb->b_dirty)) {
struct inode *tail;
 
@@ -238,7 +297,7 @@ static void redirty_tail(struct inode *inode, struct 
bdi_writeback *wb)
if (time_before(inode->dirtied_when, tail->dirtied_when))
inode->dirtied_when = jiffies;
}
-   list_move(&inode->i_wb_list, &wb->b_dirty);
+   inode_wb_list_move_locked(inode, wb, &wb->b_dirty);
 }
 
 /*
@@ -246,8 +305,7 @@ static void redirty_tail(struct inode *inode, struct 
bdi_writeback *wb)
  */
 static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
-   assert_spin_locked(&wb->list_lock);
-   

[PATCH] xfs: use GFP_NOFS argument in radix_tree_preload

2015-03-22 Thread Sanidhya Kashyap
From: Byoungyoung Lee 

Following the convention of other file systems, GFP_NOFS
should be used as an argument for radix_tree_preload() instead
of GFP_KERNEL.

Signed-off-by: Byoungyoung Lee 
Signed-off-by: Sanidhya Kashyap 
---
 fs/xfs/xfs_mru_cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mru_cache.c b/fs/xfs/xfs_mru_cache.c
index 30ecca3..f8a674d 100644
--- a/fs/xfs/xfs_mru_cache.c
+++ b/fs/xfs/xfs_mru_cache.c
@@ -437,7 +437,7 @@ xfs_mru_cache_insert(
if (!mru || !mru->lists)
return -EINVAL;
 
-   if (radix_tree_preload(GFP_KERNEL))
+   if (radix_tree_preload(GFP_NOFS))
return -ENOMEM;
 
        INIT_LIST_HEAD(&elem->list_node);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 26/48] writeback, blkcg: restructure blk_{set|clear}_queue_congested()

2015-03-22 Thread Tejun Heo
blk_{set|clear}_queue_congested() take @q and set or clear,
respectively, the congestion state of its bdi's root wb.  Because bdi
used to be able to handle congestion state only on the root wb, the
callers of those functions tested whether the congestion is on the
root blkcg and skipped if not.

This is cumbersome and makes implementation of per cgroup
bdi_writeback congestion state propagation difficult.  This patch
renames blk_{set|clear}_queue_congested() to
blk_{set|clear}_congested(), and makes them take request_list instead
of request_queue and test whether the specified request_list is the
root one before updating bdi_writeback congestion state.  This makes
the tests in the callers unnecessary and simplifies them.

As there are no external users of these functions, the definitions are
moved from include/linux/blkdev.h to block/blk-core.c.

This patch doesn't introduce any noticeable behavior difference.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Vivek Goyal 
---
 block/blk-core.c   | 62 ++
 include/linux/blkdev.h | 19 
 2 files changed, 37 insertions(+), 44 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c44018a..cad26e3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -63,6 +63,28 @@ struct kmem_cache *blk_requestq_cachep;
  */
 static struct workqueue_struct *kblockd_workqueue;
 
+static void blk_clear_congested(struct request_list *rl, int sync)
+{
+   if (rl != &rl->q->root_rl)
+   return;
+#ifdef CONFIG_CGROUP_WRITEBACK
+   clear_wb_congested(rl->blkg->wb_congested, sync);
+#else
+   clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+#endif
+}
+
+static void blk_set_congested(struct request_list *rl, int sync)
+{
+   if (rl != &rl->q->root_rl)
+   return;
+#ifdef CONFIG_CGROUP_WRITEBACK
+   set_wb_congested(rl->blkg->wb_congested, sync);
+#else
+   set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+#endif
+}
+
 void blk_queue_congestion_threshold(struct request_queue *q)
 {
int nr;
@@ -827,13 +849,8 @@ static void __freed_request(struct request_list *rl, int 
sync)
 {
struct request_queue *q = rl->q;
 
-   /*
-* bdi isn't aware of blkcg yet.  As all async IOs end up root
-* blkcg anyway, just use root blkcg state.
-*/
-   if (rl == &q->root_rl &&
-   rl->count[sync] < queue_congestion_off_threshold(q))
-   blk_clear_queue_congested(q, sync);
+   if (rl->count[sync] < queue_congestion_off_threshold(q))
+   blk_clear_congested(rl, sync);
 
if (rl->count[sync] + 1 <= q->nr_requests) {
        if (waitqueue_active(&rl->wait[sync]))
@@ -866,25 +883,25 @@ static void freed_request(struct request_list *rl, 
unsigned int flags)
 int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
struct request_list *rl;
+   int on_thresh, off_thresh;
 
spin_lock_irq(q->queue_lock);
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
+   on_thresh = queue_congestion_on_threshold(q);
+   off_thresh = queue_congestion_off_threshold(q);
 
-   /* congestion isn't cgroup aware and follows root blkcg for now */
-   rl = &q->root_rl;
-
-   if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
-   blk_set_queue_congested(q, BLK_RW_SYNC);
-   else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
-   blk_clear_queue_congested(q, BLK_RW_SYNC);
+   blk_queue_for_each_rl(rl, q) {
+   if (rl->count[BLK_RW_SYNC] >= on_thresh)
+   blk_set_congested(rl, BLK_RW_SYNC);
+   else if (rl->count[BLK_RW_SYNC] < off_thresh)
+   blk_clear_congested(rl, BLK_RW_SYNC);
 
-   if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
-   blk_set_queue_congested(q, BLK_RW_ASYNC);
-   else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
-   blk_clear_queue_congested(q, BLK_RW_ASYNC);
+   if (rl->count[BLK_RW_ASYNC] >= on_thresh)
+   blk_set_congested(rl, BLK_RW_ASYNC);
+   else if (rl->count[BLK_RW_ASYNC] < off_thresh)
+   blk_clear_congested(rl, BLK_RW_ASYNC);
 
-   blk_queue_for_each_rl(rl, q) {
if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
blk_set_rl_full(rl, BLK_RW_SYNC);
} else {
@@ -994,12 +1011,7 @@ static struct request *__get_request(struct request_list 
*rl, int rw_flags,
}
}
}
-   /*
-* bdi isn't aware of blkcg yet.  As all async IOs end up
-* root blkcg anyway, just use root blkcg state.
-*/
-   if (rl == &q->root_rl)
-   
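
(The tail of this hunk is cut off.  Following the same pattern as the
__freed_request() change above, the remaining lines presumably drop the
root_rl-only call and switch to the new per-rl helper, roughly:

	-		blk_set_queue_congested(q, is_sync);
	+	blk_set_congested(rl, is_sync);

where the "is_sync" argument name is assumed from context, not taken from
the patch.)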

[PATCH 27/48] writeback, blkcg: propagate non-root blkcg congestion state

2015-03-22 Thread Tejun Heo
Now that bdi layer can handle per-blkcg bdi_writeback_congested state,
blk_{set|clear}_congested() can propagate non-root blkcg congestion
state to them.

This can be easily achieved by disabling the root_rl tests in
blk_{set|clear}_congested().  Note that we still need those tests when
!CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
wb's congestion state for events happening on other blkcgs.

v2: Updated for bdi_writeback_congested.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Vivek Goyal 
---
 block/blk-core.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cad26e3..95488fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -65,23 +65,26 @@ static struct workqueue_struct *kblockd_workqueue;
 
 static void blk_clear_congested(struct request_list *rl, int sync)
 {
-   if (rl != &rl->q->root_rl)
-   return;
 #ifdef CONFIG_CGROUP_WRITEBACK
clear_wb_congested(rl->blkg->wb_congested, sync);
 #else
-   clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+   /*
+* If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
+* flip its congestion state for events on other blkcgs.
+*/
+   if (rl == &rl->q->root_rl)
+   clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
 #endif
 }
 
 static void blk_set_congested(struct request_list *rl, int sync)
 {
-   if (rl != &rl->q->root_rl)
-   return;
 #ifdef CONFIG_CGROUP_WRITEBACK
set_wb_congested(rl->blkg->wb_congested, sync);
 #else
-   set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+   /* see blk_clear_congested() */
+   if (rl == &rl->q->root_rl)
+   set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
 #endif
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 32/48] writeback: don't issue wb_writeback_work if clean

2015-03-22 Thread Tejun Heo
There are several places in fs/fs-writeback.c which queues
wb_writeback_work without checking whether the target wb
(bdi_writeback) has dirty inodes or not.  The only thing
wb_writeback_work does is writing back the dirty inodes for the target
wb and queueing a work item for a clean wb is essentially noop.  There
are some side effects such as bandwidth stats being updated and
triggering tracepoints but these don't affect the operation in any
meaningful way.

This patch makes all writeback_inodes_sb_nr() and sync_inodes_sb()
skip wb_queue_work() if the target bdi is clean.  Also, it moves
dirtiness check from wakeup_flusher_threads() to
__wb_start_writeback() so that all its callers benefit from the check.

While the overhead incurred by scheduling a noop work isn't currently
significant, the overhead may be higher with cgroup writeback support
as we may end up issuing noop work items to a lot of clean wb's.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7f44c02..3ceacbb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -177,6 +177,9 @@ static void __wb_start_writeback(struct bdi_writeback *wb, 
long nr_pages,
 {
struct wb_writeback_work *work;
 
+   if (!wb_has_dirty_io(wb))
+   return;
+
/*
 * This is WB_SYNC_NONE writeback, so if allocation fails just
 * wakeup the thread for old dirty data writeback
@@ -1207,11 +1210,8 @@ void wakeup_flusher_threads(long nr_pages, enum 
wb_reason reason)
nr_pages = get_nr_dirty_pages();
 
rcu_read_lock();
-   list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
-   if (!bdi_has_dirty_io(bdi))
-   continue;
+   list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
__wb_start_writeback(>wb, nr_pages, false, reason);
-   }
rcu_read_unlock();
 }
 
@@ -1445,11 +1445,12 @@ void writeback_inodes_sb_nr(struct super_block *sb,
.nr_pages   = nr,
.reason = reason,
};
+   struct backing_dev_info *bdi = sb->s_bdi;
 
-   if (sb->s_bdi == &noop_backing_dev_info)
+   if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
        return;
        WARN_ON(!rwsem_is_locked(&sb->s_umount));
-   wb_queue_work(&sb->s_bdi->wb, &work);
+   wb_queue_work(&bdi->wb, &work);
        wait_for_completion(&done);
 }
 EXPORT_SYMBOL(writeback_inodes_sb_nr);
@@ -1527,13 +1528,14 @@ void sync_inodes_sb(struct super_block *sb)
.reason = WB_REASON_SYNC,
.for_sync   = 1,
};
+   struct backing_dev_info *bdi = sb->s_bdi;
 
/* Nothing to do? */
-   if (sb->s_bdi == &noop_backing_dev_info)
+   if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
        return;
        WARN_ON(!rwsem_is_locked(&sb->s_umount));

-   wb_queue_work(&sb->s_bdi->wb, &work);
+   wb_queue_work(&bdi->wb, &work);
        wait_for_completion(&done);
 
wait_sb_inodes(sb);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 31/48] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account

2015-03-22 Thread Tejun Heo
bdi_has_dirty_io() used to only reflect whether the root wb
(bdi_writeback) has dirty inodes.  For cgroup writeback support, it
needs to take all active wb's into account.  If any wb on the bdi has
dirty inodes, bdi_has_dirty_io() should return true.

To achieve that, as inode_wb_list_{move|del}_locked() now keep track
of the dirty state transition of each wb, the number of dirty wbs can
be counted in the bdi; however, the bdi already aggregates
wb->avg_write_bandwidth, which can easily be guaranteed to be > 0
whenever there are dirty inodes by ensuring that wb->avg_write_bandwidth
never dips below 1.  bdi_has_dirty_io() can then simply test whether
bdi->tot_write_bandwidth is zero or not.

While this bumps the value of wb->avg_write_bandwidth to one when it
used to be zero, this shouldn't cause any meaningful behavior
difference.

bdi_has_dirty_io() is made an inline function which tests whether
->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
value are added to inode_wb_list_{move|del}_locked().

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c|  5 +++--
 include/linux/backing-dev-defs.h |  8 ++--
 include/linux/backing-dev.h  | 10 +-
 mm/backing-dev.c |  5 -
 mm/page-writeback.c  | 10 +++---
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9d85f59..7f44c02 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -87,6 +87,7 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
return false;
} else {
        set_bit(WB_has_dirty_io, &wb->state);
+   WARN_ON_ONCE(!wb->avg_write_bandwidth);
        atomic_long_add(wb->avg_write_bandwidth,
        &wb->bdi->tot_write_bandwidth);
return true;
@@ -98,8 +99,8 @@ static void wb_io_lists_depopulated(struct bdi_writeback *wb)
        if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
        list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
        clear_bit(WB_has_dirty_io, &wb->state);
-   atomic_long_sub(wb->avg_write_bandwidth,
-   &wb->bdi->tot_write_bandwidth);
+   WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth,
+   &wb->bdi->tot_write_bandwidth) < 0);
}
 }
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index d631a61..8c857d7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -98,7 +98,7 @@ struct bdi_writeback {
unsigned long dirtied_stamp;
unsigned long written_stamp;/* pages written at bw_time_stamp */
unsigned long write_bandwidth;  /* the estimated write bandwidth */
-   unsigned long avg_write_bandwidth; /* further smoothed write bw */
+   unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */
 
/*
 * The base dirty throttle rate, re-calculated on every 200ms.
@@ -142,7 +142,11 @@ struct backing_dev_info {
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
 
-   atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
+   /*
+* Sum of avg_write_bw of wbs with dirty inodes.  > 0 if there are
+* any dirty wbs, which is depended upon by bdi_has_dirty().
+*/
+   atomic_long_t tot_write_bandwidth;
 
struct bdi_writeback wb;  /* the root writeback info for this bdi */
struct bdi_writeback_congested wb_congested; /* its congested state */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bab5927..433b308 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,6 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long 
nr_pages,
enum wb_reason reason);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void wb_workfn(struct work_struct *work);
-bool bdi_has_dirty_io(struct backing_dev_info *bdi);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
 
 extern spinlock_t bdi_lock;
@@ -42,6 +41,15 @@ static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
return test_bit(WB_has_dirty_io, >state);
 }
 
+static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+   /*
+* @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are
+* any dirty wbs.  See wb_update_write_bandwidth().
+*/
+   return atomic_long_read(&bdi->tot_write_bandwidth);
+}
+
 static inline void __add_wb_stat(struct bdi_writeback *wb,
 enum wb_stat_item item, s64 amount)
 {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 56d7622..eab5181 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -256,11 +256,6 @@ static int __init default_bdi_init(void)
 }
 subsys_initcall(default_bdi_init);
 
-bool 
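
(The rest of the diff is cut off here.  Per the description, the
mm/page-writeback.c side presumably clamps the average at 1 and folds the
delta into the bdi-wide sum at the end of wb_update_write_bandwidth(),
roughly as sketched below; this is an illustration, not the exact hunk.)

	out:
		/* keep avg > 0 so tot_write_bandwidth > 0 while any wb is dirty */
		avg = max(avg, 1LU);

		if (wb_has_dirty_io(wb)) {
			long delta = avg - wb->avg_write_bandwidth;

			WARN_ON_ONCE(atomic_long_add_return(delta,
					&wb->bdi->tot_write_bandwidth) <= 0);
		}

		wb->write_bandwidth = bw;
		wb->avg_write_bandwidth = avg;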

Re: [PATCH 1/2] spi: Add SPI driver for Mikrotik RB4xx series boards

2015-03-22 Thread Mark Brown
On Sun, Mar 22, 2015 at 12:25:24PM +0100, Bert Vermeulen wrote:

> It is bitbanging, at least on write. The hardware has a shift register that
> it uses for reads. The generic spi for this board's architecture (ath79)
> indeed uses spi-bitbang.

> This "fast SPI" thing is what makes this one different: the boot flash and
> MMC use regular SPI on the same bus as the CPLD. This CPLD needs this fast
> SPI: a mode where it shifts in two bits per clock. The second bit is
> apparently sent via the CS2 pin.

Please make sure that all this is visible to someone looking at the
driver, it's really not at all clear what's going on just from reading
the code.

> So I don't think spi-bitbang will do. I need to see about reworking things
> to use less custom queueing -- I'm not that familiar with this yet.

Mostly it's just a case of deleting the loops and using the core ones
instead.




Re: [PATCH 6/7 linux-next] ASoC: fsi: constify of_device_id array

2015-03-22 Thread Mark Brown
On Wed, Mar 18, 2015 at 05:49:01PM +0100, Fabian Frederick wrote:
> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.


signature.asc
Description: Digital signature


[PATCH 36/48] writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's

2015-03-22 Thread Tejun Heo
For cgroup writeback support, all bdi-wide operations should be
distributed to all its wb's (bdi_writeback's).

This patch updates laptop_mode_timer_fn() so that it invokes
wb_start_writeback() on all wb's rather than just the root one.  As
the intent is writing out all dirty data, there's no reason to split
the number of pages to write.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 mm/page-writeback.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7c3a555..fa37e73 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1723,14 +1723,20 @@ void laptop_mode_timer_fn(unsigned long data)
struct request_queue *q = (struct request_queue *)data;
int nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
+   struct bdi_writeback *wb;
+   struct wb_iter iter;
 
/*
 * We want to write everything out, not just down to the dirty
 * threshold
 */
-   if (bdi_has_dirty_io(&q->backing_dev_info))
-   wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
-  WB_REASON_LAPTOP_TIMER);
+   if (!bdi_has_dirty_io(&q->backing_dev_info))
+   return;
+
+   bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
+   if (wb_has_dirty_io(wb))
+   wb_start_writeback(wb, nr_pages, true,
+  WB_REASON_LAPTOP_TIMER);
 }
 
 /*
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/7 linux-next] ASoC: kirkwood: constify of_device_id array

2015-03-22 Thread Mark Brown
On Wed, Mar 18, 2015 at 05:48:58PM +0100, Fabian Frederick wrote:
> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.


signature.asc
Description: Digital signature


Re: [PATCH] ASoC:pcm512x: Fix divide by zero issue.

2015-03-22 Thread Mark Brown
On Fri, Mar 20, 2015 at 09:13:45PM +, Howard Mitchell wrote:
> If den=1 and pllin_rate>20MHz then den and num are adjusted to 0
> causing a divide by zero error a few lines further on. Therefore
> this patch correctly scales num and den such that
> pllin_rate/den < 20MHz as required in the device data sheet.

Applied, thanks.


signature.asc
Description: Digital signature


[PATCH 33/48] writeback: make bdi->min/max_ratio handling cgroup writeback aware

2015-03-22 Thread Tejun Heo
bdi->min/max_ratio are user-configurable per-bdi knobs which regulate
the dirty limit of each bdi.  For cgroup writeback, they need to be
further distributed across wb's (bdi_writeback's) belonging to the
configured bdi.

This patch introduces wb_min_max_ratio() which distributes
bdi->min/max_ratio according to a wb's proportion in the total active
bandwidth of its bdi.
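
For example (hypothetical numbers), if a bdi is configured with
min_ratio=4 and max_ratio=80 and a wb contributes 50 MB/s out of the
bdi's 200 MB/s total active write bandwidth, the wb's effective min
becomes 4 * 50 / 200 = 1 and its effective max 80 * 50 / 200 = 20,
matching the do_div() arithmetic below.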

v2: Update wb_min_max_ratio() to fix a bug where both min and max were
assigned the min value and avoid calculations when possible.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 mm/page-writeback.c | 50 ++
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8480a45..349e32b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -155,6 +155,46 @@ static unsigned long writeout_period_time = 0;
  */
 #define VM_COMPLETIONS_PERIOD_LEN (3*HZ)
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+unsigned long *minp, unsigned long *maxp)
+{
+   unsigned long this_bw = wb->avg_write_bandwidth;
+   unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
+   unsigned long long min = wb->bdi->min_ratio;
+   unsigned long long max = wb->bdi->max_ratio;
+
+   /*
+* @wb may already be clean by the time control reaches here and
+* the total may not include its bw.
+*/
+   if (this_bw < tot_bw) {
+   if (min) {
+   min *= this_bw;
+   do_div(min, tot_bw);
+   }
+   if (max < 100) {
+   max *= this_bw;
+   do_div(max, tot_bw);
+   }
+   }
+
+   *minp = min;
+   *maxp = max;
+}
+
+#else  /* CONFIG_CGROUP_WRITEBACK */
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+unsigned long *minp, unsigned long *maxp)
+{
+   *minp = wb->bdi->min_ratio;
+   *maxp = wb->bdi->max_ratio;
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
 /*
  * In a memory zone, there is a certain amount of pages we consider
  * available for the page cache, which is essentially the number of
@@ -539,9 +579,9 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
  */
 unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
 {
-   struct backing_dev_info *bdi = wb->bdi;
u64 wb_dirty;
long numerator, denominator;
+   unsigned long wb_min_ratio, wb_max_ratio;
 
/*
 * Calculate this BDI's share of the dirty ratio.
@@ -552,9 +592,11 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, 
unsigned long dirty)
wb_dirty *= numerator;
do_div(wb_dirty, denominator);
 
-   wb_dirty += (dirty * bdi->min_ratio) / 100;
-   if (wb_dirty > (dirty * bdi->max_ratio) / 100)
-   wb_dirty = dirty * bdi->max_ratio / 100;
+   wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
+
+   wb_dirty += (dirty * wb_min_ratio) / 100;
+   if (wb_dirty > (dirty * wb_max_ratio) / 100)
+   wb_dirty = dirty * wb_max_ratio / 100;
 
return wb_dirty;
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][ASoC]Add ability to remove rate constraints from generic ASoC AC'97 CODEC driver

2015-03-22 Thread Mark Brown
On Wed, Mar 11, 2015 at 12:28:19AM +0100, Maciej S. Szmigiero wrote:
> Add ability to remove rate constraints from generic ASoC AC'97 CODEC
> driver via passed platform data, make it selectable in config.

Please use subject lines matching the style for the subsystem.  This is
helpful for identifying relevant patches and not getting your messages
deleted unread...

> This way this driver can be used for platforms which don't need
> specialized AC'97 CODEC drivers while at the same time avoiding
> code duplication from implementing equivalent functionality in
> a controller driver.

I'm sorry but this just doesn't explain what this patch is intended to
accomplish.  If we can talk to the AC'97 CODEC at all we can already
identify whatever constraints it has by looking at the ID registers so
it's not clear when or why a platform would need to use this.  It feels
like there is some underlying problem that you're trying to address.


signature.asc
Description: Digital signature


[PATCH 38/48] writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info

2015-03-22 Thread Tejun Heo
bdi_start_background_writeback() currently takes @bdi and kicks the
root wb (bdi_writeback).  In preparation for cgroup writeback support,
make it take wb instead.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c   | 12 ++--
 include/linux/backing-dev.h |  2 +-
 mm/page-writeback.c |  4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c342c05..c9bda4d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -226,23 +226,23 @@ void wb_start_writeback(struct bdi_writeback *wb, long 
nr_pages,
 }
 
 /**
- * bdi_start_background_writeback - start background writeback
- * @bdi: the backing device to write from
+ * wb_start_background_writeback - start background writeback
+ * @wb: bdi_writback to write from
  *
  * Description:
  *   This makes sure WB_SYNC_NONE background writeback happens. When
- *   this function returns, it is only guaranteed that for given BDI
+ *   this function returns, it is only guaranteed that for given wb
  *   some IO is happening if we are over background dirty threshold.
  *   Caller need not hold sb s_umount semaphore.
  */
-void bdi_start_background_writeback(struct backing_dev_info *bdi)
+void wb_start_background_writeback(struct bdi_writeback *wb)
 {
/*
 * We just wake up the flusher thread. It will perform background
 * writeback as soon as there is no other work to do.
 */
-   trace_writeback_wake_background(bdi);
-   wb_wakeup(&bdi->wb);
+   trace_writeback_wake_background(wb->bdi);
+   wb_wakeup(wb);
 }
 
 /*
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 216a016..fee39cd 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -27,7 +27,7 @@ void bdi_unregister(struct backing_dev_info *bdi);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason);
-void bdi_start_background_writeback(struct backing_dev_info *bdi);
+void wb_start_background_writeback(struct bdi_writeback *wb);
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3e698f4..fd441ea 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1456,7 +1456,7 @@ static void balance_dirty_pages(struct address_space 
*mapping,
}
 
if (unlikely(!writeback_in_progress(wb)))
-   bdi_start_background_writeback(bdi);
+   wb_start_background_writeback(wb);
 
if (!strictlimit)
wb_dirty_limits(wb, dirty_thresh, background_thresh,
@@ -1588,7 +1588,7 @@ pause:
return;
 
if (nr_reclaimable > background_thresh)
-   bdi_start_background_writeback(bdi);
+   wb_start_background_writeback(wb);
 }
 
 static DEFINE_PER_CPU(int, bdp_ratelimits);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ARM: S3C64XX: Use fixed IRQ bases to avoid conflicts on Cragganmore

2015-03-22 Thread Mark Brown
On Sun, Mar 22, 2015 at 10:40:41AM +, Charles Keepax wrote:
> There are two PMICs on Cragganmore; currently one dynamically assigns
> its IRQ base and the other uses a fixed base. It is possible for the
> statically assigned PMIC to fail if its IRQ is taken by the dynamically
> assigned one. Fix this by statically assigning both the IRQ bases.
> 
> Signed-off-by: Charles Keepax 

Reviewed-by: Mark Brown 

This probably wants to go to stable as well.


signature.asc
Description: Digital signature


Re: [PATCH] ASoC:pcm512x: Make PLL lock output selectable via device tree.

2015-03-22 Thread Mark Brown
On Fri, Mar 20, 2015 at 09:22:43PM +, Howard Mitchell wrote:

> + if (pcm512x->pll_lock) {

> +if (of_property_read_u32(np, "pll-lock", &val) >= 0) {
> +if (val > 6) {
> +dev_err(dev, "Invalid pll-lock\n");
> +ret = -EINVAL;
> +goto err_clk;
> +}
> +pcm512x->pll_lock = val;
> +}

This breaks existing boards which rely on GPIO 4 being set as the lock
output.  This is very unfortunate since it's a silly thing for the
driver to default to but nontheless we should really continue to support
them - at a guess Peter's board is relying on this, and even if it isn't
someone else's might.


signature.asc
Description: Digital signature


[PATCH 40/48] writeback: add wb_writeback_work->auto_free

2015-03-22 Thread Tejun Heo
Currently, a wb_writeback_work is freed automatically on completion if
it doesn't have ->done set.  Add wb_writeback_work->auto_free to make
this behavior explicit.  This will help cgroup writeback support, where
waiting for completion and freeing automatically don't necessarily go
together.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 75d5e5c..25504be 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -47,6 +47,7 @@ struct wb_writeback_work {
unsigned int range_cyclic:1;
unsigned int for_background:1;
unsigned int for_sync:1;/* sync(2) WB_SYNC_ALL writeback */
+   unsigned int auto_free:1;   /* free on completion */
enum wb_reason reason;  /* why was writeback initiated? */
 
struct list_head list;  /* pending work list */
@@ -256,6 +257,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long 
nr_pages,
work->nr_pages  = nr_pages;
work->range_cyclic = range_cyclic;
work->reason= reason;
+   work->auto_free = 1;
 
wb_queue_work(wb, work);
 }
@@ -1133,19 +1135,16 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) {
+   struct completion *done = work->done;
 
trace_writeback_exec(wb->bdi, work);
 
wrote += wb_writeback(wb, work);
 
-   /*
-* Notify the caller of completion if this is a synchronous
-* work item, otherwise just free it.
-*/
-   if (work->done)
-   complete(work->done);
-   else
+   if (work->auto_free)
kfree(work);
+   if (done)
+   complete(done);
}
 
/*
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 41/48] writeback: implement bdi_wait_for_completion()

2015-03-22 Thread Tejun Heo
The completion of a wb_writeback_work can currently be waited upon by
setting its ->done to a struct completion and waiting on it; however,
for cgroup writeback support, it's necessary to issue multiple work
items to multiple bdi_writebacks and wait for the completion of all.

This patch implements wb_completion which can wait for multiple work
items and replaces the struct completion with it.  It can be defined
using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
waited for by wb_wait_for_completion().
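
The intended usage pattern is, roughly (a minimal sketch based on the
writeback_inodes_sb_nr() hunk below; not a new API beyond this patch):

	DEFINE_WB_COMPLETION_ONSTACK(done);
	struct wb_writeback_work work = { ..., .done = &done };

	wb_queue_work(wb, &work);		/* may be repeated for several wb's */
	wb_wait_for_completion(bdi, &done);	/* returns once all of them complete */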

Nobody currently issues multiple work items and this patch doesn't
introduce any behavior changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c| 57 +++-
 include/linux/backing-dev-defs.h |  2 ++
 mm/backing-dev.c |  1 +
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 25504be..944e53d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,6 +34,10 @@
  */
 #define MIN_WRITEBACK_PAGES(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
+struct wb_completion {
+   atomic_t                cnt;
+};
+
 /*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
@@ -51,9 +55,21 @@ struct wb_writeback_work {
enum wb_reason reason;  /* why was writeback initiated? */
 
struct list_head list;  /* pending work list */
-   struct completion *done;/* set if the caller waits */
+   struct wb_completion *done; /* set if the caller waits */
 };
 
+/*
+ * If one wants to wait for one or more wb_writeback_works, each work's
+ * ->done should be set to a wb_completion defined using the following
+ * macro.  Once all work items are issued with wb_queue_work(), the caller
+ * can wait for the completion of all using wb_wait_for_completion().  Work
+ * items which are waited upon aren't freed automatically on completion.
+ */
+#define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \
+   struct wb_completion cmpl = {   \
+   .cnt= ATOMIC_INIT(1),   \
+   }
+
 static inline struct inode *wb_inode(struct list_head *head)
 {
return list_entry(head, struct inode, i_wb_list);
@@ -149,17 +165,34 @@ static void wb_queue_work(struct bdi_writeback *wb,
trace_writeback_queue(wb->bdi, work);
 
spin_lock_bh(&wb->work_lock);
-   if (!test_bit(WB_registered, &wb->state)) {
-   if (work->done)
-   complete(work->done);
+   if (!test_bit(WB_registered, &wb->state))
goto out_unlock;
-   }
+   if (work->done)
+   atomic_inc(&work->done->cnt);
list_add_tail(&work->list, &wb->work_list);
mod_delayed_work(bdi_wq, &wb->dwork, 0);
 out_unlock:
spin_unlock_bh(&wb->work_lock);
 }
 
+/**
+ * wb_wait_for_completion - wait for completion of bdi_writeback_works
+ * @bdi: bdi work items were issued to
+ * @done: target wb_completion
+ *
+ * Wait for one or more work items issued to @bdi with their ->done field
+ * set to @done, which should have been defined with
+ * DEFINE_WB_COMPLETION_ONSTACK().  This function returns after all such
+ * work items are completed.  Work items which are waited upon aren't freed
+ * automatically on completion.
+ */
+static void wb_wait_for_completion(struct backing_dev_info *bdi,
+  struct wb_completion *done)
+{
+   atomic_dec(&done->cnt); /* put down the initial count */
+   wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
@@ -1135,7 +1168,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) {
-   struct completion *done = work->done;
+   struct wb_completion *done = work->done;
 
trace_writeback_exec(wb->bdi, work);
 
@@ -1143,8 +1176,8 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
if (work->auto_free)
kfree(work);
-   if (done)
-   complete(done);
+   if (done && atomic_dec_and_test(&done->cnt))
+   wake_up_all(&wb->bdi->wb_waitq);
}
 
/*
@@ -1448,7 +1481,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
unsigned long nr,
enum wb_reason reason)
 {
-   DECLARE_COMPLETION_ONSTACK(done);
+   DEFINE_WB_COMPLETION_ONSTACK(done);
struct wb_writeback_work work = {
.sb = sb,
.sync_mode  = WB_SYNC_NONE,
@@ -1463,7 +1496,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
wb_queue_work(&bdi->wb, &work);
-   wait_for_completion(&done);
+   

Re: [PATCH 2/7 linux-next] ASoC: fsl: constify of_device_id array

2015-03-22 Thread Mark Brown
On Wed, Mar 18, 2015 at 05:48:57PM +0100, Fabian Frederick wrote:
> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.


signature.asc
Description: Digital signature


[PATCH 44/48] writeback: make writeback initiation functions handle multiple bdi_writeback's

2015-03-22 Thread Tejun Heo
[try_]writeback_inodes_sb[_nr]() and sync_inodes_sb() currently only
handle dirty inodes on the root wb (bdi_writeback) of the target bdi.
This patch implements bdi_split_work_to_wbs() and uses it to make
these functions handle multiple wb's.

bdi_split_work_to_wbs() takes a base wb_writeback_work, creates clones
of it and issues them to the wb's of the target bdi.  The base work's
nr_pages is distributed using wb_split_bdi_pages(), i.e. according to
each wb's proportion of the bdi's total active write bandwidth.

Cloning a work item involves memory allocation which may fail.  In such
cases, bdi_split_work_to_wbs() issues the base work directly and waits
for its completion before proceeding to the next wb to guarantee
forward progress and correctness under memory pressure.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c | 96 ---
 1 file changed, 91 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be374ae..57b2282 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -289,6 +289,80 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, 
long nr_pages)
return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
 }
 
+/**
+ * wb_clone_and_queue_work - clone a wb_writeback_work and issue it to a wb
+ * @wb: target bdi_writeback
+ * @base_work: source wb_writeback_work
+ *
+ * Try to make a clone of @base_work and issue it to @wb.  If cloning
+ * succeeds, %true is returned; otherwise, @base_work is issued directly
+ * and %false is returned.  In the latter case, the caller is required to
+ * wait for @base_work's completion using wb_wait_for_single_work().
+ *
+ * A clone is auto-freed on completion.  @base_work never is.
+ */
+static bool wb_clone_and_queue_work(struct bdi_writeback *wb,
+   struct wb_writeback_work *base_work)
+{
+   struct wb_writeback_work *work;
+
+   work = kmalloc(sizeof(*work), GFP_ATOMIC);
+   if (work) {
+   *work = *base_work;
+   work->auto_free = 1;
+   work->single_wait = 0;
+   } else {
+   work = base_work;
+   work->auto_free = 0;
+   work->single_wait = 1;
+   }
+   work->single_done = 0;
+   wb_queue_work(wb, work);
+   return work != base_work;
+}
+
+/**
+ * bdi_split_work_to_wbs - split a wb_writeback_work to all wb's of a bdi
+ * @bdi: target backing_dev_info
+ * @base_work: wb_writeback_work to issue
+ * @skip_if_busy: skip wb's which already have writeback in progress
+ *
+ * Split and issue @base_work to all wb's (bdi_writeback's) of @bdi which
+ * have dirty inodes.  If @base_work->nr_pages isn't %LONG_MAX, it's
+ * distributed to the busy wbs according to each wb's proportion in the
+ * total active write bandwidth of @bdi.
+ */
+static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
+ struct wb_writeback_work *base_work,
+ bool skip_if_busy)
+{
+   long nr_pages = base_work->nr_pages;
+   int next_blkcg_id = 0;
+   struct bdi_writeback *wb;
+   struct wb_iter iter;
+
+   might_sleep();
+
+   if (!bdi_has_dirty_io(bdi))
+   return;
+restart:
+   rcu_read_lock();
+   bdi_for_each_wb(wb, bdi, &iter, next_blkcg_id) {
+   if (!wb_has_dirty_io(wb) ||
+   (skip_if_busy && writeback_in_progress(wb)))
+   continue;
+
+   base_work->nr_pages = wb_split_bdi_pages(wb, nr_pages);
+   if (!wb_clone_and_queue_work(wb, base_work)) {
+   next_blkcg_id = wb->blkcg_css->id + 1;
+   rcu_read_unlock();
+   wb_wait_for_single_work(bdi, base_work);
+   goto restart;
+   }
+   }
+   rcu_read_unlock();
+}
+
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
 static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
@@ -296,6 +370,21 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, 
long nr_pages)
return nr_pages;
 }
 
+static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
+ struct wb_writeback_work *base_work,
+ bool skip_if_busy)
+{
+   might_sleep();
+
+   if (bdi_has_dirty_io(bdi) &&
+   (!skip_if_busy || !writeback_in_progress(&bdi->wb))) {
+   base_work->auto_free = 0;
+   base_work->single_wait = 0;
+   base_work->single_done = 0;
+   wb_queue_work(&bdi->wb, base_work);
+   }
+}
+
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
@@ -1528,10 +1617,7 @@ static void __writeback_inodes_sb_nr(struct super_block 
*sb, unsigned long nr,
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-   if (skip_if_busy && 

[PATCH 34/48] writeback: implement bdi_for_each_wb()

2015-03-22 Thread Tejun Heo
This will be used to implement bdi-wide operations which should be
distributed across all its cgroup bdi_writebacks.
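
Roughly, a caller is expected to look like the following sketch
(do_something() is a placeholder; as noted in the comment below, the
iteration must run under rcu_read_lock()):

	struct bdi_writeback *wb;
	struct wb_iter iter;

	rcu_read_lock();
	bdi_for_each_wb(wb, bdi, &iter, 0)
		do_something(wb);
	rcu_read_unlock();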

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 include/linux/backing-dev.h | 63 +
 1 file changed, 63 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 433b308..9dc4eea 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -384,6 +384,61 @@ static inline struct bdi_writeback *inode_to_wb(struct 
inode *inode)
return inode->i_wb;
 }
 
+struct wb_iter {
+   int start_blkcg_id;
+   struct radix_tree_iter  tree_iter;
+   void**slot;
+};
+
+static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter,
+  struct backing_dev_info *bdi)
+{
+   struct radix_tree_iter *titer = &iter->tree_iter;
+
+   WARN_ON_ONCE(!rcu_read_lock_held());
+
+   if (iter->start_blkcg_id >= 0) {
+   iter->slot = radix_tree_iter_init(titer, iter->start_blkcg_id);
+   iter->start_blkcg_id = -1;
+   } else {
+   iter->slot = radix_tree_next_slot(iter->slot, titer, 0);
+   }
+
+   if (!iter->slot)
+   iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0);
+   if (iter->slot)
+   return *iter->slot;
+   return NULL;
+}
+
+static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
+  struct backing_dev_info *bdi,
+  int start_blkcg_id)
+{
+   iter->start_blkcg_id = start_blkcg_id;
+
+   if (start_blkcg_id)
+   return __wb_iter_next(iter, bdi);
+   else
+   return &bdi->wb;
+}
+
+/**
+ * bdi_for_each_wb - walk all wb's of a bdi in ascending blkcg ID order
+ * @wb_cur: cursor struct bdi_writeback pointer
+ * @bdi: bdi to walk wb's of
+ * @iter: pointer to struct wb_iter to be used as iteration buffer
+ * @start_blkcg_id: blkcg ID to start iteration from
+ *
+ * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending
+ * blkcg ID order starting from @start_blkcg_id.  @iter is struct wb_iter
+ * to be used as temp storage during iteration.  rcu_read_lock() must be
+ * held throughout iteration.
+ */
+#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
+   for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id);  \
+(wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
+
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool inode_cgwb_enabled(struct inode *inode)
@@ -446,6 +501,14 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 {
 }
 
+struct wb_iter {
+   int next_id;
+};
+
+#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
+   for ((iter)->next_id = (start_blkcg_id);\
+({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
+
 static inline int mapping_congested(struct address_space *mapping,
struct task_struct *task, int cong_bits)
 {
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 7/7 linux-next] ASoC: rsnd: constify of_device_id array

2015-03-22 Thread Mark Brown
On Wed, Mar 18, 2015 at 05:49:02PM +0100, Fabian Frederick wrote:
> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.


signature.asc
Description: Digital signature


[PATCH 35/48] writeback: remove bdi_start_writeback()

2015-03-22 Thread Tejun Heo
bdi_start_writeback() is a thin wrapper on top of
__wb_start_writeback() which is used only by laptop_mode_timer_fn().
This patch removes bdi_start_writeback(), renames
__wb_start_writeback() to wb_start_writeback() and makes
laptop_mode_timer_fn() use it instead.

This doesn't cause any functional difference and will ease making
laptop_mode_timer_fn() cgroup writeback aware.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c   | 68 +
 include/linux/backing-dev.h |  4 +--
 mm/page-writeback.c |  4 +--
 3 files changed, 29 insertions(+), 47 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3ceacbb..c24d6fd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -172,33 +172,6 @@ out_unlock:
spin_unlock_bh(>work_lock);
 }
 
-static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
-bool range_cyclic, enum wb_reason reason)
-{
-   struct wb_writeback_work *work;
-
-   if (!wb_has_dirty_io(wb))
-   return;
-
-   /*
-* This is WB_SYNC_NONE writeback, so if allocation fails just
-* wakeup the thread for old dirty data writeback
-*/
-   work = kzalloc(sizeof(*work), GFP_ATOMIC);
-   if (!work) {
-   trace_writeback_nowork(wb->bdi);
-   wb_wakeup(wb);
-   return;
-   }
-
-   work->sync_mode = WB_SYNC_NONE;
-   work->nr_pages  = nr_pages;
-   work->range_cyclic = range_cyclic;
-   work->reason= reason;
-
-   wb_queue_work(wb, work);
-}
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
@@ -238,22 +211,31 @@ EXPORT_SYMBOL_GPL(mapping_congested);
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
-/**
- * bdi_start_writeback - start writeback
- * @bdi: the backing device to write from
- * @nr_pages: the number of pages to write
- * @reason: reason why some writeback work was initiated
- *
- * Description:
- *   This does WB_SYNC_NONE opportunistic writeback. The IO is only
- *   started when this function returns, we make no guarantees on
- *   completion. Caller need not hold sb s_umount semaphore.
- *
- */
-void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
-   enum wb_reason reason)
+void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+   bool range_cyclic, enum wb_reason reason)
 {
-   __wb_start_writeback(&bdi->wb, nr_pages, true, reason);
+   struct wb_writeback_work *work;
+
+   if (!wb_has_dirty_io(wb))
+   return;
+
+   /*
+* This is WB_SYNC_NONE writeback, so if allocation fails just
+* wakeup the thread for old dirty data writeback
+*/
+   work = kzalloc(sizeof(*work), GFP_ATOMIC);
+   if (!work) {
+   trace_writeback_nowork(wb->bdi);
+   wb_wakeup(wb);
+   return;
+   }
+
+   work->sync_mode = WB_SYNC_NONE;
+   work->nr_pages  = nr_pages;
+   work->range_cyclic = range_cyclic;
+   work->reason= reason;
+
+   wb_queue_work(wb, work);
 }
 
 /**
@@ -1211,7 +1193,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason 
reason)
 
rcu_read_lock();
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
-   __wb_start_writeback(&bdi->wb, nr_pages, false, reason);
+   wb_start_writeback(&bdi->wb, nr_pages, false, reason);
rcu_read_unlock();
 }
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 9dc4eea..81e39ff 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -25,8 +25,8 @@ int bdi_register(struct backing_dev_info *bdi, struct device 
*parent,
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
-void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
-   enum wb_reason reason);
+void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+   bool range_cyclic, enum wb_reason reason);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 349e32b..7c3a555 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1729,8 +1729,8 @@ void laptop_mode_timer_fn(unsigned long data)
 * threshold
 */
if (bdi_has_dirty_io(&q->backing_dev_info))
-   bdi_start_writeback(&q->backing_dev_info, nr_pages,
-   WB_REASON_LAPTOP_TIMER);
+   wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
+  WB_REASON_LAPTOP_TIMER);
 }
 
 /*
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe 

Re: [PATCH 4/7 linux-next] ASoC: rt5631: constify of_device_id array

2015-03-22 Thread Mark Brown
On Wed, Mar 18, 2015 at 05:48:59PM +0100, Fabian Frederick wrote:
> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.


signature.asc
Description: Digital signature


[PATCH 45/48] writeback: dirty inodes against their matching cgroup bdi_writeback's

2015-03-22 Thread Tejun Heo
__mark_inode_dirty() always dirtied the inode against the root wb
(bdi_writeback).  The previous patches added all the infrastructure
necessary to attribute an inode against the wb of the dirtying cgroup.

This patch updates __mark_inode_dirty() so that it uses the wb
associated with the inode instead of unconditionally using the root
one.

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being dirtied against the root wb.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 57b2282..890cff1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1442,7 +1442,6 @@ static noinline void block_dump___mark_inode_dirty(struct 
inode *inode)
 void __mark_inode_dirty(struct inode *inode, int flags)
 {
struct super_block *sb = inode->i_sb;
-   struct backing_dev_info *bdi = NULL;
int dirtytime;
 
trace_writeback_mark_inode_dirty(inode, flags);
@@ -1512,21 +1511,21 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 * reposition it (that would break b_dirty time-ordering).
 */
if (!was_dirty) {
+   struct bdi_writeback *wb = inode_to_wb(inode);
bool wakeup_bdi = false;
-   bdi = inode_to_bdi(inode);
 
spin_unlock(&inode->i_lock);
-   spin_lock(&bdi->wb.list_lock);
+   spin_lock(&wb->list_lock);
 
-   WARN(bdi_cap_writeback_dirty(bdi) &&
-!test_bit(WB_registered, &bdi->wb.state),
-"bdi-%s not registered\n", bdi->name);
+   WARN(bdi_cap_writeback_dirty(wb->bdi) &&
+!test_bit(WB_registered, &wb->state),
+"bdi-%s not registered\n", wb->bdi->name);
 
inode->dirtied_when = jiffies;
-   wakeup_bdi = inode_wb_list_move_locked(inode, &bdi->wb,
-   dirtytime ? &bdi->wb.b_dirty_time :
-   &bdi->wb.b_dirty);
-   spin_unlock(&bdi->wb.list_lock);
+   wakeup_bdi = inode_wb_list_move_locked(inode, wb,
+   dirtytime ? &wb->b_dirty_time :
+   &wb->b_dirty);
+   spin_unlock(&wb->list_lock);
trace_writeback_dirty_inode_enqueue(inode);
 
/*
@@ -1535,8 +1534,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 * to make sure background write-back happens
 * later.
 */
-   if (bdi_cap_writeback_dirty(bdi) && wakeup_bdi)
-   wb_wakeup_delayed(&bdi->wb);
+   if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi)
+   wb_wakeup_delayed(wb);
return;
}
}
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 47/48] mpage: make __mpage_writepage() honor cgroup writeback

2015-03-22 Thread Tejun Heo
__mpage_writepage() is used to implement mpage_writepages() which in
turn is used for ->writepages() of various filesystems.  All writeback
logic is now updated to handle cgroup writeback and the block cgroup
to issue IOs for is encoded in writeback_control and can be retrieved
from the inode; however, __mpage_writepage() currently ignores the
blkcg indicated by the inode and issues all bio's without explicit
blkcg association.

This patch updates __mpage_writepage() so that the issued bio's are
associated with inode_to_wb_blkcg_css(inode).

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Andrew Morton 
Cc: Alexander Viro 
---
 fs/mpage.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..a3ccb0b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -605,6 +605,8 @@ alloc_new:
bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
if (bio == NULL)
goto confused;
+
+   bio_associate_blkcg(bio, inode_to_wb_blkcg_css(inode));
}
 
/*
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 39/48] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's

2015-03-22 Thread Tejun Heo
wakeup_flusher_threads() currently only starts writeback on the root
wb (bdi_writeback).  For cgroup writeback support, update the function
to wake up all wbs and distribute the number of pages to write
according to the proportion of each wb's write bandwidth, which is
implemented in wb_split_bdi_pages().
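
For example (hypothetical numbers), with nr_pages=1024 and a wb whose
avg_write_bandwidth is 30 MB/s out of a 120 MB/s bdi total,
wb_split_bdi_pages() hands that wb DIV_ROUND_UP(1024 * 30, 120) = 256
pages to write; nr_pages=LONG_MAX is passed through unchanged.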

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c | 48 ++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c9bda4d..75d5e5c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -196,6 +196,41 @@ int mapping_congested(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(mapping_congested);
 
+/**
+ * wb_split_bdi_pages - split nr_pages to write according to bandwidth
+ * @wb: target bdi_writeback to split @nr_pages to
+ * @nr_pages: number of pages to write for the whole bdi
+ *
+ * Split @wb's portion of @nr_pages according to @wb's write bandwidth in
+ * relation to the total write bandwidth of all wb's w/ dirty inodes on
+ * @wb->bdi.
+ */
+static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
+{
+   unsigned long this_bw = wb->avg_write_bandwidth;
+   unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
+
+   if (nr_pages == LONG_MAX)
+   return LONG_MAX;
+
+   /*
+* This may be called on clean wb's and proportional distribution
+* may not make sense, just use the original @nr_pages in those
+* cases.  In general, we wanna err on the side of writing more.
+*/
+   if (!tot_bw || this_bw >= tot_bw)
+   return nr_pages;
+   else
+   return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
+}
+
+#else  /* CONFIG_CGROUP_WRITEBACK */
+
+static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
+{
+   return nr_pages;
+}
+
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
@@ -1179,8 +1214,17 @@ void wakeup_flusher_threads(long nr_pages, enum 
wb_reason reason)
nr_pages = get_nr_dirty_pages();
 
rcu_read_lock();
-   list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
-   wb_start_writeback(&bdi->wb, nr_pages, false, reason);
+   list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
+   struct bdi_writeback *wb;
+   struct wb_iter iter;
+
+   if (!bdi_has_dirty_io(bdi))
+   continue;
+
+   bdi_for_each_wb(wb, bdi, &iter, 0)
+   wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
+  false, reason);
+   }
rcu_read_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


bpf+tracing next steps. Was: [PATCH v9 tip 3/9] tracing: attach BPF programs to kprobes

2015-03-22 Thread Alexei Starovoitov
On 3/22/15 7:17 PM, Masami Hiramatsu wrote:
> (2015/03/23 3:03), Alexei Starovoitov wrote:
> 
>> User space tools that will compile ktap/dtrace scripts into bpf might
>> use build-id for their own purpose, but that's a different discussion.
> 
> Agreed.
> I'd like to discuss it since kprobe event interface may also have same
> issue.

I'm not sure what 'issue' you're seeing. My understanding is that
build-ids are used by perf to associate binaries with their debug info
and by systemtap to make sure that probes actually match the kernel
they were compiled for. In bpf case it probably will be perf way only.
Are you interested in doing something with bpf ? ;)
I know that Jovi is working on a clang-based front-end, He Kuang is
doing something fancy, and I'm going to focus on 'tcp instrumentation' once
bpf+kprobes is in. I think these efforts will help us make it
concrete and will establish a path towards bpf+tracepoints
(debug tracepoints or trace markers) and eventual integration with perf.
Here is the wish-list (for kernel and userspace) inspired by Brendan:
- access to pid, uid, tid, comm, etc
- access to kernel stack trace
- access to user-level stack trace
- kernel debuginfo for walking kernel structs, and accessing kprobe
entry args as variables
- tracing of uprobes
- tracing of user markers
- user debuginfo for user structs and args
- easy to use language
- library of scripting features
- nice one-liner syntax

I think there is a lot of interest in bpf+tracing and would be good to
align the efforts.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 48/48] ext2: enable cgroup writeback support

2015-03-22 Thread Tejun Heo
Writeback now supports cgroup writeback and the generic writeback,
buffer, libfs, and mpage helpers that ext2 uses are all updated to
work with cgroup writeback.

This patch enables cgroup writeback for ext2 by adding
FS_CGROUP_WRITEBACK to its ->fs_flags.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: linux-e...@vger.kernel.org
---
 fs/ext2/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index d0e746e..549219d 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1543,7 +1543,7 @@ static struct file_system_type ext2_fs_type = {
.name   = "ext2",
.mount  = ext2_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
 };
 MODULE_ALIAS_FS("ext2");
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 37/48] writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info

2015-03-22 Thread Tejun Heo
writeback_in_progress() currently takes @bdi and returns whether
writeback is in progress on its root wb (bdi_writeback).  In
preparation for cgroup writeback support, make it take wb instead.
While at it, make it an inline function.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c   | 15 +--
 include/linux/backing-dev.h | 12 +++-
 mm/page-writeback.c |  4 ++--
 3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c24d6fd..c342c05 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,19 +53,6 @@ struct wb_writeback_work {
struct completion *done;/* set if the caller waits */
 };
 
-/**
- * writeback_in_progress - determine whether there is writeback in progress
- * @bdi: the device's backing_dev_info structure.
- *
- * Determine whether there is writeback waiting to be handled against a
- * backing device.
- */
-int writeback_in_progress(struct backing_dev_info *bdi)
-{
-   return test_bit(WB_writeback_running, &bdi->wb.state);
-}
-EXPORT_SYMBOL(writeback_in_progress);
-
 static inline struct inode *wb_inode(struct list_head *head)
 {
return list_entry(head, struct inode, i_wb_list);
@@ -1465,7 +1452,7 @@ int try_to_writeback_inodes_sb_nr(struct super_block *sb,
  unsigned long nr,
  enum wb_reason reason)
 {
-   if (writeback_in_progress(sb->s_bdi))
+   if (writeback_in_progress(&sb->s_bdi->wb))
return 1;
 
if (!down_read_trylock(>s_umount))
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 81e39ff..216a016 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -156,7 +156,17 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, 
unsigned int max_ratio);
 
 extern struct backing_dev_info noop_backing_dev_info;
 
-int writeback_in_progress(struct backing_dev_info *bdi);
+/**
+ * writeback_in_progress - determine whether there is writeback in progress
+ * @wb: bdi_writeback of interest
+ *
+ * Determine whether there is writeback waiting to be handled against a
+ * bdi_writeback.
+ */
+static inline bool writeback_in_progress(struct bdi_writeback *wb)
+{
+   return test_bit(WB_writeback_running, &wb->state);
+}
 
 static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fa37e73..3e698f4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1455,7 +1455,7 @@ static void balance_dirty_pages(struct address_space 
*mapping,
break;
}
 
-   if (unlikely(!writeback_in_progress(bdi)))
+   if (unlikely(!writeback_in_progress(wb)))
bdi_start_background_writeback(bdi);
 
if (!strictlimit)
@@ -1573,7 +1573,7 @@ pause:
if (!dirty_exceeded && wb->dirty_exceeded)
wb->dirty_exceeded = 0;
 
-   if (writeback_in_progress(bdi))
+   if (writeback_in_progress(wb))
return;
 
/*
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 43/48] writeback: restructure try_writeback_inodes_sb[_nr]()

2015-03-22 Thread Tejun Heo
try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it
handles s_umount locking and skips if writeback is already in
progress.  The in progress test is performed on the root wb
(bdi_writeback) which isn't sufficient for cgroup writeback support.
The test must be done per-wb.

To prepare for the change, this patch factors out
__writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds
@skip_if_busy and moves the in progress test right before queueing the
wb_writeback_work.  try_writeback_inodes_sb_nr() now just grabs
s_umount and invokes __writeback_inodes_sb_nr() with asserted
@skip_if_busy.  This way, later addition of multiple wb handling can
skip only the wb's which already have writeback in progress.

This swaps the order of the in-progress test and the s_umount test,
which can flip the return value when writeback is in progress and
s_umount is being held by someone else, but this shouldn't cause any
meaningful difference.  It's a fringe condition and the return value is an
unsynchronized hint anyway.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c | 52 ++-
 include/linux/writeback.h |  6 +++---
 2 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f565635..be374ae 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1510,19 +1510,8 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
 }
 
-/**
- * writeback_inodes_sb_nr -writeback dirty inodes from given super_block
- * @sb: the superblock
- * @nr: the number of pages to write
- * @reason: reason why some writeback work initiated
- *
- * Start writeback on some inodes on this super_block. No guarantees are made
- * on how many (if any) will be written, and this function does not wait
- * for IO completion of submitted IO.
- */
-void writeback_inodes_sb_nr(struct super_block *sb,
-   unsigned long nr,
-   enum wb_reason reason)
+static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+enum wb_reason reason, bool skip_if_busy)
 {
DEFINE_WB_COMPLETION_ONSTACK(done);
struct wb_writeback_work work = {
@@ -1538,9 +1527,30 @@ void writeback_inodes_sb_nr(struct super_block *sb,
if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
+
+   if (skip_if_busy && writeback_in_progress(&bdi->wb))
+   return;
+
wb_queue_work(&bdi->wb, &work);
wb_wait_for_completion(bdi, &done);
 }
+
+/**
+ * writeback_inodes_sb_nr -writeback dirty inodes from given super_block
+ * @sb: the superblock
+ * @nr: the number of pages to write
+ * @reason: reason why some writeback work initiated
+ *
+ * Start writeback on some inodes on this super_block. No guarantees are made
+ * on how many (if any) will be written, and this function does not wait
+ * for IO completion of submitted IO.
+ */
+void writeback_inodes_sb_nr(struct super_block *sb,
+   unsigned long nr,
+   enum wb_reason reason)
+{
+   __writeback_inodes_sb_nr(sb, nr, reason, false);
+}
 EXPORT_SYMBOL(writeback_inodes_sb_nr);
 
 /**
@@ -1567,19 +1577,15 @@ EXPORT_SYMBOL(writeback_inodes_sb);
  * Invoke writeback_inodes_sb_nr if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
  */
-int try_to_writeback_inodes_sb_nr(struct super_block *sb,
- unsigned long nr,
- enum wb_reason reason)
+bool try_to_writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+  enum wb_reason reason)
 {
-   if (writeback_in_progress(&sb->s_bdi->wb))
-   return 1;
-
if (!down_read_trylock(&sb->s_umount))
-   return 0;
+   return false;
 
-   writeback_inodes_sb_nr(sb, nr, reason);
+   __writeback_inodes_sb_nr(sb, nr, reason, true);
up_read(&sb->s_umount);
-   return 1;
+   return true;
 }
 EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
 
@@ -1591,7 +1597,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
  * Implement by try_to_writeback_inodes_sb_nr()
  * Returns 1 if writeback was started, 0 if not.
  */
-int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
+bool try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
 {
return try_to_writeback_inodes_sb_nr(sb, get_nr_dirty_pages(), reason);
 }
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8e4485f..75349bb 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -93,9 +93,9 @@ struct bdi_writeback;
 void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
 

[PATCH 46/48] buffer, writeback: make __block_write_full_page() honor cgroup writeback

2015-03-22 Thread Tejun Heo
[__]block_write_full_page() is used to implement ->writepage in
various filesystems.  All writeback logic is now updated to handle
cgroup writeback and the block cgroup to issue IOs for is encoded in
writeback_control and can be retrieved from the inode; however,
[__]block_write_full_page() currently ignores the blkcg indicated by
inode and issues all bio's without explicit blkcg association.

This patch adds submit_bh_blkcg() which associates the bio with the
specified blkio cgroup before issuing and uses it in
__block_write_full_page() so that the issued bio's are associated with
inode_to_wb_blkcg_css(inode).

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Andrew Morton 
---
 fs/buffer.c | 26 --
 include/linux/backing-dev.h | 12 
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4aa1dc2..f2d594c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -44,6 +45,9 @@
 #include 
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
+static int submit_bh_blkcg(int rw, struct buffer_head *bh,
+  unsigned long bio_flags,
+  struct cgroup_subsys_state *blkcg_css);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1704,8 +1708,8 @@ static int __block_write_full_page(struct inode *inode, 
struct page *page,
struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr_underway = 0;
-   int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
-   WRITE_SYNC : WRITE);
+   int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+   struct cgroup_subsys_state *blkcg_css = inode_to_wb_blkcg_css(inode);
 
head = create_page_buffers(page, inode,
(1 << BH_Dirty)|(1 << BH_Uptodate));
@@ -1794,7 +1798,7 @@ static int __block_write_full_page(struct inode *inode, 
struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
-   submit_bh(write_op, bh);
+   submit_bh_blkcg(write_op, bh, 0, blkcg_css);
nr_underway++;
}
bh = next;
@@ -1848,7 +1852,7 @@ recover:
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
-   submit_bh(write_op, bh);
+   submit_bh_blkcg(write_op, bh, 0, blkcg_css);
nr_underway++;
}
bh = next;
@@ -3017,7 +3021,9 @@ void guard_bio_eod(int rw, struct bio *bio)
}
 }
 
-int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+static int submit_bh_blkcg(int rw, struct buffer_head *bh,
+  unsigned long bio_flags,
+  struct cgroup_subsys_state *blkcg_css)
 {
struct bio *bio;
int ret = 0;
@@ -3040,6 +3046,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned 
long bio_flags)
 */
bio = bio_alloc(GFP_NOIO, 1);
 
+   if (blkcg_css)
+   bio_associate_blkcg(bio, blkcg_css);
+
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
bio->bi_io_vec[0].bv_page = bh->b_page;
@@ -3070,11 +3079,16 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned 
long bio_flags)
bio_put(bio);
return ret;
 }
+
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+{
+   return submit_bh_blkcg(rw, bh, bio_flags, NULL);
+}
 EXPORT_SYMBOL_GPL(_submit_bh);
 
 int submit_bh(int rw, struct buffer_head *bh)
 {
-   return _submit_bh(rw, bh, 0);
+   return submit_bh_blkcg(rw, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index fee39cd..a9a843c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -394,6 +394,12 @@ static inline struct bdi_writeback *inode_to_wb(struct 
inode *inode)
return inode->i_wb;
 }
 
+static inline struct cgroup_subsys_state *
+inode_to_wb_blkcg_css(struct inode *inode)
+{
+   return inode_to_wb(inode)->blkcg_css;
+}
+
 struct wb_iter {
int start_blkcg_id;
struct radix_tree_iter  tree_iter;
@@ -511,6 +517,12 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 {
 }
 
+static inline struct cgroup_subsys_state *
+inode_to_wb_blkcg_css(struct inode *inode)
+{
+   return blkcg_root_css;
+}
+
 struct wb_iter {
int next_id;
 };
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe 

[PATCH 42/48] writeback: implement wb_wait_for_single_work()

2015-03-22 Thread Tejun Heo
For cgroup writeback, multiple wb_writeback_work items may need to be
issued to accomplish a single task.  The previous patch updated the
waiting mechanism such that wb_wait_for_completion() can wait for
multiple work items.

Issuing multiple work items involves memory allocation which may fail.
As most writeback operations can't fail or block on memory allocation,
in such cases we fall back to sequentially issuing an on-stack work
item, which then needs to be waited upon sequentially.

This patch implements wb_wait_for_single_work() which waits for a
single work item independently from wb_completion waiting so that such
fallback mechanism can be used without getting tangled with the usual
issuing / completion operation.
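
With it, the fallback path looks roughly like the following sketch
(this is the shape bdi_split_work_to_wbs() takes in a later patch in
this series):

	if (!wb_clone_and_queue_work(wb, base_work)) {
		/* clone allocation failed; base_work was queued directly */
		wb_wait_for_single_work(bdi, base_work);
	}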

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 fs/fs-writeback.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 944e53d..f565635 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -52,6 +52,8 @@ struct wb_writeback_work {
unsigned int for_background:1;
unsigned int for_sync:1;/* sync(2) WB_SYNC_ALL writeback */
unsigned int auto_free:1;   /* free on completion */
+   unsigned int single_wait:1;
+   unsigned int single_done:1;
enum wb_reason reason;  /* why was writeback initiated? */
 
struct list_head list;  /* pending work list */
@@ -165,8 +167,11 @@ static void wb_queue_work(struct bdi_writeback *wb,
trace_writeback_queue(wb->bdi, work);
 
spin_lock_bh(&wb->work_lock);
-   if (!test_bit(WB_registered, &wb->state))
+   if (!test_bit(WB_registered, &wb->state)) {
+   if (work->single_wait)
+   work->single_done = 1;
goto out_unlock;
+   }
if (work->done)
atomic_inc(&work->done->cnt);
list_add_tail(&work->list, &wb->work_list);
@@ -231,6 +236,32 @@ int mapping_congested(struct address_space *mapping,
 EXPORT_SYMBOL_GPL(mapping_congested);
 
 /**
+ * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work
+ * @bdi: bdi the work item was issued to
+ * @work: work item to wait for
+ *
+ * Wait for the completion of @work which was issued to one of @bdi's
+ * bdi_writeback's.  The caller must have set @work->single_wait before
+ * issuing it.  This wait operates independently of
+ * wb_wait_for_completion() and also disables automatic freeing of @work.
+ */
+static void wb_wait_for_single_work(struct backing_dev_info *bdi,
+   struct wb_writeback_work *work)
+{
+   if (WARN_ON_ONCE(!work->single_wait))
+   return;
+
+   wait_event(bdi->wb_waitq, work->single_done);
+
+   /*
+* Paired with smp_wmb() in wb_do_writeback() and ensures that all
+* modifications to @work prior to assertion of ->single_done is
+* visible to the caller once this function returns.
+*/
+   smp_rmb();
+}
+
+/**
  * wb_split_bdi_pages - split nr_pages to write according to bandwidth
  * @wb: target bdi_writeback to split @nr_pages to
  * @nr_pages: number of pages to write for the whole bdi
@@ -1169,14 +1200,26 @@ static long wb_do_writeback(struct bdi_writeback *wb)
set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) {
struct wb_completion *done = work->done;
+   bool need_wake_up = false;
 
trace_writeback_exec(wb->bdi, work);
 
wrote += wb_writeback(wb, work);
 
-   if (work->auto_free)
+   if (work->single_wait) {
+   WARN_ON_ONCE(work->auto_free);
+   /* paired w/ rmb in wb_wait_for_single_work() */
+   smp_wmb();
+   work->single_done = 1;
+   need_wake_up = true;
+   } else if (work->auto_free) {
kfree(work);
+   }
+
if (done && atomic_dec_and_test(&done->cnt))
+   need_wake_up = true;
+
+   if (need_wake_up)
wake_up_all(&wb->bdi->wb_waitq);
}
 
-- 
2.1.0



[PATCH 23/48] writeback: attribute stats to the matching per-cgroup bdi_writeback

2015-03-22 Thread Tejun Heo
Until now, all WB_* stats were accounted against the root wb
(bdi_writeback).  Now that multiple wb (bdi_writeback) support is in
place, let's attribute the stats to the respective per-cgroup wb's.

As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
visible behavior differences.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 mm/filemap.c|  2 +-
 mm/page-writeback.c | 22 ++
 mm/truncate.c   |  6 --
 3 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a2b098b..64698fa 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -215,7 +215,7 @@ void __delete_from_page_cache(struct page *page, void 
*shadow,
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
-   dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_RECLAIMABLE);
+   dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE);
}
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 10624e3..d5635cf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2175,10 +2175,13 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 void account_page_redirty(struct page *page)
 {
struct address_space *mapping = page->mapping;
+
if (mapping && mapping_cap_account_dirty(mapping)) {
+   struct bdi_writeback *wb = inode_to_wb(mapping->host);
+
current->nr_dirtied--;
dec_zone_page_state(page, NR_DIRTIED);
-   dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_DIRTIED);
+   dec_wb_stat(wb, WB_DIRTIED);
}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2324,8 +2327,7 @@ int clear_page_dirty_for_io(struct page *page)
if (TestClearPageDirty(page)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
-   dec_wb_stat(&inode_to_bdi(mapping->host)->wb,
-   WB_RECLAIMABLE);
+   dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE);
ret = 1;
}
mem_cgroup_end_page_stat(memcg);
@@ -2343,7 +2345,8 @@ int test_clear_page_writeback(struct page *page)
 
memcg = mem_cgroup_begin_page_stat(page);
if (mapping) {
-   struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+   struct inode *inode = mapping->host;
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;
 
spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -2353,8 +2356,10 @@ int test_clear_page_writeback(struct page *page)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
-   __dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-   __wb_writeout_inc(&bdi->wb);
+   struct bdi_writeback *wb = inode_to_wb(inode);
+
+   __dec_wb_stat(wb, WB_WRITEBACK);
+   __wb_writeout_inc(wb);
}
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2378,7 +2383,8 @@ int __test_set_page_writeback(struct page *page, bool 
keep_write)
 
memcg = mem_cgroup_begin_page_stat(page);
if (mapping) {
-   struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+   struct inode *inode = mapping->host;
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;
 
spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -2388,7 +2394,7 @@ int __test_set_page_writeback(struct page *page, bool 
keep_write)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
-   __inc_wb_stat(&bdi->wb, WB_WRITEBACK);
+   __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index df16f8c..fe2d769 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -113,11 +113,13 @@ void cancel_dirty_page(struct page *page, unsigned int 
account_size)
memcg = mem_cgroup_begin_page_stat(page);
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
+
if (mapping && mapping_cap_account_dirty(mapping)) {
+   struct bdi_writeback *wb 

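For context when reading the hunks above: inode_to_wb() is introduced
earlier in the series as part of the per-inode wb association.  Roughly, it
resolves to something like the sketch below (the i_wb field name is an
assumption based on that earlier patch; this is not the exact definition):

    static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
    {
    #ifdef CONFIG_CGROUP_WRITEBACK
        return inode->i_wb;                  /* wb this inode is attached to */
    #else
        return &inode_to_bdi(inode)->wb;     /* root wb of the inode's bdi */
    #endif
    }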
[PATCH 18/48] writeback: add @gfp to wb_init()

2015-03-22 Thread Tejun Heo
wb_init() currently always uses GFP_KERNEL but the planned cgroup
writeback support needs using other allocation masks.  Add @gfp to
wb_init().

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
---
 mm/backing-dev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b0707d1..805b287 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -291,7 +291,8 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
  */
#define INIT_BW (100 << (20 - PAGE_SHIFT))
 
-static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
+  gfp_t gfp)
 {
int i, err;
 
@@ -315,12 +316,12 @@ static int wb_init(struct bdi_writeback *wb, struct 
backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->work_list);
INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
 
-   err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
+   err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
return err;
 
for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
-   err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
+   err = percpu_counter_init(&wb->stat[i], 0, gfp);
if (err) {
while (--i)
percpu_counter_destroy(&wb->stat[i]);
@@ -378,7 +379,7 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->max_prop_frac = FPROP_FRAC_BASE;
INIT_LIST_HEAD(>bdi_list);
 
-   err = wb_init(&bdi->wb, bdi);
+   err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
if (err)
return err;
 
-- 
2.1.0

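As a minimal illustration of why the mask matters, a hypothetical caller in
mm/backing-dev.c that sets up a per-cgroup wb from a context which must not
recurse into reclaim could now pass a stricter mask (cgwb and the GFP_NOWAIT
choice here are assumptions, not code from this series):

    err = wb_init(&cgwb->wb, bdi, GFP_NOWAIT);   /* hypothetical per-cgroup wb */
    if (err)
        goto err_free;                           /* hypothetical error path */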


[PATCH 11/48] writeback: move backing_dev_info->bdi_stat[] into bdi_writeback

2015-03-22 Thread Tejun Heo
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->bdi_stat[] into wb.

* enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
  enums is changed from BDI_ to WB_.

* BDI_STAT_BATCH() -> WB_STAT_BATCH()

* [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc|dec|sum}_wb_stat(wb, ...)

* bdi_stat[_error]() -> wb_stat[_error]()

* bdi_writeout_inc() -> wb_writeout_inc()

* stat init is moved to bdi_wb_init(), and bdi_wb_exit() is added and
  frees the stats.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
  introducing no behavior changes.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Cc: Jan Kara 
Cc: Wu Fengguang 
Cc: Miklos Szeredi 
Cc: Trond Myklebust 
---
 fs/fs-writeback.c   |  2 +-
 fs/fuse/file.c  | 12 
 fs/nfs/internal.h   |  2 +-
 fs/nfs/write.c  |  3 +-
 include/linux/backing-dev.h | 68 +
 mm/backing-dev.c| 60 ---
 mm/filemap.c|  2 +-
 mm/page-writeback.c | 53 +--
 mm/truncate.c   |  4 +--
 9 files changed, 108 insertions(+), 98 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fea13fe..992a065 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -822,7 +822,7 @@ static bool over_bground_thresh(struct backing_dev_info 
*bdi)
global_page_state(NR_UNSTABLE_NFS) > background_thresh)
return true;
 
-   if (bdi_stat(bdi, BDI_RECLAIMABLE) >
+   if (wb_stat(&bdi->wb, WB_RECLAIMABLE) >
bdi_dirty_limit(bdi, background_thresh))
return true;
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c01ec3b..997c88a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1469,9 +1469,9 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req)
 
list_del(&req->writepages_entry);
for (i = 0; i < req->num_pages; i++) {
-   dec_bdi_stat(bdi, BDI_WRITEBACK);
+   dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-   bdi_writeout_inc(bdi);
+   wb_writeout_inc(&bdi->wb);
}
wake_up(&fi->page_waitq);
 }
@@ -1658,7 +1658,7 @@ static int fuse_writepage_locked(struct page *page)
req->end = fuse_writepage_end;
req->inode = inode;
 
-   inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
+   inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
spin_lock(&fc->lock);
@@ -1773,9 +1773,9 @@ static bool fuse_writepage_in_flight(struct fuse_req 
*new_req,
copy_highpage(old_req->pages[0], page);
spin_unlock(&fc->lock);
 
-   dec_bdi_stat(bdi, BDI_WRITEBACK);
+   dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK_TEMP);
-   bdi_writeout_inc(bdi);
+   wb_writeout_inc(&bdi->wb);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
goto out;
@@ -1872,7 +1872,7 @@ static int fuse_writepages_fill(struct page *page,
req->page_descs[req->num_pages].offset = 0;
req->page_descs[req->num_pages].length = PAGE_SIZE;
 
-   inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
+   inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
err = 0;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9e6475b..7e3c460 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -607,7 +607,7 @@ void nfs_mark_page_unstable(struct page *page)
struct inode *inode = page_file_mapping(page)->host;
 
inc_zone_page_state(page, NR_UNSTABLE_NFS);
-   inc_bdi_stat(inode_to_bdi(inode), BDI_RECLAIMABLE);
+   inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
 __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 849ed78..8bbeafe 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -853,7 +853,8 @@ static void
 nfs_clear_page_commit(struct page *page)
 {
dec_zone_page_state(page, NR_UNSTABLE_NFS);
-   dec_bdi_stat(inode_to_bdi(page_file_mapping(page)->host), 
BDI_RECLAIMABLE);
+   dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
+   WB_RECLAIMABLE);
 }
 
 /* Called holding inode (/cinfo) lock */
diff --git a/include/linux/backing-dev.h 

[PATCHSET 1/3 v2 block/for-4.1/core] writeback: cgroup writeback support

2015-03-22 Thread Tejun Heo
ne.patch
 0018-writeback-add-gfp-to-wb_init.patch
 0019-bdi-separate-out-congested-state-into-a-separate-str.patch
 0020-writeback-add-CONFIG-BDI_CAP-FS-_CGROUP_WRITEBACK.patch
 0021-writeback-make-backing_dev_info-host-cgroup-specific.patch
 0022-writeback-blkcg-associate-each-blkcg_gq-with-the-cor.patch
 0023-writeback-attribute-stats-to-the-matching-per-cgroup.patch
 0024-writeback-let-balance_dirty_pages-work-on-the-matchi.patch
 0025-writeback-make-congestion-functions-per-bdi_writebac.patch
 0026-writeback-blkcg-restructure-blk_-set-clear-_queue_co.patch
 0027-writeback-blkcg-propagate-non-root-blkcg-congestion-.patch
 0028-writeback-implement-and-use-mapping_congested.patch
 0029-writeback-implement-WB_has_dirty_io-wb_state-flag.patch
 0030-writeback-implement-backing_dev_info-tot_write_bandw.patch
 0031-writeback-make-bdi_has_dirty_io-take-multiple-bdi_wr.patch
 0032-writeback-don-t-issue-wb_writeback_work-if-clean.patch
 0033-writeback-make-bdi-min-max_ratio-handling-cgroup-wri.patch
 0034-writeback-implement-bdi_for_each_wb.patch
 0035-writeback-remove-bdi_start_writeback.patch
 0036-writeback-make-laptop_mode_timer_fn-handle-multiple-.patch
 0037-writeback-make-writeback_in_progress-take-bdi_writeb.patch
 0038-writeback-make-bdi_start_background_writeback-take-b.patch
 0039-writeback-make-wakeup_flusher_threads-handle-multipl.patch
 0040-writeback-add-wb_writeback_work-auto_free.patch
 0041-writeback-implement-bdi_wait_for_completion.patch
 0042-writeback-implement-wb_wait_for_single_work.patch
 0043-writeback-restructure-try_writeback_inodes_sb-_nr.patch
 0044-writeback-make-writeback-initiation-functions-handle.patch
 0045-writeback-dirty-inodes-against-their-matching-cgroup.patch
 0046-buffer-writeback-make-__block_write_full_page-honor-.patch
 0047-mpage-make-__mpage_writepage-honor-cgroup-writeback.patch
 0048-ext2-enable-cgroup-writeback-support.patch

0001-0019 are preps.

0020-0045 gradually convert writeback code so that wb (bdi_writeback)
operates as an independent writeback domain instead of bdi
(backing_dev_info), a single bdi can have multiple per-cgroup wb's
working for it, and per-bdi operations are translated and distributed
to all its member wb's.

0046-0048 make the lower layers properly propagate the cgroup
association from the writeback layer and enable cgroup writeback on
ext2.

This patchset is on top of

  block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() 
if __GFP_WAIT isn't set")
+ [1] [PATCH] writeback: fix possible underflow in write bandwidth calculation

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 
review-cgroup-writeback-20150322

diffstat follows.  Thanks.

 Documentation/cgroups/memory.txt |1 
 block/bio.c  |   35 +-
 block/blk-cgroup.c   |   28 +
 block/blk-cgroup.h   |  603 ---
 block/blk-core.c |   70 ++--
 block/blk-integrity.c|1 
 block/blk-sysfs.c|3 
 block/blk-throttle.c |2 
 block/bounce.c   |1 
 block/cfq-iosched.c  |2 
 block/elevator.c |2 
 block/genhd.c|1 
 drivers/block/drbd/drbd_int.h|1 
 drivers/block/drbd/drbd_main.c   |   10 
 drivers/block/pktcdvd.c  |1 
 drivers/char/raw.c   |1 
 drivers/md/bcache/request.c  |1 
 drivers/md/dm.c  |2 
 drivers/md/dm.h  |1 
 drivers/md/md.h  |1 
 drivers/md/raid1.c   |4 
 drivers/md/raid10.c  |2 
 drivers/mtd/devices/block2mtd.c  |1 
 fs/block_dev.c   |9 
 fs/buffer.c  |   60 ++-
 fs/ext2/super.c  |2 
 fs/ext4/extents.c|1 
 fs/ext4/mballoc.c|1 
 fs/ext4/super.c  |1 
 fs/f2fs/node.c   |4 
 fs/f2fs/segment.h|3 
 fs/fs-writeback.c|  609 +--
 fs/fuse/file.c   |   12 
 fs/gfs2/super.c  |2 
 fs/hfs/super.c   |1 
 fs/hfsplus/super.c   |1 
 fs/inode.c   |1 
 fs/mpage.c   |2 
 fs/nfs/filelayout/filelayout.c   |1 
 fs/nfs/internal.h|2 
 fs/nfs/write.c   |3 
 fs/ocfs2/file.c  |1 
 fs/reiserfs/super.c  |1 
 fs/ufs/super.c   |1 
 fs/xfs/xfs_aops.c|   12 
 fs/xfs/xfs_file.c|1 
 include/linux/backing-dev-defs.h |  188 ++
 include/linux/backing-dev.h  |  572 -
 include/linux/bio.h  |3 
 include/linux/blk-cgroup.h   |  631 +++

[PATCH 04/48] memcg: add mem_cgroup_root_css

2015-03-22 Thread Tejun Heo
Add global mem_cgroup_root_css which points to the root memcg css.
This will be used by cgroup writeback support.  If memcg is disabled,
it's defined as ERR_PTR(-EINVAL).

Signed-off-by: Tejun Heo 
Cc: Johannes Weiner 
Cc: Michal Hocko 
---
 include/linux/memcontrol.h | 4 
 mm/memcontrol.c| 2 ++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5fe6411..294498f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -68,6 +68,8 @@ enum mem_cgroup_events_index {
 };
 
 #ifdef CONFIG_MEMCG
+extern struct cgroup_subsys_state *mem_cgroup_root_css;
+
 void mem_cgroup_events(struct mem_cgroup *memcg,
   enum mem_cgroup_events_index idx,
   unsigned int nr);
@@ -196,6 +198,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
+#define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
+
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
 enum mem_cgroup_events_index idx,
 unsigned int nr)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf25d1a..fda7025 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -71,6 +71,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
 
 #define MEM_CGROUP_RECLAIM_RETRIES 5
 static struct mem_cgroup *root_mem_cgroup __read_mostly;
+struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
@@ -4551,6 +4552,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state 
*parent_css)
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
+   mem_cgroup_root_css = &memcg->css;
page_counter_init(&memcg->memory, NULL);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
-- 
2.1.0

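Illustrative use only, not from this patch: because the !CONFIG_MEMCG
definition is ERR_PTR(-EINVAL), a caller that wants to degrade gracefully
would be expected to check it with IS_ERR() before dereferencing:

    struct cgroup_subsys_state *memcg_css = mem_cgroup_root_css;

    if (IS_ERR(memcg_css))
        memcg_css = NULL;    /* memcg disabled; skip cgroup writeback */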


Re: [PATCH v2 0/4] pci: fix unhandled interrupt on shutdown

2015-03-22 Thread Fam Zheng
On Thu, 03/19 19:57, Michael S. Tsirkin wrote:
> Fam Zheng noticed that pci shutdown disables msi and msix of a device while
> device is still active. This was intended to fix kexec with fusion devices but
> had the unintended effect of breaking even regular shutdown when using virtio.

Series:
Reviewed-by: Fam Zheng 

> 
> The same problem would affect any driver which doesn't register
> a level interrupt handler when using msix.
> 
> I think the fix is to avoid touching device on shutdown:
> we clear bus master anyway, so we won't get any more
> msi interrupts, and bus reset will clear the msi/msix
> state eventually anyway.
> 
> The patches seem to all work well for me.  Given they affect all pci devices,
> and the bug has been there since 2.6 times, I think there's no rush: we can
> merge them for 4.1.
> 
> At the same time, once merged, they will likely make a good
> stable candidate.
> 
> Michael S. Tsirkin (4):
>   pci: disable msi/msix at probe time
>   pci: don't disable msi/msix at shutdown
>   pci: make msi/msix shutdown functions static
>   virtio_pci: drop msi_off on probe
> 
>  include/linux/pci.h| 4 
>  drivers/pci/msi.c  | 4 ++--
>  drivers/pci/pci-driver.c   | 8 ++--
>  drivers/virtio/virtio_pci_common.c | 3 ---
>  4 files changed, 8 insertions(+), 11 deletions(-)
> 
> -- 
> MST
> 


Re: [PATCH 1/3] thermal: hisilicon: add new hisilicon thermal sensor driver

2015-03-22 Thread Leo Yan
On Fri, Mar 20, 2015 at 03:55:27PM +, Mark Rutland wrote:
> > > That may be the case in the code as it stands today, but per the binding
> > > the trip points are the temperatures at which an action is to be taken.
> > > 
> > > The thermal-zone has poilling-delay and polling-delay-passive, but
> > > there's no reason you couldn't also use the interrupt to handle the
> > > "hot" trip-point, adn the reset at the "critical" trip-point. All that's
> > > missing is the plumbing in order to do so.
> > > 
> > > So please co-ordinate with the thermal framework to do that.
> > 
> > Let's dig into this point further, so that we can get more specific
> > guidance and prepare well for the next version of the patch set.
> > 
> > After reviewing the thermal framework code, there is currently one smooth
> > way to co-ordinate the trip points with the thermal framework: use the function
> > *thermal_zone_device_register()* to register the sensor, and use the
> > callback function .get_trip_temp to report the trip points' temperatures
> > to the thermal framework.
> > 
> > For the hisi thermal case, the driver currently uses the function
> > *thermal_zone_of_sensor_register()* to register the sensor, but with this
> > approach I have not found standard APIs that a sensor driver can use to
> > get the trip point info from the thermal framework.
> > 
> > I may have missed something in the thermal framework, so if there are
> > existing APIs to get the trip points, could you please point them out?
> 
> I am only familiar with the binding, not the Linux implementation -- The
> latter can change to accomodate your hardware without requiring binding
> changes. Please co-ordinate with the thermal maintainers.

Found that the functions of_thermal_get_trip_points(tz) and
of_thermal_get_ntrips(tz) will help with this.
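A rough sketch of how that could look in the sensor setup path (assumptions:
sensor->tzd holds the thermal_zone_device returned by
thermal_zone_of_sensor_register(), the thermal_trip fields match the current
drivers/thermal/thermal_core.h, and hisi_thermal_set_threshold() is a
hypothetical helper that programs the interrupt threshold):

    const struct thermal_trip *trips = of_thermal_get_trip_points(sensor->tzd);
    int i, ntrips = of_thermal_get_ntrips(sensor->tzd);

    for (i = 0; i < ntrips; i++) {
        /* program the alarm threshold from the passive trip point */
        if (trips[i].type == THERMAL_TRIP_PASSIVE)
            hisi_thermal_set_threshold(sensor, trips[i].temperature);
    }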

> > > > > > +   if (of_property_read_bool(np, 
> > > > > > "hisilicon,tsensor-bind-irq")) {
> > > > > > +
> > > > > > +   if (data->irq_bind_sensor != -1)
> > > > > > +   dev_warn(&pdev->dev, "irq has bound to 
> > > > > > index %d\n",
> > > > > > +data->irq_bind_sensor);
> > > > > > +
> > > > > > +   /* bind irq to this sensor */
> > > > > > +   data->irq_bind_sensor = index;
> > > > > > +   }
> > > > > 
> > > > > I don't see why this should be specified in the DT. Why do you believe
> > > > > it should?
> > > > 
> > The thermal sensor module has four sensors but only one
> > interrupt signal; this interrupt can only be used by one sensor,
> > so we want to use the DT to bind the interrupt to one selected sensor.
> > > 
> > > That's not all that great, though I'm not exactly sure how the kernel
> > > would select the best sensor to measure with. It would be good if you
> > > could talk to the thermal maintainers w.r.t. this.
> > 
> > This will be decided by the silicon, right? Every SoC has a different
> > combination of cpu/gpu/vpu, so which part is hottest may be
> > highly dependent on the individual SoC.
> > 
> > S/W just needs to provide the flexibility so that we can later choose
> > to bind the interrupt to the sensor within the hottest part.
> 
> Then the property you care about is which sensor is closest to what is
> likely to be the hottest component. Given that, the kernel can decide
> how to use the interrupt.

Will modify the driver to dynamically bind the interrupt to the hottest
sensor; thanks for the good suggestion.

Thanks,
Leo Yan


Re: [PATCH RT 2/4] Revert "timers: do not raise softirq unconditionally"

2015-03-22 Thread Mike Galbraith
On Sat, 2015-03-21 at 19:02 +0100, Mike Galbraith wrote:
> On Thu, 2015-03-19 at 10:42 -0600, Thavatchai Makphaibulchoke wrote:
> > On 03/19/2015 10:26 AM, Steven Rostedt wrote:
> > > On Thu, 19 Mar 2015 09:17:09 +0100
> > > Mike Galbraith  wrote:
> > > 
> > > 
> > >> (aw crap, let's go shopping)... so why is the one in timer.c ok?
> > > 
> > > It's not. Sebastian, you said there were no other cases of rt_mutexes
> > > being taken in hard irq context. Looks like timer.c has one.
> > > 
> > > So perhaps the real fix is to get that special case of ownership in
> > > hard interrupt context?
> > > 
> > > -- Steve
> > > 
> > 
> > Steve, I'm still working on the fix we discussed using dummy irq_task.
> > I should be able to submit some time next week, if still interested.
> > 
> > Either that, or I think we should remove the function
> > spin_do_trylock_in_interrupt() to prevent any possibility of running
> > into similar problems in the future.
> 
> Why can't we just let swapper be the owner when in irq with no dummy?
> 
> I have "don't raise timer unconditionally" re-applied, the check for a
> running callback bits of my nohz_full fixlet, and the below on top of
> that, and all _seems_ well.

But not so well on the 64 core box.  That has nothing to do with the hacklet
though; re-applying timers-do-not-raise-softirq-unconditionally.patch
without that hangs the 64 core box during boot with no help from me
other than the patchlet to let nohz work at all, so it seems there's another
issue lurking there.  Hohum.  Without "don't raise..", the big box is fine.

-Mike



Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes

2015-03-22 Thread Hugh Dickins
On Thu, 19 Mar 2015, Konstantin Khlebnikov wrote:
> On 21.02.2015 07:09, Hugh Dickins wrote:
> > 
> > The "team_usage" field added to struct page (in union with "private")
> > is somewhat vaguely named: because while the huge page is sparsely
> > occupied, it counts the occupancy; but once the huge page is fully
> > occupied, it will come to be used differently in a later patch, as
> > the huge mapcount (offset by the HPAGE_PMD_NR occupancy) - it is
> > never possible to map a sparsely occupied huge page, because that
> > would expose stale data to the user.
> 
> That might be a problem if this approach is supposed to be used for
> normal filesystems.

Yes, most filesystems have their own use for page->private.
My concern at this stage has just been to have a good implementation
for tmpfs, but Kirill and others are certainly interested in looking
beyond that.

> Instead of adding a dedicated counter, shmem could
> detect a partially occupied page by scanning through all tail pages and
> checking PageUptodate(), and bump mapcount for all tail pages to prevent
> races between mmap and truncate. Overhead shouldn't be that big, and
> we could also add a fastpath - mark a completely uptodate page with one
> of the unused page flags (PG_private or something).

I do already use PageChecked (PG_owner_priv_1) for just that purpose:
noting all subpages Uptodate (and marked Dirty) when first mapping by
pmd (in 12/24).

But don't bump mapcount on the subpages, just the head: I don't mind
doing a pass down the subpages when it's first hugely mapped, but prefer
to avoid such a pass on every huge map and unmap - seems unnecessary.

The team_usage (== private) field ends up with three or four separate
counts (and an mlocked flag) packed into it: I expect we could trade
some of those counts for scans down the 512 subpages when necessary,
but I doubt it's a good tradeoff; and keeping atomicity would be
difficult (I've never wanted to have to take page_lock or somesuch
on every page in zap_pte_range).  Without atomicity the stats go wrong
(I think Kirill has a problem of that kind in his page_remove_rmap scan).

It will be interesting to see what Kirill does to maintain the stats
for huge pagecache: but he will have no difficulty in finding fields
to store counts, because he's got lots of spare fields in those 511
tail pages - that's a useful benefit of the compound page, but does
prevent the tails from being used in ordinary ways.  (I did try using
team_head[1].team_usage for more, but atomicity needs prevented it.)

> 
> Another (strange) idea is adding separate array of struct huge_page
> into each zone. They will work as headers for huge pages and hold
> that kind of fields. Pageblock flags also could be stored here.

It's not such a strange idea, it is a definite possibility.  Though
I've tended to think of them more as a separate array of struct pages,
one for each of the hugepages.

It's a complication I'll keep away from as long as I can, but something
like that will probably have to come.  Consider the ambiguity of the
head page, whose flags and counts may represent the 4k page mapped
by pte and the 2M page mapped by pmd: there's an absurdity to that,
one that I can live with for now, but expect some nasty case to demand
a change (the way I have it at present, just mlocking the 4k head is
enough to hold the 2M hugepage in memory: that's not good, but should
be quite easily fixed without needing the hugepage array itself).

And I think ideas may emerge from the persistent memory struct page
discussions, which feed in here.  One reason to hold back for now.

Hugh

