Re: [PATCH] xfs: use GFP_NOFS argument in radix_tree_preload
Hi Dave,

Thank you for letting us know. Since we are not XFS experts (nor want
to be), we really wanted to let you guys know about a potential bug
that you might have missed (we are helping you!). That's why Sanidhya
asked (rather than sending a patch) in the first place.

I agree that the comment is misleading and not correct, but
encouraging a student who spends time cleaning up your mistake would
probably be a better way to influence him than shouting :)

Taesoo

On 03/23/15 at 04:24pm, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 01:06:23AM -0400, Sanidhya Kashyap wrote:
> > From: Byoungyoung Lee
> >
> > Following the convention of other file systems, GFP_NOFS
> > should be used as an argument for radix_tree_preload() instead
> > of GFP_KERNEL.
>
> "convention of other filesystems" is not a reason for changing from
> GFP_KERNEL to GFP_NOFS. There are rules for when GFP_NOFS needs to
> be used, and so we only need to change the code if one of those
> rules is triggered. i.e. inside a transaction, holding a lock that
> memory reclaim might require to make progress (e.g. ip->i_ilock,
> buffer locks, etc). The context in which the allocation is made will
> tell you whether GFP_KERNEL is safe or not.
>
> So while the change probably needs to be made, it needs to be made
> for the right reasons. I haven't looked at the code, but I have
> a pretty good idea of the context the allocation is being made
> under. I'd suggest documenting the call path down to
> xfs_mru_cache_insert(), because that will tell you exactly what
> context the allocation is being made in and hence tell everyone else
> the real reason we need to make this change...
>
> Call me picky, pedantic and/or annoying, but if you are looking at
> validating/correcting allocation flags then you need to understand
> the rules and context in which the allocation is being made...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/8] truncate: swap the order of conditionals in cancel_dirty_page()
cancel_dirty_page() currently performs TestClearPageDirty() and then
tests whether the mapping exists and has cap_account_dirty.  This
patch swaps the order so that it performs the mapping tests first.

If the mapping tests fail, the dirty bit is cleared with
ClearPageDirty().  The order of the conditionals is swapped but the
end result is the same.  This will help inode foreign cgroup wb
switching.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 mm/truncate.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index fe2d769..9d40cd4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -108,13 +108,13 @@ void do_invalidatepage(struct page *page, unsigned int offset,
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
-	struct mem_cgroup *memcg;
+	struct address_space *mapping = page->mapping;
 
-	memcg = mem_cgroup_begin_page_stat(page);
-	if (TestClearPageDirty(page)) {
-		struct address_space *mapping = page->mapping;
+	if (mapping && mapping_cap_account_dirty(mapping)) {
+		struct mem_cgroup *memcg;
 
-		if (mapping && mapping_cap_account_dirty(mapping)) {
+		memcg = mem_cgroup_begin_page_stat(page);
+		if (TestClearPageDirty(page)) {
 			struct bdi_writeback *wb = inode_to_wb(mapping->host);
 
 			mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
@@ -123,8 +123,10 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 		}
+		mem_cgroup_end_page_stat(memcg);
+	} else {
+		ClearPageDirty(page);
 	}
-	mem_cgroup_end_page_stat(memcg);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
-- 
2.1.0
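The claim that swapping the conditionals leaves the end result
unchanged can be checked on a toy model.  The following is a userspace
sketch only, not kernel code: struct toy_page, cancel_old(),
cancel_new() and orders_agree() are hypothetical names, and the stat
accounting is reduced to a single "was accounted" return value.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the bits the patch looks at. */
struct toy_page {
	bool dirty;		/* models PageDirty */
	bool has_mapping;	/* models page->mapping != NULL */
	bool cap_account;	/* models mapping_cap_account_dirty() */
};

/* Old order: test-and-clear dirty first, then check the mapping. */
static int cancel_old(struct toy_page *p)
{
	int accounted = 0;

	if (p->dirty) {
		p->dirty = false;
		if (p->has_mapping && p->cap_account)
			accounted = 1;	/* stats would be decremented here */
	}
	return accounted;
}

/* New order: check the mapping first, else clear dirty unconditionally. */
static int cancel_new(struct toy_page *p)
{
	int accounted = 0;

	if (p->has_mapping && p->cap_account) {
		if (p->dirty) {
			p->dirty = false;
			accounted = 1;
		}
	} else {
		p->dirty = false;
	}
	return accounted;
}

/* Exhaustively compare both orders over all 8 input states. */
static bool orders_agree(void)
{
	for (int s = 0; s < 8; s++) {
		struct toy_page a = { (s & 1) != 0, (s & 2) != 0, (s & 4) != 0 };
		struct toy_page b = a;
		int old_acct = cancel_old(&a);
		int new_acct = cancel_new(&b);

		/* Both variants always leave the page clean. */
		assert(!a.dirty && !b.dirty);
		if (old_acct != new_acct)
			return false;
	}
	return true;
}
```

In this model the only difference between the two variants is which
test runs first; the dirty bit and the accounting decision come out
identical for every input.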
[PATCHSET 3/3 block/for-4.1/core] writeback: implement foreign cgroup inode bdi_writeback switching
Hello,

The previous two patchsets [2][3] implemented cgroup writeback support
and backpressure propagation through dirty throttling mechanism;
however, the inode is assigned to the wb (bdi_writeback) matching the
first dirtied page and stays there until released.  This first-use
policy can easily lead to gross misbehaviors - a single stray dirty
page can cause gigabytes to be written by the wrong cgroup.  Also,
while concurrently write sharing an inode is extremely rare and
unsupported, inodes jumping cgroups over time are more common.

This patchset implements foreign cgroup inode detection and wb
switching.  Each writeback run tracks the majority wb being written
using a simple but fairly robust algorithm and when an inode
persistently writes out more foreign cgroup pages than local ones, the
inode is transferred to the majority winner.

This patchset adds 8 bytes to inode making the total per-inode space
overhead of cgroup writeback support 16 bytes on 64bit systems.  The
computational overhead should be negligible.  If the writer changes
from one cgroup to another entirely, the mechanism can render the
correct switch verdict in several seconds of IO time in most cases and
it can converge on the correct answer in a reasonable amount of time
even in more ambiguous cases.

This patchset contains the following 8 patches.

 0001-writeback-relocate-wb-_try-_get-wb_put-inode_-attach.patch
 0002-writeback-make-writeback_control-track-the-inode-bei.patch
 0003-writeback-implement-foreign-cgroup-inode-detection.patch
 0004-truncate-swap-the-order-of-conditionals-in-cancel_di.patch
 0005-writeback-implement-locked_-inode_to_wb_and_lock_lis.patch
 0006-writeback-implement-I_WB_SWITCH-and-bdi_writeback-st.patch
 0007-writeback-add-lockdep-annotation-to-inode_to_wb.patch
 0008-writeback-implement-foreign-cgroup-inode-bdi_writeba.patch

This patchset is on top of

  block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
+ [1] [PATCH] writeback: fix possible underflow in write bandwidth calculation
+ [2] [PATCHSET 1/3 v2 block/for-4.1/core] writeback: cgroup writeback support
+ [3] [PATCHSET 2/3 block/for-4.1/core] writeback: cgroup writeback backpressure propagation

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-switch-20150322

diffstat follows.  Thanks.

 fs/buffer.c                      |  26
 fs/fs-writeback.c                | 499
 fs/mpage.c                       |   3
 include/linux/backing-dev-defs.h |  50
 include/linux/backing-dev.h      | 136
 include/linux/fs.h               |  11
 include/linux/writeback.h        | 123
 mm/backing-dev.c                 |  30
 mm/filemap.c                     |   2
 mm/page-writeback.c              |  16
 mm/truncate.c                    |  21
 11 files changed, 773 insertions(+), 144 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/1420579582-8516-1-git-send-email...@kernel.org
[1] http://lkml.kernel.org/g/20150323041848.ga8...@htj.duckdns.org
[2] http://lkml.kernel.org/g/1427086499-15657-1-git-send-email...@kernel.org
[3] http://lkml.kernel.org/g/1427087267-16592-1-git-send-email...@kernel.org
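The cover letter only says that the majority wb is tracked with "a
simple but fairly robust algorithm".  One classic technique of that
kind is the Boyer-Moore majority vote; the sketch below assumes that
technique purely for illustration and is not taken from the patches.
struct io_sample, majority_candidate() and owns_majority() are
hypothetical names, and each sample is weighted by its byte count.

```c
#include <stdbool.h>
#include <stddef.h>

/* One I/O sample: which cgroup (by id) owns it and how many bytes. */
struct io_sample {
	int owner_id;
	size_t bytes;
};

/*
 * Weighted Boyer-Moore majority vote.  If some id owns more than half
 * of all bytes, that id is returned.  Otherwise the result is only a
 * candidate and has to be confirmed with owns_majority().
 */
static int majority_candidate(const struct io_sample *s, size_t n)
{
	int cand = 0;
	size_t weight = 0;

	for (size_t i = 0; i < n; i++) {
		if (s[i].owner_id == cand) {
			weight += s[i].bytes;		/* vote for candidate */
		} else if (s[i].bytes > weight) {
			cand = s[i].owner_id;		/* candidate replaced */
			weight = s[i].bytes - weight;
		} else {
			weight -= s[i].bytes;		/* vote against */
		}
	}
	return cand;
}

/* Second pass: does @id really own more than half of the bytes? */
static bool owns_majority(const struct io_sample *s, size_t n, int id)
{
	size_t total = 0, owned = 0;

	for (size_t i = 0; i < n; i++) {
		total += s[i].bytes;
		if (s[i].owner_id == id)
			owned += s[i].bytes;
	}
	return owned * 2 > total;
}
```

The appeal of this approach for a per-writeback-run tracker is that it
needs only one candidate id and one counter, whatever the number of
cgroups involved.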
[PATCH 5/8] writeback: implement [locked_]inode_to_wb_and_lock_list()
cgroup writeback currently assumes that inode to wb association
doesn't change; however, with the planned foreign inode wb switching
mechanism, the association will change dynamically.

When an inode needs to be put on one of the IO lists of its wb, the
current code simply calls inode_to_wb() and locks the returned wb;
however, with the planned wb switching, the association may change
before locking the wb and may even get released.

This patch implements [locked_]inode_to_wb_and_lock_list() which pins
the associated wb while holding i_lock, releases it, acquires
wb->list_lock and verifies that the association hasn't changed
in between.  As the association will be protected by both locks among
other things, this guarantees that the wb is the inode's associated wb
until the list_lock is released.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 fs/fs-writeback.c | 80
 1 file changed, 75 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 478d26e..e888c4b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -246,6 +246,56 @@ void __inode_attach_wb(struct inode *inode, struct page *page)
 }
 
 /**
+ * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
+ * @inode: inode of interest with i_lock held
+ *
+ * Returns @inode's wb with its list_lock held.  @inode->i_lock must be
+ * held on entry and is released on return.  The returned wb is guaranteed
+ * to stay @inode's associated wb until its list_lock is released.
+ */
+static struct bdi_writeback *
+locked_inode_to_wb_and_lock_list(struct inode *inode)
+	__releases(&inode->i_lock)
+	__acquires(&wb->list_lock)
+{
+	while (true) {
+		struct bdi_writeback *wb = inode_to_wb(inode);
+
+		/*
+		 * inode_to_wb() association is protected by both
+		 * @inode->i_lock and @wb->list_lock but list_lock nests
+		 * outside i_lock.  Drop i_lock and verify that the
+		 * association hasn't changed after acquiring list_lock.
+		 */
+		wb_get(wb);
+		spin_unlock(&inode->i_lock);
+		spin_lock(&wb->list_lock);
+		wb_put(wb);		/* not gonna deref it anymore */
+
+		if (likely(wb == inode_to_wb(inode)))
+			return wb;	/* @inode already has ref */
+
+		spin_unlock(&wb->list_lock);
+		cpu_relax();
+		spin_lock(&inode->i_lock);
+	}
+}
+
+/**
+ * inode_to_wb_and_lock_list - determine an inode's wb and lock it
+ * @inode: inode of interest
+ *
+ * Same as locked_inode_to_wb_and_lock_list() but @inode->i_lock isn't held
+ * on entry.
+ */
+static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode)
+	__acquires(&wb->list_lock)
+{
+	spin_lock(&inode->i_lock);
+	return locked_inode_to_wb_and_lock_list(inode);
+}
+
+/**
  * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
  * @wbc: writeback_control of interest
  * @inode: target inode
@@ -591,6 +641,27 @@ restart:
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
+static struct bdi_writeback *
+locked_inode_to_wb_and_lock_list(struct inode *inode)
+	__releases(&inode->i_lock)
+	__acquires(&wb->list_lock)
+{
+	struct bdi_writeback *wb = inode_to_wb(inode);
+
+	spin_unlock(&inode->i_lock);
+	spin_lock(&wb->list_lock);
+	return wb;
+}
+
+static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode)
+	__acquires(&wb->list_lock)
+{
+	struct bdi_writeback *wb = inode_to_wb(inode);
+
+	spin_lock(&wb->list_lock);
+	return wb;
+}
+
 static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
 {
 	return nr_pages;
@@ -666,9 +737,9 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	struct bdi_writeback *wb = inode_to_wb(inode);
+	struct bdi_writeback *wb;
 
-	spin_lock(&wb->list_lock);
+	wb = inode_to_wb_and_lock_list(inode);
 	inode_wb_list_del_locked(inode, wb);
 	spin_unlock(&wb->list_lock);
 }
@@ -1713,11 +1784,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			 * reposition it (that would break b_dirty time-ordering).
 			 */
 			if (!was_dirty) {
-				struct bdi_writeback *wb = inode_to_wb(inode);
+				struct bdi_writeback *wb;
 				bool wakeup_bdi = false;
 
-				spin_unlock(&inode->i_lock);
-				spin_lock(&wb->list_lock);
+				wb = locked_inode_to_wb_and_lock_list(inode);
 
 				WARN(bdi_cap_writeback_dirty(wb->bdi) &&
 				     !test_bit(WB_registered, &wb->state),
-- 
2.1.0
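The pin, drop, relock and revalidate sequence described in this patch
can be modelled in userspace.  The sketch below is an illustration
under stated assumptions, not the kernel implementation: pthread
mutexes stand in for spinlocks, a C11 atomic counter stands in for the
percpu refcount, and every toy_* name is hypothetical.  It also
assumes, as the patch description does, that whoever changes
inode->wb holds both locks.

```c
#include <pthread.h>
#include <stdatomic.h>

struct toy_wb {
	pthread_mutex_t list_lock;
	atomic_int refcnt;		/* stands in for wb_get()/wb_put() */
};

struct toy_inode {
	pthread_mutex_t i_lock;
	struct toy_wb *wb;		/* changed only with both locks held */
};

/*
 * Called with inode->i_lock held.  Returns the inode's wb with its
 * list_lock held and i_lock released.
 */
static struct toy_wb *
toy_locked_inode_to_wb_and_lock_list(struct toy_inode *inode)
{
	for (;;) {
		struct toy_wb *wb = inode->wb;

		/* Pin wb so it cannot go away once i_lock is dropped. */
		atomic_fetch_add(&wb->refcnt, 1);
		pthread_mutex_unlock(&inode->i_lock);
		pthread_mutex_lock(&wb->list_lock);
		atomic_fetch_sub(&wb->refcnt, 1);

		/* Still the same association?  Then list_lock keeps it so. */
		if (wb == inode->wb)
			return wb;

		/* Lost a race with a switch: back off and retry. */
		pthread_mutex_unlock(&wb->list_lock);
		pthread_mutex_lock(&inode->i_lock);
	}
}

/* Same as above, but i_lock is not held on entry. */
static struct toy_wb *toy_inode_to_wb_and_lock_list(struct toy_inode *inode)
{
	pthread_mutex_lock(&inode->i_lock);
	return toy_locked_inode_to_wb_and_lock_list(inode);
}
```

The retry loop exists because list_lock nests outside i_lock, so the
two cannot simply be taken in the order the caller would find
convenient.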
Re: [PATCH] lguest: simplify lguest_iret
Denys Vlasenko writes: > Signed-off-by: Denys Vlasenko > CC: lgu...@lists.ozlabs.org > CC: x...@kernel.org > CC: linux-kernel@vger.kernel.org Oh, thanks, applied! And now it's down to one instruction, we could change try_deliver_interrupt() to handle this case (rather than ignoring the interrupt altogether): just jump to the handler and leave the stack set up. Here's a pair of inline patches which attempt to do that (untested!). Thanks, Rusty. lguest: suppress interrupts for single insn, not range. The last patch reduced our interrupt-suppression region to one address, so simplify the code somewhat. Also, remove the obsolete undefined instruction ranges and the comment which refers to lguest_guest.S instead of head_32.S. Signed-off-by: Rusty Russell diff --git a/arch/x86/include/asm/lguest.h b/arch/x86/include/asm/lguest.h index e2d4a4afa8c3..3bbc07a57a31 100644 --- a/arch/x86/include/asm/lguest.h +++ b/arch/x86/include/asm/lguest.h @@ -20,13 +20,10 @@ extern unsigned long switcher_addr; /* Found in switcher.S */ extern unsigned long default_idt_entries[]; -/* Declarations for definitions in lguest_guest.S */ -extern char lguest_noirq_start[], lguest_noirq_end[]; +/* Declarations for definitions in arch/x86/lguest/head_32.S */ +extern char lguest_noirq_iret[]; extern const char lgstart_cli[], lgend_cli[]; -extern const char lgstart_sti[], lgend_sti[]; -extern const char lgstart_popf[], lgend_popf[]; extern const char lgstart_pushf[], lgend_pushf[]; -extern const char lgstart_iret[], lgend_iret[]; extern void lguest_iret(void); extern void lguest_init(void); diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c index ac4453d8520e..4c8cf656f21f 100644 --- a/arch/x86/lguest/boot.c +++ b/arch/x86/lguest/boot.c @@ -87,8 +87,7 @@ struct lguest_data lguest_data = { .hcall_status = { [0 ... 
LHCALL_RING_SIZE-1] = 0xFF }, - .noirq_start = (u32)lguest_noirq_start, - .noirq_end = (u32)lguest_noirq_end, + .noirq_iret = (u32)lguest_noirq_iret, .kernel_address = PAGE_OFFSET, .blocked_interrupts = { 1 }, /* Block timer interrupts */ .syscall_vec = SYSCALL_VECTOR, diff --git a/arch/x86/lguest/head_32.S b/arch/x86/lguest/head_32.S index 583732cc5d18..128fe93161b4 100644 --- a/arch/x86/lguest/head_32.S +++ b/arch/x86/lguest/head_32.S @@ -133,9 +133,8 @@ ENTRY(lg_restore_fl) ret /*:*/ -/* These demark the EIP range where host should never deliver interrupts. */ -.global lguest_noirq_start -.global lguest_noirq_end +/* These demark the EIP where host should never deliver interrupts. */ +.global lguest_noirq_iret /*M:004 * When the Host reflects a trap or injects an interrupt into the Guest, it @@ -174,12 +173,11 @@ ENTRY(lg_restore_fl) * * The second is harder: copying eflags to lguest_data.irq_enabled will turn * interrupts on before we're finished, so we could be interrupted before we - * return to userspace or wherever. Our solution to this is to surround the - * code with lguest_noirq_start: and lguest_noirq_end: labels. We tell the + * return to userspace or wherever. Our solution to this is to tell the * Host that it is *never* to interrupt us there, even if interrupts seem to be * enabled. (It's not necessary to protect pop instruction, since - * data gets updated only after it completes, so we end up surrounding - * just one instruction, iret). + * data gets updated only after it completes, so we only need to protect + * one instruction, iret). */ ENTRY(lguest_iret) pushl 2*4(%esp) @@ -190,6 +188,5 @@ ENTRY(lguest_iret) * prefix makes sure we use the stack segment, which is still valid. 
*/ popl%ss:lguest_data+LGUEST_DATA_irq_enabled -lguest_noirq_start: +lguest_noirq_iret: iret -lguest_noirq_end: diff --git a/drivers/lguest/hypercalls.c b/drivers/lguest/hypercalls.c index 1219af493c0f..19a32280731d 100644 --- a/drivers/lguest/hypercalls.c +++ b/drivers/lguest/hypercalls.c @@ -211,10 +211,9 @@ static void initialize(struct lg_cpu *cpu) /* * The Guest tells us where we're not to deliver interrupts by putting -* the range of addresses into "struct lguest_data". +* the instruction address into "struct lguest_data". */ - if (get_user(cpu->lg->noirq_start, >lg->lguest_data->noirq_start) - || get_user(cpu->lg->noirq_end, >lg->lguest_data->noirq_end)) + if (get_user(cpu->lg->noirq_iret, >lg->lguest_data->noirq_iret)) kill_guest(cpu, "bad guest page %p", cpu->lg->lguest_data); /* diff --git a/drivers/lguest/interrupts_and_traps.c b/drivers/lguest/interrupts_and_traps.c index 70dfcdc29f1f..6d4c072b61e1 100644 --- a/drivers/lguest/interrupts_and_traps.c +++ b/drivers/lguest/interrupts_and_traps.c @@ -204,8 +204,7 @@ void try_deliver_interrupt(struct lg_cpu *cpu, unsigned int irq, bool more) * They may be in the middle of an iret, where they asked us never to *
[PATCH 2/8] writeback: make writeback_control track the inode being written back
Currently, for cgroup writeback, the IO submission paths directly associate the bio's with the blkcg from inode_to_wb_blkcg_css(); however, it'd be necessary to keep more writeback context to implement foreign inode writeback detection. wbc (writeback_control) is the natural fit for the extra context - it persists throughout the writeback of each inode and is passed all the way down to IO submission paths. This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and wbc_attach_fdatawrite_inode() which are used to associate wbc with the inode being written back. IO submission paths now use wbc_init_bio() instead of directly associating bio's with blkcg themselves. This leaves inode_to_wb_blkcg_css() w/o any user. The function is removed. wbc currently only tracks the associated wb (bdi_writeback). Future patches will add more for foreign inode detection. The association is established under i_lock which will be depended upon when migrating foreign inodes to other wb's. As currently, once established, inode to wb association never changes, going through wbc when initializing bio's doesn't cause any behavior changes. 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen --- fs/buffer.c | 24 -- fs/fs-writeback.c | 37 +-- fs/mpage.c | 2 +- include/linux/backing-dev.h | 12 - include/linux/writeback.h | 61 + mm/filemap.c| 2 ++ 6 files changed, 110 insertions(+), 28 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index f2d594c..cb2c7ec 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -45,9 +45,9 @@ #include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); -static int submit_bh_blkcg(int rw, struct buffer_head *bh, - unsigned long bio_flags, - struct cgroup_subsys_state *blkcg_css); +static int submit_bh_wbc(int rw, struct buffer_head *bh, +unsigned long bio_flags, +struct writeback_control *wbc); #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers) @@ -1709,7 +1709,6 @@ static int __block_write_full_page(struct inode *inode, struct page *page, unsigned int blocksize, bbits; int nr_underway = 0; int write_op = (wbc->sync_mode == WB_SYNC_ALL ? 
WRITE_SYNC : WRITE); - struct cgroup_subsys_state *blkcg_css = inode_to_wb_blkcg_css(inode); head = create_page_buffers(page, inode, (1 << BH_Dirty)|(1 << BH_Uptodate)); @@ -1798,7 +1797,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page, do { struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { - submit_bh_blkcg(write_op, bh, 0, blkcg_css); + submit_bh_wbc(write_op, bh, 0, wbc); nr_underway++; } bh = next; @@ -1852,7 +1851,7 @@ recover: struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { clear_buffer_dirty(bh); - submit_bh_blkcg(write_op, bh, 0, blkcg_css); + submit_bh_wbc(write_op, bh, 0, wbc); nr_underway++; } bh = next; @@ -3021,9 +3020,8 @@ void guard_bio_eod(int rw, struct bio *bio) } } -static int submit_bh_blkcg(int rw, struct buffer_head *bh, - unsigned long bio_flags, - struct cgroup_subsys_state *blkcg_css) +static int submit_bh_wbc(int rw, struct buffer_head *bh, +unsigned long bio_flags, struct writeback_control *wbc) { struct bio *bio; int ret = 0; @@ -3046,8 +3044,8 @@ static int submit_bh_blkcg(int rw, struct buffer_head *bh, */ bio = bio_alloc(GFP_NOIO, 1); - if (blkcg_css) - bio_associate_blkcg(bio, blkcg_css); + if (wbc) + wbc_init_bio(wbc, bio); bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio->bi_bdev = bh->b_bdev; @@ -3082,13 +3080,13 @@ static int submit_bh_blkcg(int rw, struct buffer_head *bh, int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) { - return submit_bh_blkcg(rw, bh, bio_flags, NULL); + return submit_bh_wbc(rw, bh, bio_flags, NULL); } EXPORT_SYMBOL_GPL(_submit_bh); int submit_bh(int rw, struct buffer_head *bh) { - return submit_bh_blkcg(rw, bh, 0, NULL); + return submit_bh_wbc(rw, bh, 0, NULL); } EXPORT_SYMBOL(submit_bh); diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index dfb7bb6..da87463 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -232,6 +232,37 @@ void
[PATCH 3/8] writeback: implement foreign cgroup inode detection
As concurrent write sharing of an inode is expected to be very rare
and memcg only tracks page ownership on first-use basis severely
confining the usefulness of such sharing, cgroup writeback tracks
ownership per-inode.  While the support for concurrent write sharing
of an inode is deemed unnecessary, an inode being written to by
different cgroups at different points in time is a lot more common,
and, more importantly, charging only by first-use can too readily lead
to grossly incorrect behaviors (a single foreign page can lead to
gigabytes of writeback being incorrectly attributed).

To resolve this issue, cgroup writeback detects the majority dirtier
of an inode and will transfer the ownership to it.  To avoid
unnecessary oscillation, the detection mechanism keeps track of
history and gives out the switch verdict only if the foreign usage
pattern is stable over a certain amount of time and/or writeback
attempts.

The detection mechanism has fairly low space and computation overhead.
It adds 8 bytes to struct inode (one int and two u16's) and minimal
amount of calculation per IO.  The detection mechanism converges to
the correct answer usually in several seconds of IO time when there's
a clear majority dirtier.  Even when there isn't, it can reach an
acceptable answer fairly quickly under most circumstances.

Please see wbc_detach_inode() for more details.

This patch only implements detection.  Following patches will
implement actual switching.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen --- fs/buffer.c | 4 +- fs/fs-writeback.c | 168 +- fs/mpage.c| 1 + include/linux/fs.h| 5 ++ include/linux/writeback.h | 16 + 5 files changed, 191 insertions(+), 3 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index cb2c7ec..a404d8e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -3044,8 +3044,10 @@ static int submit_bh_wbc(int rw, struct buffer_head *bh, */ bio = bio_alloc(GFP_NOIO, 1); - if (wbc) + if (wbc) { wbc_init_bio(wbc, bio); + wbc_account_io(wbc, bh->b_page, bh->b_size); + } bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio->bi_bdev = bh->b_bdev; diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index da87463..478d26e 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -201,6 +201,20 @@ static void wb_wait_for_completion(struct backing_dev_info *bdi, #ifdef CONFIG_CGROUP_WRITEBACK +/* parameters for foreign inode detection, see wb_detach_inode() */ +#define WB_FRN_TIME_SHIFT 13 /* 1s = 2^13, upto 8 secs w/ 16bit */ +#define WB_FRN_TIME_AVG_SHIFT 3 /* avg = avg * 7/8 + new * 1/8 */ +#define WB_FRN_TIME_CUT_DIV2 /* ignore rounds < avg / 2 */ +#define WB_FRN_TIME_PERIOD (2 * (1 << WB_FRN_TIME_SHIFT)) /* 2s */ + +#define WB_FRN_HIST_SLOTS 16 /* inode->i_wb_frn_history is 16bit */ +#define WB_FRN_HIST_UNIT (WB_FRN_TIME_PERIOD / WB_FRN_HIST_SLOTS) + /* each slot's duration is 2s / 16 */ +#define WB_FRN_HIST_THR_SLOTS (WB_FRN_HIST_SLOTS / 2) + /* if foreign slots >= 8, switch */ +#define WB_FRN_HIST_MAX_SLOTS (WB_FRN_HIST_THR_SLOTS / 2 + 1) + /* one round can affect upto 5 slots */ + void __inode_attach_wb(struct inode *inode, struct page *page) { struct backing_dev_info *bdi = inode_to_bdi(inode); @@ -245,24 +259,174 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc, struct inode *inode) { wbc->wb = inode_to_wb(inode); + wbc->inode = inode; + + wbc->wb_id = wbc->wb->memcg_css->id; + wbc->wb_lcand_id = inode->i_wb_frn_winner; + 
wbc->wb_tcand_id = 0; + wbc->wb_bytes = 0; + wbc->wb_lcand_bytes = 0; + wbc->wb_tcand_bytes = 0; + wb_get(wbc->wb); spin_unlock(>i_lock); } /** - * wbc_detach_inode - disassociate wbc from its target inode - * @wbc: writeback_control of interest + * wbc_detach_inode - disassociate wbc from inode and perform foreign detection + * @wbc: writeback_control of the just finished writeback * * To be called after a writeback attempt of an inode finishes and undoes * wbc_attach_and_unlock_inode(). Can be called under any context. + * + * As concurrent write sharing of an inode is expected to be very rare and + * memcg only tracks page ownership on first-use basis severely confining + * the usefulness of such sharing, cgroup writeback tracks ownership + * per-inode. While the support for concurrent write sharing of an inode + * is deemed unnecessary, an inode being written to by different cgroups at + * different points in time is a lot more common, and, more importantly, + * charging only by first-use can too readily lead to grossly incorrect
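The comments next to the WB_FRN_* defines above describe two pieces
of arithmetic: a running average of the form avg = avg * 7/8 + new *
1/8, and a 16-slot history in which a switch is indicated once at
least 8 slots are foreign.  The sketch below works those two formulas
through in userspace C.  It is based only on those comments, since the
function body is cut off in this posting; the frn_* helper names are
hypothetical and the real code may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRN_TIME_AVG_SHIFT	3			/* avg = avg * 7/8 + new * 1/8 */
#define FRN_HIST_SLOTS		16			/* history is 16 bits wide */
#define FRN_HIST_THR_SLOTS	(FRN_HIST_SLOTS / 2)	/* >= 8 foreign slots */

/* Integer form of avg * 7/8 + sample * 1/8 using shifts. */
static uint32_t frn_avg_update(uint32_t avg, uint32_t sample)
{
	return avg - (avg >> FRN_TIME_AVG_SHIFT) +
	       (sample >> FRN_TIME_AVG_SHIFT);
}

/*
 * Record one writeback round covering @slots history slots.  Old
 * history is shifted out; the new slots are set when the round was
 * dominated by a foreign cgroup and left clear otherwise.
 */
static uint16_t frn_hist_push(uint16_t hist, unsigned int slots, bool foreign)
{
	assert(slots >= 1 && slots < FRN_HIST_SLOTS);

	hist = (uint16_t)(hist << slots);
	if (foreign)
		hist |= (uint16_t)((1u << slots) - 1);
	return hist;
}

/* Count the foreign slots currently recorded in the history. */
static unsigned int frn_hist_weight(uint16_t hist)
{
	unsigned int n = 0;

	while (hist) {
		n += hist & 1;
		hist >>= 1;
	}
	return n;
}

/* Switch verdict as described by the ">= 8 slots" comment. */
static bool frn_should_switch(uint16_t hist)
{
	return frn_hist_weight(hist) >= FRN_HIST_THR_SLOTS;
}
```

Because a single round can set only a limited number of slots, one
burst of foreign IO cannot reach the threshold on its own; the pattern
has to persist across rounds, which is what damps oscillation.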
[PATCH 1/8] writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
Currently, majority of cgroup writeback support including all the above functions are implemented in include/linux/backing-dev.h and mm/backing-dev.c; however, the portion closely related to writeback logic implemented in include/linux/writeback.h and mm/page-writeback.c will expand to support foreign writeback detection and correction. This patch moves wb[_try]_get() and wb_put() to include/linux/backing-dev-defs.h so that they can be used from writeback.h and inode_{attach|detach}_wb() to writeback.h and page-writeback.c. This is pure reorganization and doesn't introduce any functional changes. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen --- fs/fs-writeback.c| 31 +++ include/linux/backing-dev-defs.h | 50 include/linux/backing-dev.h | 82 include/linux/writeback.h| 46 ++ mm/backing-dev.c | 30 --- 5 files changed, 127 insertions(+), 112 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 683bd92..dfb7bb6 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "internal.h" /* @@ -200,6 +201,36 @@ static void wb_wait_for_completion(struct backing_dev_info *bdi, #ifdef CONFIG_CGROUP_WRITEBACK +void __inode_attach_wb(struct inode *inode, struct page *page) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback *wb = NULL; + + if (inode_cgwb_enabled(inode)) { + struct cgroup_subsys_state *memcg_css; + + if (page) { + memcg_css = mem_cgroup_css_from_page(page); + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + } else { + /* must pin memcg_css, see wb_get_create() */ + memcg_css = task_get_css(current, memory_cgrp_id); + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + css_put(memcg_css); + } + } + + if (!wb) + wb = >wb; + + /* +* There may be multiple instances of this function racing to +* update the same inode. Use cmpxchg() to tell the winner. 
+*/ + if (unlikely(cmpxchg(>i_wb, NULL, wb))) + wb_put(wb); +} + /** * mapping_congested - test whether a mapping is congested for a task * @mapping: address space to test for congestion diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 8d470b7..e047b49 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -186,4 +186,54 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync) set_wb_congested(bdi->wb.congested, sync); } +#ifdef CONFIG_CGROUP_WRITEBACK + +/** + * wb_tryget - try to increment a wb's refcount + * @wb: bdi_writeback to get + */ +static inline bool wb_tryget(struct bdi_writeback *wb) +{ + if (wb != >bdi->wb) + return percpu_ref_tryget(>refcnt); + return true; +} + +/** + * wb_get - increment a wb's refcount + * @wb: bdi_writeback to get + */ +static inline void wb_get(struct bdi_writeback *wb) +{ + if (wb != >bdi->wb) + percpu_ref_get(>refcnt); +} + +/** + * wb_put - decrement a wb's refcount + * @wb: bdi_writeback to put + */ +static inline void wb_put(struct bdi_writeback *wb) +{ + if (wb != >bdi->wb) + percpu_ref_put(>refcnt); +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static inline bool wb_tryget(struct bdi_writeback *wb) +{ + return true; +} + +static inline void wb_get(struct bdi_writeback *wb) +{ +} + +static inline void wb_put(struct bdi_writeback *wb) +{ +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + #endif /* __LINUX_BACKING_DEV_DEFS_H */ diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index a9a843c..119f0af 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -243,7 +243,6 @@ void wb_congested_put(struct bdi_writeback_congested *congested); struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css, gfp_t gfp); -void __inode_attach_wb(struct inode *inode, struct page *page); void wb_memcg_offline(struct mem_cgroup *memcg); void wb_blkcg_offline(struct blkcg 
*blkcg); int mapping_congested(struct address_space *mapping, struct task_struct *task, @@ -266,37 +265,6 @@ static inline bool inode_cgwb_enabled(struct inode *inode) } /** - * wb_tryget - try to increment a wb's refcount - * @wb: bdi_writeback to get - */ -static inline bool wb_tryget(struct bdi_writeback *wb) -{ - if (wb != >bdi->wb) - return percpu_ref_tryget(>refcnt); - return true; -} - -/** - *
Re: [PATCH 18/18] mm: vmscan: remove memcg stalling on writeback pages during direct reclaim
On Mon, Mar 23, 2015 at 01:07:47AM -0400, Tejun Heo wrote:
> Because writeback wasn't cgroup aware before, the usual dirty
> throttling mechanism in balance_dirty_pages() didn't work for
> processes under memcg limit.  The writeback path didn't know how much
> memory is available or how fast the dirty pages are being written out
> for a given memcg and balance_dirty_pages() didn't have any measure of
> IO back pressure for the memcg.
>
> To work around the issue, memcg implemented an ad-hoc dirty throttling
> mechanism in the direct reclaim path by stalling on pages under
> writeback which are encountered during direct reclaim scan.  This is
> rather ugly and crude - none of the configurability, fairness, or
> bandwidth-proportional distribution of the normal path.
>
> The previous patches implemented proper memcg aware dirty throttling
> and the ad-hoc mechanism is no longer necessary.  Remove it.

Oops, just realized that this can't be removed, at least not yet.
The !unified path still depends on it.  I'll update the patch to
disable these checks only on the unified hierarchy.

Thanks.

--
tejun
[PATCH 7/8] writeback: add lockdep annotation to inode_to_wb()
With the previous two patches, all operations which acquire wb from
inode are either under one of inode->i_lock, mapping->tree_lock or
wb->list_lock, or protected by stat transaction, which will be depended
upon by foreign inode wb switching.

This patch adds lockdep assertion to inode_to_wb() so that usages
outside the above list locks can be caught easily.
inode_wb_stat_unlocked_begin() is an exception as it's usually
protected by combination of !I_WB_SWITCH and rcu_read_lock(). It's
updated to dereference inode->i_wb directly.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 include/linux/backing-dev.h | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 040be1a..b9937e5 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -326,10 +326,18 @@ wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
  * inode_to_wb - determine the wb of an inode
  * @inode: inode of interest
  *
- * Returns the wb @inode is currently associated with.
+ * Returns the wb @inode is currently associated with. The caller must be
+ * holding either @inode->i_lock, @inode->i_mapping->tree_lock, or the
+ * associated wb's list_lock.
  */
 static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 {
+#ifdef CONFIG_LOCKDEP
+	WARN_ON_ONCE(debug_locks &&
+		     (!lockdep_is_held(&inode->i_lock) &&
+		      !lockdep_is_held(&inode->i_mapping->tree_lock) &&
+		      !lockdep_is_held(&inode->i_wb->list_lock)));
+#endif
 	return inode->i_wb;
 }

@@ -360,7 +368,12 @@ inode_wb_stat_unlocked_begin(struct inode *inode, bool *lockedp)
 	if (unlikely(*lockedp))
 		spin_lock_irq(&inode->i_mapping->tree_lock);
-	return inode_to_wb(inode);
+
+	/*
+	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
+	 * inode_to_wb() will bark. Deref directly.
+	 */
+	return inode->i_wb;
 }

 /**
--
2.1.0
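For readers without a kernel tree at hand, the assertion pattern in this
patch can be approximated in userspace. What follows is only a sketch
under stated assumptions, not the kernel code: toy_lock, toy_inode,
toy_warn_once() and toy_inode_to_wb() are made-up names, and a plain bool
stands in for lockdep's held-lock tracking.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a spinlock that remembers whether it is held. */
struct toy_lock {
	bool held;
};

static void toy_lock_acquire(struct toy_lock *l) { l->held = true; }
static void toy_lock_release(struct toy_lock *l) { l->held = false; }

struct toy_wb {
	struct toy_lock list_lock;
};

struct toy_inode {
	struct toy_lock i_lock;
	struct toy_lock tree_lock;	/* stands in for i_mapping->tree_lock */
	struct toy_wb *i_wb;
};

/* Number of warnings emitted so far, so callers can observe the check. */
static int toy_warn_count;

/* Warn only the first time the condition is true, loosely like WARN_ON_ONCE(). */
static void toy_warn_once(bool cond, const char *msg)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		toy_warn_count++;
		fprintf(stderr, "warning: %s\n", msg);
	}
}

/* Accessor that complains when none of the three protecting locks is held. */
static struct toy_wb *toy_inode_to_wb(struct toy_inode *inode)
{
	toy_warn_once(!inode->i_lock.held &&
		      !inode->tree_lock.held &&
		      !inode->i_wb->list_lock.held,
		      "inode_to_wb() called without a protecting lock");
	return inode->i_wb;
}
```

Calling toy_inode_to_wb() with i_lock held stays silent; calling it with
no lock held warns once and still returns the wb, which is roughly why
the unlocked stat path in the patch dereferences inode->i_wb directly
instead of going through the checked accessor.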
[PATCH 6/8] writeback: implement I_WB_SWITCH and bdi_writeback stat update transaction
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place. This patch builds the
framework for the actual switching.

This patch adds a new inode flag I_WB_SWITCH, which has two functions.
First, the easy one, it ensures that there's only one switching in
progress for a given inode. Second, it's used as a mechanism to
synchronize wb stat updates.

The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively. As such, when an inode is moved from one wb to another,
the inode's portion of those stats has to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some places.

This patch solves the problem in a similar way as memcg. Each such
lockless stat update is wrapped in a transaction surrounded by
inode_wb_stat_unlocked_begin/end(). During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCH is asserted,
mapping->tree_lock is grabbed across the transaction.

In turn, the switching path sets I_WB_SWITCH and waits for an RCU grace
period to pass before actually starting to switch, which guarantees
that all stat update paths are synchronizing against
mapping->tree_lock.

Combined with the previous patch, this makes all wb list and stat
operations protected by one of inode->i_lock, wb->list_lock, or
mapping->tree_lock while wb switching is in progress.

This patch still doesn't implement the actual switching.
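The fast-path/slow-path split of the stat transaction described above
can be sketched in userspace C11. This is an illustration only and makes
several assumptions: sw_inode, SW_I_WB_SWITCH and the sw_* helpers are
hypothetical names, a bool stands in for mapping->tree_lock, and the RCU
read-side section (and therefore the grace-period wait on the switching
side) is not modelled at all.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SW_I_WB_SWITCH 0x1u	/* hypothetical flag bit */

struct sw_inode {
	atomic_uint i_state;
	bool tree_lock_held;	/* stands in for mapping->tree_lock */
	long *wb_stat;		/* counter of the currently associated wb */
};

/* Begin a stat update; returns true if the slow, locked path was taken. */
static bool sw_stat_begin(struct sw_inode *inode)
{
	/* The kernel would also enter an RCU read-side section here. */
	bool locked = atomic_load_explicit(&inode->i_state,
					   memory_order_acquire) & SW_I_WB_SWITCH;

	if (locked)
		inode->tree_lock_held = true;	/* a real lock in the kernel */
	return locked;
}

/* End a stat update, dropping the lock only if begin took it. */
static void sw_stat_end(struct sw_inode *inode, bool locked)
{
	if (locked)
		inode->tree_lock_held = false;
}

/* One complete transaction: bump the stat of whichever wb is current. */
static void sw_account_dirty(struct sw_inode *inode)
{
	bool locked = sw_stat_begin(inode);

	(*inode->wb_stat)++;
	sw_stat_end(inode, locked);
}
```

With the flag clear, sw_stat_begin() takes no lock; once the flag is set,
every updater serializes on the lock, which is the property the switching
path relies on after its grace period has passed.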
Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 fs/fs-writeback.c           | 117 +---
 include/linux/backing-dev.h |  53
 include/linux/fs.h          |   6 +++
 mm/page-writeback.c         |  16 --
 mm/truncate.c               |   7 ++-
 5 files changed, 188 insertions(+), 11 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e888c4b..7a1ab24 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -295,6 +295,115 @@ static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode)
 	return locked_inode_to_wb_and_lock_list(inode);
 }

+struct inode_switch_wb_context {
+	struct inode		*inode;
+	struct bdi_writeback	*new_wb;
+
+	struct rcu_head		rcu_head;
+	struct work_struct	work;
+};
+
+static void inode_switch_wb_work_fn(struct work_struct *work)
+{
+	struct inode_switch_wb_context *isw =
+		container_of(work, struct inode_switch_wb_context, work);
+	struct inode *inode = isw->inode;
+	struct bdi_writeback *new_wb = isw->new_wb;
+
+	/*
+	 * By the time control reaches here, RCU grace period has passed
+	 * since I_WB_SWITCH assertion and all wb stat update transactions
+	 * between inode_wb_stat_unlocked_begin/end() are guaranteed to be
+	 * synchronizing against mapping->tree_lock.
+	 */
+	spin_lock(&inode->i_lock);
+
+	inode->i_wb_frn_winner = 0;
+	inode->i_wb_frn_avg_time = 0;
+	inode->i_wb_frn_history = 0;
+
+	/*
+	 * Paired with load_acquire in inode_wb_stat_unlocked_begin() and
+	 * ensures that the new wb is visible if they see !I_WB_SWITCH.
+	 */
+	smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH);
+
+	spin_unlock(&inode->i_lock);
+
+	iput(inode);
+	wb_put(new_wb);
+	kfree(isw);
+}
+
+static void inode_switch_wb_rcu_fn(struct rcu_head *rcu_head)
+{
+	struct inode_switch_wb_context *isw =
+		container_of(rcu_head, struct inode_switch_wb_context, rcu_head);
+
+	/* needs to grab bh-unsafe locks, bounce to work item */
+	INIT_WORK(&isw->work, inode_switch_wb_work_fn);
+	schedule_work(&isw->work);
+}
+
+/**
+ * inode_switch_wb - change the wb association of an inode
+ * @inode: target inode
+ * @new_wb_id: ID of the new wb
+ *
+ * Switch @inode's wb association to the wb identified by @new_wb_id. The
+ * switching is performed asynchronously and may fail silently.
+ */
+static void inode_switch_wb(struct inode *inode, int new_wb_id)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct cgroup_subsys_state *memcg_css;
+	struct inode_switch_wb_context *isw;
+
+	/* noop if seems to be already in progress */
+	if (inode->i_state & I_WB_SWITCH)
+		return;
+
+	isw = kzalloc(sizeof(*isw), GFP_ATOMIC);
+	if (!isw)
+		return;
+
+	/* find and pin the new wb */
+	rcu_read_lock();
+	memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys);
+	if (memcg_css)
+		isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+	rcu_read_unlock();
+	if (!isw->new_wb)
+		goto out_free;
+
+	/* while holding I_WB_SWITCH, no one else can
[PATCH 8/8] writeback: implement foreign cgroup inode bdi_writeback switching
As concurrent write sharing of an inode is expected to be very rare and memcg only tracks page ownership on first-use basis severely confining the usefulness of such sharing, cgroup writeback tracks ownership per-inode. While the support for concurrent write sharing of an inode is deemed unnecessary, an inode being written to by different cgroups at different points in time is a lot more common, and, more importantly, charging only by first-use can too readily lead to grossly incorrect behaviors (single foreign page can lead to gigabytes of writeback to be incorrectly attributed). To resolve this issue, cgroup writeback detects the majority dirtier of an inode and transfers the ownership to it. The previous patches implemented the foreign condition detection mechanism and laid the groundwork. This patch implements the actual switching. With the previously implemented [unlocked_]inode_to_wb_and_list_lock() and wb stat transaction, grabbing wb->list_lock, inode->i_lock and mapping->tree_lock gives us full exclusion against all wb operations on the target inode. inode_switch_wb_work_fn() grabs all the locks and transfers the inode atomically along with its RECLAIMABLE and WRITEBACK stats. 
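Taking two locks of the same class safely requires a fixed order, and a
common way to get one is to compare the lock addresses. The sketch below
illustrates that idea in plain C under stated assumptions: ord_lock and
the ord_* helpers are hypothetical, the "locks" merely record the order
in which they were taken, and the two locks are assumed to be distinct.

```c
#include <stdint.h>

/* Hypothetical lock that records when it was acquired. */
struct ord_lock {
	int held;
	int acquired_seq;	/* position in the global acquisition order */
};

static int ord_seq;

static void ord_lock_acquire(struct ord_lock *l)
{
	l->held = 1;
	l->acquired_seq = ++ord_seq;
}

static void ord_lock_release(struct ord_lock *l)
{
	l->held = 0;
}

/*
 * Take both locks, lower address first.  Two callers that pass the same
 * pair in opposite order still acquire them in the same sequence, which
 * rules out an ABBA deadlock.  Assumes a != b.
 */
static void ord_lock_pair(struct ord_lock *a, struct ord_lock *b)
{
	if ((uintptr_t)a < (uintptr_t)b) {
		ord_lock_acquire(a);
		ord_lock_acquire(b);
	} else {
		ord_lock_acquire(b);
		ord_lock_acquire(a);
	}
}

/* Release order does not matter for deadlock avoidance. */
static void ord_unlock_pair(struct ord_lock *a, struct ord_lock *b)
{
	ord_lock_release(a);
	ord_lock_release(b);
}
```

Whichever way the pair is passed, the lock at the lower address is always
taken first. In the kernel the second acquisition additionally needs a
nesting annotation so that lockdep does not flag two locks of one class.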
Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 fs/fs-writeback.c | 86 +--
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7a1ab24..5fc7828 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -308,30 +308,112 @@ static void inode_switch_wb_work_fn(struct work_struct *work)
 	struct inode_switch_wb_context *isw =
 		container_of(work, struct inode_switch_wb_context, work);
 	struct inode *inode = isw->inode;
+	struct address_space *mapping = inode->i_mapping;
+	struct bdi_writeback *old_wb = inode->i_wb;
 	struct bdi_writeback *new_wb = isw->new_wb;
+	struct radix_tree_iter iter;
+	bool switched = false;
+	void **slot;

 	/*
 	 * By the time control reaches here, RCU grace period has passed
 	 * since I_WB_SWITCH assertion and all wb stat update transactions
 	 * between inode_wb_stat_unlocked_begin/end() are guaranteed to be
 	 * synchronizing against mapping->tree_lock.
+	 *
+	 * Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock
+	 * gives us exclusion against all wb related operations on @inode
+	 * including IO list manipulations and stat updates.
 	 */
+	if (old_wb < new_wb) {
+		spin_lock(&old_wb->list_lock);
+		spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING);
+	} else {
+		spin_lock(&new_wb->list_lock);
+		spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING);
+	}
 	spin_lock(&inode->i_lock);
+	spin_lock_irq(&mapping->tree_lock);
+
+	/*
+	 * Once I_FREEING is visible under i_lock, the eviction path owns
+	 * the inode and we shouldn't modify ->i_wb_list.
+	 */
+	if (unlikely(inode->i_state & I_FREEING))
+		goto skip_switch;
+	/*
+	 * Count and transfer stats. Note that PAGECACHE_TAG_DIRTY points
+	 * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to
+	 * pages actually under writeback.
+	 */
+	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+				   PAGECACHE_TAG_DIRTY) {
+		struct page *page = radix_tree_deref_slot_protected(slot,
+							&mapping->tree_lock);
+		if (likely(page) && PageDirty(page)) {
+			__dec_wb_stat(old_wb, WB_RECLAIMABLE);
+			__inc_wb_stat(new_wb, WB_RECLAIMABLE);
+		}
+	}
+
+	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+				   PAGECACHE_TAG_WRITEBACK) {
+		struct page *page = radix_tree_deref_slot_protected(slot,
+							&mapping->tree_lock);
+		if (likely(page)) {
+			WARN_ON_ONCE(!PageWriteback(page));
+			__dec_wb_stat(old_wb, WB_WRITEBACK);
+			__inc_wb_stat(new_wb, WB_WRITEBACK);
+		}
+	}
+
+	wb_get(new_wb);
+
+	/*
+	 * Transfer to @new_wb's IO list if necessary. The specific list
+	 * @inode was on is ignored and the inode is put on ->b_dirty which
+	 * is always correct including from ->b_dirty_time. The transfer
+	 * preserves @inode->dirtied_when ordering.
+	 */
+	if (!list_empty(&inode->i_wb_list)) {
+		struct inode *pos;
+
+		inode_wb_list_del_locked(inode, old_wb);
+		inode->i_wb = new_wb;
+		list_for_each_entry(pos, &new_wb->b_dirty, i_wb_list)
+			if
Re: [PATCH] lguest: rename i386_head.S in the comments
Alexander Kuleshov writes:
> i386_head.S renamed to the head_32.S, let's update it in
> the comments too.

Thanks, applied!

Cheers,
Rusty.
Re: [PATCH 5/7 linux-next] ASoC: ak4554: constify of_device_id array
On Wed, Mar 18, 2015 at 05:49:00PM +0100, Fabian Frederick wrote:
> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.
Re: [PATCH] xfs: use GFP_NOFS argument in radix_tree_preload
On Mon, Mar 23, 2015 at 01:06:23AM -0400, Sanidhya Kashyap wrote:
> From: Byoungyoung Lee
>
> Following the convention of other file systems, GFP_NOFS
> should be used as an argument for radix_tree_preload() instead
> of GFP_KERNEL.

"convention of other filesystems" is not a reason for changing from
GFP_KERNEL to GFP_NOFS. There are rules for when GFP_NOFS needs to
be used, and so we only need to change the code if one of those
rules is triggered, i.e. inside a transaction, holding a lock that
memory reclaim might require to make progress (e.g. ip->i_ilock,
buffer locks, etc). The context in which the allocation is made will
tell you whether GFP_KERNEL is safe or not.

So while the change probably needs to be made, it needs to be made
for the right reasons. I haven't looked at the code, but I have
a pretty good idea of the context the allocation is being made
under. I'd suggest documenting the call path down to
xfs_mru_cache_insert(), because that will tell you exactly what
context the allocation is being made in and hence tell everyone else
the real reason we need to make this change...

Call me picky, pedantic and/or annoying, but if you are looking at
validating/correcting allocation flags then you need to understand
the rules and context in which the allocation is being made...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
[PATCH 01/48] memcg: add per cgroup dirty page accounting
From: Greg Thelen

When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
per memcg memory.stat cgroupfs file. The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback. It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter. The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.

Typical update operation:
	memcg = mem_cgroup_begin_page_stat(page)
	if (TestSetPageDirty()) {
		[...]
		mem_cgroup_update_page_stat(memcg)
	}
	mem_cgroup_end_page_stat(memcg)

Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
  rcu_read_lock()
- With CONFIG_MEMCG and inter memcg task movement, it's
  rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat() which returns the memcg later
needed by mem_cgroup_update_page_stat().

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
  __mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
  __delete_from_page_cache(), replace_page_cache_page(),
  invalidate_complete_page2(), and __remove_mapping().
   text    data     bss      dec    hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                            +192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                            +773 text bytes

Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
all metrics, they're all wall clock or cycle counts. The read and write
fault benchmarks just measure fault time, they do not include I/O time.

* CONFIG_MEMCG not set:
                      baseline                        patched
  kbuild              1m25.03(+-0.088% 3 samples)     1m25.426667(+-0.120% 3 samples)
  dd write 100 MiB    0.859211561 +-15.10%            0.874162885 +-15.03%
  dd write 200 MiB    1.670653105 +-17.87%            1.669384764 +-11.99%
  dd write 1000 MiB   8.434691190 +-14.15%            8.474733215 +-14.77%
  read fault cycles   254.0(+-0.000% 10 samples)      253.0(+-0.000% 10 samples)
  write fault cycles  2021.2(+-3.070% 10 samples)     1984.5(+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
                      baseline                        patched
  kbuild              1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
  dd write 100 MiB    0.855650830 +-14.90%            0.887557919 +-14.90%
  dd write 200 MiB    1.688322953 +-12.72%            1.667682724 +-13.33%
  dd write 1000 MiB   8.418601605 +-14.30%            8.673532299 +-15.00%
  read fault cycles   266.0(+-0.000% 10 samples)      266.0(+-0.000% 10 samples)
  write fault cycles  2051.7(+-1.349% 10 samples)     2049.6(+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
                      baseline                        patched
  kbuild              1m26.12(+-0.273% 3 samples)     1m25.76(+-0.127% 3 samples)
  dd write 100 MiB    0.861723964 +-15.25%            0.818129350 +-14.82%
  dd write 200 MiB    1.669887569 +-13.30%            1.698645885 +-13.27%
  dd write 1000 MiB   8.383191730 +-14.65%            8.351742280 +-14.52%
  read fault cycles   265.7(+-0.172% 10 samples)      267.0(+-0.000% 10 samples)
  write fault cycles  2070.6(+-1.512% 10 samples)     2084.4(+-2.148% 10 samples)

As expected anon page faults are not affected by this patch.
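The begin/update/end shape of the accounting described in this message
can be tried out in userspace. The sketch below is an assumption-laden
model, not the patch itself: acct_memcg, acct_page and the acct_*
helpers are invented names, the begin/end pair does no real locking, and
the counter is bumped only on the clean-to-dirty transition.

```c
#include <stdbool.h>

struct acct_memcg {
	long nr_dirty;		/* per-cgroup dirty page count */
};

struct acct_page {
	bool dirty;
	struct acct_memcg *memcg;	/* owner; may be NULL */
};

static long acct_nr_file_dirty;	/* global counter, updated in the same place */

/* Set the dirty bit and return its previous value. */
static bool acct_test_set_dirty(struct acct_page *page)
{
	bool was_dirty = page->dirty;

	page->dirty = true;
	return was_dirty;
}

/* Pin the page's memcg for the duration of the update (no-op model). */
static struct acct_memcg *acct_begin_page_stat(struct acct_page *page)
{
	return page->memcg;
}

static void acct_end_page_stat(struct acct_memcg *memcg)
{
	(void)memcg;
}

/* Returns true if the page was newly dirtied and therefore accounted. */
static bool acct_set_page_dirty(struct acct_page *page)
{
	struct acct_memcg *memcg = acct_begin_page_stat(page);
	bool newly_dirty = !acct_test_set_dirty(page);

	if (newly_dirty) {
		acct_nr_file_dirty++;
		if (memcg)
			memcg->nr_dirty++;
	}
	acct_end_page_stat(memcg);
	return newly_dirty;
}
```

Dirtying the same page twice bumps both counters only once, so the
per-cgroup count stays in step with the global one.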
Signed-off-by: Sha Zhengju Signed-off-by: Greg Thelen --- Documentation/cgroups/memory.txt | 1 + fs/buffer.c | 34 +++--- fs/xfs/xfs_aops.c| 12 ++-- include/linux/memcontrol.h | 1 + include/linux/mm.h | 3 ++- include/linux/pagemap.h | 3 ++- mm/filemap.c
[PATCH 08/48] blkcg: implement bio_associate_blkcg()
Currently, a bio can only be associated with the io_context and blkcg
of %current using bio_associate_current(). This is too restrictive
for cgroup writeback support. Implement bio_associate_blkcg() which
associates a bio with the specified blkcg.

bio_associate_blkcg() leaves the io_context unassociated.
bio_associate_current() is updated so that it considers a bio as
already associated if it has a blkcg_css, instead of an io_context,
associated with it.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Vivek Goyal
---
 block/bio.c         | 24 +++-
 include/linux/bio.h |  3 +++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 968683e..ab7517d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1971,6 +1971,28 @@ struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_
 EXPORT_SYMBOL(bioset_create_nobvec);

 #ifdef CONFIG_BLK_CGROUP
+
+/**
+ * bio_associate_blkcg - associate a bio with the specified blkcg
+ * @bio: target bio
+ * @blkcg_css: css of the blkcg to associate
+ *
+ * Associate @bio with the blkcg specified by @blkcg_css. Block layer will
+ * treat @bio as if it were issued by a task which belongs to the blkcg.
+ *
+ * This function takes an extra reference of @blkcg_css which will be put
+ * when @bio is released. The caller must own @bio and is responsible for
+ * synchronizing calls to this function.
+ */
+int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
+{
+	if (unlikely(bio->bi_css))
+		return -EBUSY;
+	css_get(blkcg_css);
+	bio->bi_css = blkcg_css;
+	return 0;
+}
+
 /**
  * bio_associate_current - associate a bio with %current
  * @bio: target bio
@@ -1988,7 +2010,7 @@ int bio_associate_current(struct bio *bio)
 {
 	struct io_context *ioc;

-	if (bio->bi_ioc)
+	if (bio->bi_css)
 		return -EBUSY;

 	ioc = current->io_context;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127..cbc5d1d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -469,9 +469,12 @@ extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
 extern unsigned int bvec_nr_vecs(unsigned short idx);

 #ifdef CONFIG_BLK_CGROUP
+int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
 int bio_associate_current(struct bio *bio);
 void bio_disassociate_task(struct bio *bio);
 #else	/* CONFIG_BLK_CGROUP */
+static inline int bio_associate_blkcg(struct bio *bio,
+			struct cgroup_subsys_state *blkcg_css) { return 0; }
 static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
 static inline void bio_disassociate_task(struct bio *bio) { }
 #endif /* CONFIG_BLK_CGROUP */
--
2.1.0
[PATCH 06/48] cgroup, block: implement task_get_css() and use it in bio_associate_current()
bio_associate_current() currently open codes task_css() and
css_tryget_online() to find and pin $current's blkcg css. Abstract it
into task_get_css() which is implemented from cgroup side. As a task
is always associated with an online css for every subsystem except
while the css_set update is propagating, task_get_css() retries till
css_tryget_online() succeeds.

This is a cleanup and shouldn't lead to noticeable behavior changes.

Signed-off-by: Tejun Heo
Cc: Li Zefan
Cc: Jens Axboe
Cc: Vivek Goyal
---
 block/bio.c            | 11 +--
 include/linux/cgroup.h | 25 +
 2 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f66a4ea..968683e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1987,7 +1987,6 @@ EXPORT_SYMBOL(bioset_create_nobvec);
 int bio_associate_current(struct bio *bio)
 {
 	struct io_context *ioc;
-	struct cgroup_subsys_state *css;

 	if (bio->bi_ioc)
 		return -EBUSY;
@@ -1996,17 +1995,9 @@ int bio_associate_current(struct bio *bio)
 	if (!ioc)
 		return -ENOENT;

-	/* acquire active ref on @ioc and associate */
 	get_io_context_active(ioc);
 	bio->bi_ioc = ioc;
-
-	/* associate blkcg if exists */
-	rcu_read_lock();
-	css = task_css(current, blkio_cgrp_id);
-	if (css && css_tryget_online(css))
-		bio->bi_css = css;
-	rcu_read_unlock();
-
+	bio->bi_css = task_get_css(current, blkio_cgrp_id);
 	return 0;
 }

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..e7da0aa 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -774,6 +774,31 @@ static inline struct cgroup_subsys_state *task_css(struct task_struct *task,
 }

 /**
+ * task_get_css - find and get the css for (task, subsys)
+ * @task: the target task
+ * @subsys_id: the target subsystem ID
+ *
+ * Find the css for the (@task, @subsys_id) combination, increment a
+ * reference on and return it. This function is guaranteed to return a
+ * valid css.
+ */
+static inline struct cgroup_subsys_state *
+task_get_css(struct task_struct *task, int subsys_id)
+{
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+	while (true) {
+		css = task_css(task, subsys_id);
+		if (likely(css_tryget_online(css)))
+			break;
+		cpu_relax();
+	}
+	rcu_read_unlock();
+	return css;
+}
+
+/**
  * task_css_is_root - test whether a task belongs to the root css
  * @task: the target task
  * @subsys_id: the target subsystem ID
--
2.1.0
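The "retry until tryget succeeds" idea can be illustrated with C11
atomics. The sketch below is a simplified model and rests on
assumptions: ref_css, ref_task and the ref_* helpers are hypothetical, an
offline css is modelled as one whose reference count has dropped to
zero, and RCU is left out, so nothing here protects the css memory the
way rcu_read_lock() does in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ref_css {
	atomic_int refcnt;	/* 0 models a css that went offline */
};

struct ref_task {
	struct ref_css *_Atomic css;	/* may be switched by a migration */
};

/* Take a reference unless the count already hit zero. */
static bool ref_css_tryget(struct ref_css *css)
{
	int old = atomic_load(&css->refcnt);

	while (old > 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&css->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Re-read the task's css until a reference can be taken.  The loop only
 * terminates because a task is expected to point at a live css again
 * shortly after any migration.
 */
static struct ref_css *ref_task_get_css(struct ref_task *task)
{
	struct ref_css *css;

	for (;;) {
		css = atomic_load(&task->css);
		if (ref_css_tryget(css))
			break;
		/* the kernel version calls cpu_relax() here */
	}
	return css;
}
```

On a live css the loop exits on its first pass with the count bumped by
one; on a dead css ref_css_tryget() fails without touching the count.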
[PATCH 02/48] blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h
cgroup aware writeback support will require exposing some of blkcg
details. In preparation, move block/blk-cgroup.h to
include/linux/blk-cgroup.h. This patch is pure file move.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
---
 block/blk-cgroup.c         |   2 +-
 block/blk-cgroup.h         | 603 -
 block/blk-core.c           |   2 +-
 block/blk-sysfs.c          |   2 +-
 block/blk-throttle.c       |   2 +-
 block/cfq-iosched.c        |   2 +-
 block/elevator.c           |   2 +-
 include/linux/blk-cgroup.h | 603 +
 8 files changed, 609 insertions(+), 609 deletions(-)
 delete mode 100644 block/blk-cgroup.h
 create mode 100644 include/linux/blk-cgroup.h

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0ac817b..c3226ce 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -19,7 +19,7 @@
 #include
 #include
 #include
-#include "blk-cgroup.h"
+#include <linux/blk-cgroup.h>
 #include "blk.h"

 #define MAX_KEY_LEN 100
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
deleted file mode 100644
index c567865..0000000
--- a/block/blk-cgroup.h
+++ /dev/null
@@ -1,603 +0,0 @@
-#ifndef _BLK_CGROUP_H
-#define _BLK_CGROUP_H
-/*
- * Common Block IO controller cgroup interface
- *
- * Based on ideas and code from CFQ, CFS and BFQ:
- * Copyright (C) 2003 Jens Axboe
- *
- * Copyright (C) 2008 Fabio Checconi
- *		      Paolo Valente
- *
- * Copyright (C) 2009 Vivek Goyal
- *		      Nauman Rafique
- */
-
-#include
-#include
-#include
-#include
-#include
-#include
-
-/* Max limits for throttle policy */
-#define THROTL_IOPS_MAX		UINT_MAX
-
-/* CFQ specific, out here for blkcg->cfq_weight */
-#define CFQ_WEIGHT_MIN		10
-#define CFQ_WEIGHT_MAX		1000
-#define CFQ_WEIGHT_DEFAULT	500
-
-#ifdef CONFIG_BLK_CGROUP
-
-enum blkg_rwstat_type {
-	BLKG_RWSTAT_READ,
-	BLKG_RWSTAT_WRITE,
-	BLKG_RWSTAT_SYNC,
-	BLKG_RWSTAT_ASYNC,
-
-	BLKG_RWSTAT_NR,
-	BLKG_RWSTAT_TOTAL = BLKG_RWSTAT_NR,
-};
-
-struct blkcg_gq;
-
-struct blkcg {
-	struct cgroup_subsys_state	css;
-	spinlock_t			lock;
-
-	struct radix_tree_root		blkg_tree;
-	struct blkcg_gq			*blkg_hint;
-	struct hlist_head		blkg_list;
-
-	/* TODO: per-policy storage in blkcg */
-	unsigned int			cfq_weight;	/* belongs to cfq */
-	unsigned int			cfq_leaf_weight;
-};
-
-struct blkg_stat {
-	struct u64_stats_sync		syncp;
-	uint64_t			cnt;
-};
-
-struct blkg_rwstat {
-	struct u64_stats_sync		syncp;
-	uint64_t			cnt[BLKG_RWSTAT_NR];
-};
-
-/*
- * A blkcg_gq (blkg) is association between a block cgroup (blkcg) and a
- * request_queue (q). This is used by blkcg policies which need to track
- * information per blkcg - q pair.
- *
- * There can be multiple active blkcg policies and each has its private
- * data on each blkg, the size of which is determined by
- * blkcg_policy->pd_size. blkcg core allocates and frees such areas
- * together with blkg and invokes pd_init/exit_fn() methods.
- *
- * Such private data must embed struct blkg_policy_data (pd) at the
- * beginning and pd_size can't be smaller than pd.
- */
-struct blkg_policy_data {
-	/* the blkg and policy id this per-policy data belongs to */
-	struct blkcg_gq			*blkg;
-	int				plid;
-
-	/* used during policy activation */
-	struct list_head		alloc_node;
-};
-
-/* association between a blk cgroup and a request queue */
-struct blkcg_gq {
-	/* Pointer to the associated request_queue */
-	struct request_queue		*q;
-	struct list_head		q_node;
-	struct hlist_node		blkcg_node;
-	struct blkcg			*blkcg;
-
-	/* all non-root blkcg_gq's are guaranteed to have access to parent */
-	struct blkcg_gq			*parent;
-
-	/* request allocation list for this blkcg-q pair */
-	struct request_list		rl;
-
-	/* reference count */
-	atomic_t			refcnt;
-
-	/* is this blkg online? protected by both blkcg and q locks */
-	bool				online;
-
-	struct blkg_policy_data		*pd[BLKCG_MAX_POLS];
-
-	struct rcu_head			rcu_head;
-};
-
-typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_online_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_offline_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_exit_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg);
-
-struct blkcg_policy {
-	int				plid;
-	/* policy specific private data
[PATCH 03/48] update !CONFIG_BLK_CGROUP dummies in include/linux/blk-cgroup.h
The header file will be used more widely with the pending cgroup
writeback support and the current set of dummy declarations aren't
enough to handle different config combinations. Update as follows.

* Drop the struct cgroup declaration. None of the dummy defs need it.

* Define blkcg as an empty struct instead of just declaring it.

* Wrap dummy function defs in CONFIG_BLOCK. Some functions use block
  data types and none of them are to be used w/o block enabled.

Signed-off-by: Tejun Heo
---
 include/linux/blk-cgroup.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index c567865..51f95b3 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -558,8 +558,8 @@ static inline void blkg_rwstat_merge(struct blkg_rwstat *to,

 #else	/* CONFIG_BLK_CGROUP */

-struct cgroup;
-struct blkcg;
+struct blkcg {
+};

 struct blkg_policy_data {
 };
@@ -570,6 +570,8 @@ struct blkcg_gq {
 struct blkcg_policy {
 };

+#ifdef CONFIG_BLOCK
+
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
 static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
 static inline void blkcg_drain_queue(struct request_queue *q) { }
@@ -599,5 +601,6 @@ static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q
 #define blk_queue_for_each_rl(rl, q)	\
 	for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)

+#endif	/* CONFIG_BLOCK */
 #endif	/* CONFIG_BLK_CGROUP */
 #endif	/* _BLK_CGROUP_H */
--
2.1.0
[PATCH 05/48] blkcg: add blkcg_root_css
Add global constant blkcg_root_css which points to &blkcg_root.css.
This will be used by cgroup writeback support. If blkcg is disabled,
it's defined as ERR_PTR(-EINVAL).

v2: The declarations moved to include/linux/blk-cgroup.h as suggested
    by Vivek.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Cc: Jens Axboe
---
 block/blk-cgroup.c         | 2 ++
 include/linux/blk-cgroup.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c3226ce..9e0fe38 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,6 +30,8 @@ struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT,
 			    .cfq_leaf_weight = 2 * CFQ_WEIGHT_DEFAULT, };
 EXPORT_SYMBOL_GPL(blkcg_root);

+struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
+
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];

 static bool blkcg_policy_enabled(struct request_queue *q,
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 51f95b3..65f0c17 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -134,6 +134,7 @@ struct blkcg_policy {
 };

 extern struct blkcg blkcg_root;
+extern struct cgroup_subsys_state * const blkcg_root_css;

 struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
@@ -570,6 +571,8 @@ struct blkcg_gq {
 struct blkcg_policy {
 };

+#define blkcg_root_css	((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
+
 #ifdef CONFIG_BLOCK

 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
--
2.1.0
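Defining the disabled-case symbol as ERR_PTR(-EINVAL) relies on the
kernel convention of encoding small negative errno values in the top
page of the address space. A rough userspace rendition of that
convention is sketched below; the ep_* helpers, EP_MAX_ERRNO and ep_css
are stand-ins written for this illustration and are not the kernel
macros themselves.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EP_MAX_ERRNO 4095	/* largest errno value that is encoded */

/* Encode a negative errno value as a pointer. */
static inline void *ep_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long ep_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the range reserved for error codes. */
static inline bool ep_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-EP_MAX_ERRNO;
}

struct ep_css {
	int id;
};

/* With the controller compiled out, the "root css" is only an error cookie. */
#define ep_blkcg_root_css ((struct ep_css *)ep_err_ptr(-EINVAL))
```

A caller that receives ep_blkcg_root_css can detect the disabled case
with ep_is_err() before dereferencing, while ordinary pointers and NULL
are not mistaken for errors.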
[PATCH 10/48] writeback: move backing_dev_info->state into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
  changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeroing of
  wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: drbd-...@lists.linbit.com
Cc: Neil Brown
Cc: Alasdair Kergon
Cc: Mike Snitzer
---
 block/blk-core.c               |  1 -
 drivers/block/drbd/drbd_main.c | 10 +-
 drivers/md/dm.c                |  2 +-
 drivers/md/raid1.c             |  4 ++--
 drivers/md/raid10.c            |  2 +-
 fs/fs-writeback.c              | 14 +++---
 include/linux/backing-dev.h    | 24
 mm/backing-dev.c               | 20 ++--
 8 files changed, 38 insertions(+), 39 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 56da125..fa1314e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -606,7 +606,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	q->backing_dev_info.state = 0;
 	q->backing_dev_info.capabilities = 0;
 	q->backing_dev_info.name = "block";
 	q->node = node_id;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 1fc8342..61b00aa 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2360,7 +2360,7 @@ static void drbd_cleanup(void)
  * @congested_data:	User data
  * @bdi_bits:		Bits the BDI flusher thread is currently interested in
  *
- * Returns 1flags)) {
-		r |= (1 << BDI_async_congested);
+		r |= (1 << WB_async_congested);
 		/* Without good local data, we would need to read from remote,
 		 * and that would need the worker thread as well, which is
 		 * currently blocked waiting for that usermode helper to
 		 * finish.
 		 */
 		if (!get_ldev_if_state(device, D_UP_TO_DATE))
-			r |= (1 << BDI_sync_congested);
+			r |= (1 << WB_sync_congested);
 		else
 			put_ldev(device);
 		r &= bdi_bits;
@@ -2400,9 +2400,9 @@ static int drbd_congested(void *congested_data, int bdi_bits)
 		reason = 'b';
 	}

-	if (bdi_bits & (1 << BDI_async_congested) &&
+	if (bdi_bits & (1 << WB_async_congested) &&
 	    test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) {
-		r |= (1 << BDI_async_congested);
+		r |= (1 << WB_async_congested);
 		reason = reason == 'b' ? 'a' : 'n';
 	}

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 73f2880..c076982 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2039,7 +2039,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
 			 * the query about congestion status of request_queue
 			 */
 			if (dm_request_based(md))
-				r = md->queue->backing_dev_info.state &
+				r = md->queue->backing_dev_info.wb.state &
 				    bdi_bits;
 			else
 				r = dm_table_any_congested(map, bdi_bits);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238..2fca392 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -739,7 +739,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
 	struct r1conf *conf = mddev->private;
 	int i, ret = 0;

-	if ((bits & (1 << BDI_async_congested)) &&
+	if ((bits & (1 << WB_async_congested)) &&
 	    conf->pending_count >= max_queued_requests)
 		return 1;

@@ -754,7 +754,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
-			if ((bits & (1 backing_dev_info, bits);
diff --git a/drivers/md/raid10.c
[PATCH 12/48] writeback: move bandwidth related fields from backing_dev_info into bdi_writeback
Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.

* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
  write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
  balanced_dirty_ratelimit, completions and dirty_exceeded.

* writeback_chunk_size() and over_bground_thresh() now take @wb
  instead of @bdi.

* bdi_writeout_fraction(bdi, ...)      -> wb_writeout_fraction(wb, ...)
  bdi_dirty_limit(bdi, ...)            -> wb_dirty_limit(wb, ...)
  bdi_position_ratio(bdi, ...)         -> wb_position_ratio(wb, ...)
  bdi_update_write_bandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
  [__]bdi_update_bandwidth(bdi, ...)   -> [__]wb_update_bandwidth(wb, ...)
  bdi_{max|min}_pause(bdi, ...)        -> wb_{max|min}_pause(wb, ...)
  bdi_dirty_limits(bdi, ...)           -> wb_dirty_limits(wb, ...)

* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
  respectively. Note that explicit zeroing is dropped in the process
  as wb's are cleared in entirety anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
  introducing no behavior changes.

Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Jaegeuk Kim Cc: Steven Whitehouse --- fs/f2fs/node.c | 4 +- fs/f2fs/segment.h| 2 +- fs/fs-writeback.c| 17 ++- fs/gfs2/super.c | 2 +- include/linux/backing-dev.h | 20 +-- include/linux/writeback.h| 19 ++- include/trace/events/writeback.h | 8 +- mm/backing-dev.c | 45 +++ mm/page-writeback.c | 262 --- 9 files changed, 187 insertions(+), 192 deletions(-) diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 97bd9d3..a97da4e 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -51,7 +51,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type) PAGE_CACHE_SHIFT; res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2); } else if (type == DIRTY_DENTS) { - if (sbi->sb->s_bdi->dirty_exceeded) + if (sbi->sb->s_bdi->wb.dirty_exceeded) return false; mem_size = get_pages(sbi, F2FS_DIRTY_DENTS); res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); @@ -63,7 +63,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type) sizeof(struct ino_entry)) >> PAGE_CACHE_SHIFT; res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); } else { - if (sbi->sb->s_bdi->dirty_exceeded) + if (sbi->sb->s_bdi->wb.dirty_exceeded) return false; } return res; diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h index 7fd3511..3a5bfcf 100644 --- a/fs/f2fs/segment.h +++ b/fs/f2fs/segment.h @@ -712,7 +712,7 @@ static inline unsigned int max_hw_blocks(struct f2fs_sb_info *sbi) */ static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type) { - if (sbi->sb->s_bdi->dirty_exceeded) + if (sbi->sb->s_bdi->wb.dirty_exceeded) return 0; if (type == DATA) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 992a065..4fcf2385 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -606,7 +606,7 @@ out: return ret; } -static long writeback_chunk_size(struct backing_dev_info *bdi, +static long writeback_chunk_size(struct bdi_writeback *wb, struct wb_writeback_work *work) { long pages; @@ -627,7 +627,7 
@@ static long writeback_chunk_size(struct backing_dev_info *bdi, if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) pages = LONG_MAX; else { - pages = min(bdi->avg_write_bandwidth / 2, + pages = min(wb->avg_write_bandwidth / 2, global_dirty_limit / DIRTY_SCOPE); pages = min(pages, work->nr_pages); pages = round_down(pages + MIN_WRITEBACK_PAGES, @@ -725,7 +725,7 @@ static long writeback_sb_inodes(struct super_block *sb, inode->i_state |= I_SYNC; spin_unlock(>i_lock); - write_chunk = writeback_chunk_size(wb->bdi, work); + write_chunk = writeback_chunk_size(wb, work); wbc.nr_to_write = write_chunk;
[PATCH 09/48] memcg: implement mem_cgroup_css_from_page()
Implement mem_cgroup_css_from_page() which returns the
cgroup_subsys_state of the memcg associated with a given page. This
will be used by cgroup writeback support.

Signed-off-by: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 294498f..637ef62 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -115,6 +115,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 }
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
 
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fda7025..74241b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -591,6 +591,20 @@ struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg)
 	return &memcg->css;
 }
 
+/**
+ * mem_cgroup_css_from_page - css of the memcg associated with a page
+ * @page: page of interest
+ *
+ * This function is guaranteed to return a valid cgroup_subsys_state and
+ * the returned css remains accessible until @page is released.
+ */
+struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
+{
+	if (page->mem_cgroup)
+		return &page->mem_cgroup->css;
+	return &root_mem_cgroup->css;
+}
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 {
-- 
2.1.0
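For readers following along outside the kernel tree, the lookup in this patch can be sketched in plain userspace C. This is only a model under stated assumptions: css_model, memcg_model, page_model and root_memcg_model are hypothetical stand-ins, not the kernel structures, and no reference counting or locking is modelled. It shows the one property the kerneldoc comment promises, namely that a page without a memcg pointer resolves to the root memcg, so the caller never sees NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
struct css_model { int id; };
struct memcg_model { struct css_model css; };
struct page_model { struct memcg_model *mem_cgroup; };

/* Hypothetical stand-in for root_mem_cgroup. */
static struct memcg_model root_memcg_model = { .css = { .id = 0 } };

/* Never returns NULL: a page with no memcg resolves to the root memcg's css. */
static struct css_model *css_from_page_model(const struct page_model *page)
{
	assert(page != NULL);
	if (page->mem_cgroup)
		return &page->mem_cgroup->css;
	return &root_memcg_model.css;
}
```

A caller can therefore dereference the result unconditionally, which is presumably why the real helper falls back to the root memcg instead of returning NULL.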
[PATCH 13/48] writeback: s/bdi/wb/ in mm/page-writeback.c
Writeback operations will now be per wb (bdi_writeback) instead of bdi. Replace the relevant bdi references in symbol names and comments with wb. This patch is purely cosmetic and doesn't make any functional changes. Signed-off-by: Tejun Heo Cc: Wu Fengguang Cc: Jan Kara Cc: Jens Axboe --- mm/page-writeback.c | 270 ++-- 1 file changed, 134 insertions(+), 136 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 29fb4f3..c615a15 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -595,7 +595,7 @@ static long long pos_ratio_polynom(unsigned long setpoint, * * (o) global/bdi setpoints * - * We want the dirty pages be balanced around the global/bdi setpoints. + * We want the dirty pages be balanced around the global/wb setpoints. * When the number of dirty pages is higher/lower than the setpoint, the * dirty position control ratio (and hence task dirty ratelimit) will be * decreased/increased to bring the dirty pages back to the setpoint. @@ -605,8 +605,8 @@ static long long pos_ratio_polynom(unsigned long setpoint, * if (dirty < setpoint) scale up pos_ratio * if (dirty > setpoint) scale down pos_ratio * - * if (bdi_dirty < bdi_setpoint) scale up pos_ratio - * if (bdi_dirty > bdi_setpoint) scale down pos_ratio + * if (wb_dirty < wb_setpoint) scale up pos_ratio + * if (wb_dirty > wb_setpoint) scale down pos_ratio * * task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT * @@ -631,7 +631,7 @@ static long long pos_ratio_polynom(unsigned long setpoint, * 0 +.--.--*-> * freerun^ setpoint^ limit^ dirty pages * - * (o) bdi control line + * (o) wb control line * * ^ pos_ratio * | @@ -657,27 +657,27 @@ static long long pos_ratio_polynom(unsigned long setpoint, * | . . * | . . 
* 0 +--.---.-> - *bdi_setpoint^x_intercept^ + *wb_setpoint^x_intercept^ * - * The bdi control line won't drop below pos_ratio=1/4, so that bdi_dirty can + * The wb control line won't drop below pos_ratio=1/4, so that wb_dirty can * be smoothly throttled down to normal if it starts high in situations like * - start writing to a slow SD card and a fast disk at the same time. The SD - * card's bdi_dirty may rush to many times higher than bdi_setpoint. - * - the bdi dirty thresh drops quickly due to change of JBOD workload + * card's wb_dirty may rush to many times higher than wb_setpoint. + * - the wb dirty thresh drops quickly due to change of JBOD workload */ static unsigned long wb_position_ratio(struct bdi_writeback *wb, unsigned long thresh, unsigned long bg_thresh, unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty) + unsigned long wb_thresh, + unsigned long wb_dirty) { unsigned long write_bw = wb->avg_write_bandwidth; unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); unsigned long limit = hard_dirty_limit(thresh); unsigned long x_intercept; unsigned long setpoint; /* dirty pages' target balance point */ - unsigned long bdi_setpoint; + unsigned long wb_setpoint; unsigned long span; long long pos_ratio;/* for scaling up/down the rate limit */ long x; @@ -696,146 +696,145 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb, /* * The strictlimit feature is a tool preventing mistrusted filesystems * from growing a large number of dirty pages before throttling. For -* such filesystems balance_dirty_pages always checks bdi counters -* against bdi limits. Even if global "nr_dirty" is under "freerun". +* such filesystems balance_dirty_pages always checks wb counters +* against wb limits. Even if global "nr_dirty" is under "freerun". * This is especially important for fuse which sets bdi->max_ratio to * 1% by default. 
Without strictlimit feature, fuse writeback may * consume arbitrary amount of RAM because it is accounted in * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty". * * Here, in wb_position_ratio(), we calculate pos_ratio based on -* two values: bdi_dirty and bdi_thresh. Let's consider an example: +* two values: wb_dirty and wb_thresh. Let's consider an example:
[PATCH 16/48] writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. Signed-off-by: Tejun Heo Cc: Jens Axboe --- block/blk-integrity.c| 1 + block/blk-sysfs.c| 1 + block/bounce.c | 1 + block/genhd.c| 1 + drivers/block/drbd/drbd_int.h| 1 + drivers/block/pktcdvd.c | 1 + drivers/char/raw.c | 1 + drivers/md/bcache/request.c | 1 + drivers/md/dm.h | 1 + drivers/md/md.h | 1 + drivers/mtd/devices/block2mtd.c | 1 + fs/block_dev.c | 1 + fs/ext4/extents.c| 1 + fs/ext4/mballoc.c| 1 + fs/ext4/super.c | 1 + fs/f2fs/segment.h| 1 + fs/hfs/super.c | 1 + fs/hfsplus/super.c | 1 + fs/nfs/filelayout/filelayout.c | 1 + fs/ocfs2/file.c | 1 + fs/reiserfs/super.c | 1 + fs/ufs/super.c | 1 + fs/xfs/xfs_file.c| 1 + include/linux/backing-dev-defs.h | 106 +++ include/linux/backing-dev.h | 102 + include/linux/blkdev.h | 2 +- mm/madvise.c | 1 + 27 files changed, 132 insertions(+), 102 deletions(-) create mode 100644 include/linux/backing-dev-defs.h diff --git a/block/blk-integrity.c b/block/blk-integrity.c index 79ffb48..f548b64 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -21,6 +21,7 @@ */ #include +#include #include #include #include diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 5677eb7..1b60941 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include diff --git a/block/bounce.c b/block/bounce.c index ab21ba2..c616a60 100644 --- a/block/bounce.c +++ b/block/bounce.c @@ -13,6 +13,7 @@ #include 
#include #include +#include #include #include #include diff --git a/block/genhd.c b/block/genhd.c index 0a536dc..d46ba56 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h index b905e98..efd19c2 100644 --- a/drivers/block/drbd/drbd_int.h +++ b/drivers/block/drbd/drbd_int.h @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c index 09e628da..4c20c22 100644 --- a/drivers/block/pktcdvd.c +++ b/drivers/block/pktcdvd.c @@ -61,6 +61,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/char/raw.c b/drivers/char/raw.c index 6e29bf2..ee47e59 100644 --- a/drivers/char/raw.c +++ b/drivers/char/raw.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index ab43fad..9c083b9 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -15,6 +15,7 @@ #include #include #include +#include #include diff --git a/drivers/md/dm.h b/drivers/md/dm.h index 59f53e7..ae4a3ca 100644 --- a/drivers/md/dm.h +++ b/drivers/md/dm.h @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/md/md.h b/drivers/md/md.h index 318ca8f..641abb5 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -16,6 +16,7 @@ #define _MD_MD_H #include +#include #include #include #include diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c index 66f0405..e22e40f 100644 --- a/drivers/mtd/devices/block2mtd.c +++ b/drivers/mtd/devices/block2mtd.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include diff --git a/fs/block_dev.c b/fs/block_dev.c index 975266b..e4f5f71 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -14,6 +14,7 @@ 
#include #include #include +#include #include #include #include diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index bed4308..21a7bcb 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -39,6 +39,7 @@ #include #include #include +#include #include "ext4_jbd2.h" #include "ext4_extents.h" #include "xattr.h" diff --git
[PATCH 15/48] writeback: reorganize mm/backing-dev.c
Move wb_shutdown(), bdi_register(), bdi_register_dev(), bdi_prune_sb(), bdi_remove_from_list() and bdi_unregister() so that init / exit functions are grouped together. This will make updating init / exit paths for cgroup writeback support easier. This is pure source file reorganization. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang --- mm/backing-dev.c | 174 +++ 1 file changed, 87 insertions(+), 87 deletions(-) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 597f0ce..ff85ecb 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -286,93 +286,6 @@ void wb_wakeup_delayed(struct bdi_writeback *wb) } /* - * Remove bdi from bdi_list, and ensure that it is no longer visible - */ -static void bdi_remove_from_list(struct backing_dev_info *bdi) -{ - spin_lock_bh(_lock); - list_del_rcu(>bdi_list); - spin_unlock_bh(_lock); - - synchronize_rcu_expedited(); -} - -int bdi_register(struct backing_dev_info *bdi, struct device *parent, - const char *fmt, ...) -{ - va_list args; - struct device *dev; - - if (bdi->dev) /* The driver needs to use separate queues per device */ - return 0; - - va_start(args, fmt); - dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args); - va_end(args); - if (IS_ERR(dev)) - return PTR_ERR(dev); - - bdi->dev = dev; - - bdi_debug_register(bdi, dev_name(dev)); - set_bit(WB_registered, >wb.state); - - spin_lock_bh(_lock); - list_add_tail_rcu(>bdi_list, _list); - spin_unlock_bh(_lock); - - trace_writeback_bdi_register(bdi); - return 0; -} -EXPORT_SYMBOL(bdi_register); - -int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev) -{ - return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev)); -} -EXPORT_SYMBOL(bdi_register_dev); - -/* - * Remove bdi from the global list and shutdown any threads we have running - */ -static void wb_shutdown(struct bdi_writeback *wb) -{ - /* Make sure nobody queues further work */ - spin_lock_bh(>work_lock); - if (!test_and_clear_bit(WB_registered, >state)) { - 
spin_unlock_bh(>work_lock); - return; - } - spin_unlock_bh(>work_lock); - - /* -* Drain work list and shutdown the delayed_work. !WB_registered -* tells wb_workfn() that @wb is dying and its work_list needs to -* be drained no matter what. -*/ - mod_delayed_work(bdi_wq, >dwork, 0); - flush_delayed_work(>dwork); - WARN_ON(!list_empty(>work_list)); -} - -/* - * Called when the device behind @bdi has been removed or ejected. - * - * We can't really do much here except for reducing the dirty ratio at - * the moment. In the future we should be able to set a flag so that - * the filesystem can handle errors at mark_inode_dirty time instead - * of only at writeback time. - */ -void bdi_unregister(struct backing_dev_info *bdi) -{ - if (WARN_ON_ONCE(!bdi->dev)) - return; - - bdi_set_min_ratio(bdi, 0); -} -EXPORT_SYMBOL(bdi_unregister); - -/* * Initial write bandwidth: 100 MB/s */ #define INIT_BW(100 << (20 - PAGE_SHIFT)) @@ -418,6 +331,29 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) return 0; } +/* + * Remove bdi from the global list and shutdown any threads we have running + */ +static void wb_shutdown(struct bdi_writeback *wb) +{ + /* Make sure nobody queues further work */ + spin_lock_bh(>work_lock); + if (!test_and_clear_bit(WB_registered, >state)) { + spin_unlock_bh(>work_lock); + return; + } + spin_unlock_bh(>work_lock); + + /* +* Drain work list and shutdown the delayed_work. !WB_registered +* tells wb_workfn() that @wb is dying and its work_list needs to +* be drained no matter what. +*/ + mod_delayed_work(bdi_wq, >dwork, 0); + flush_delayed_work(>dwork); + WARN_ON(!list_empty(>work_list)); +} + static void wb_exit(struct bdi_writeback *wb) { int i; @@ -449,6 +385,70 @@ int bdi_init(struct backing_dev_info *bdi) } EXPORT_SYMBOL(bdi_init); +int bdi_register(struct backing_dev_info *bdi, struct device *parent, + const char *fmt, ...) 
+{ + va_list args; + struct device *dev; + + if (bdi->dev) /* The driver needs to use separate queues per device */ + return 0; + + va_start(args, fmt); + dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args); + va_end(args); + if (IS_ERR(dev)) + return PTR_ERR(dev); + + bdi->dev = dev; + + bdi_debug_register(bdi, dev_name(dev)); + set_bit(WB_registered, >wb.state); + + spin_lock_bh(_lock); + list_add_tail_rcu(>bdi_list, _list); +
[PATCH 14/48] writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback
Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->wb_lock and ->worklist into wb.

* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
  moving, rename it to wb->work_lock as wb->wb_lock is confusing.
  Also, move wb->dwork downwards so that it's colocated with the new
  ->work_lock and ->work_list fields.

* bdi_writeback_workfn()          -> wb_workfn()
  bdi_wakeup_thread_delayed(bdi)  -> wb_wakeup_delayed(wb)
  bdi_wakeup_thread(bdi)          -> wb_wakeup(wb)
  bdi_queue_work(bdi, ...)        -> wb_queue_work(wb, ...)
  __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
  get_next_work_item(bdi)         -> get_next_work_item(wb)

* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
  The function contained parts which belong to the containing bdi
  rather than the wb itself - testing cap_writeback_dirty and
  bdi_remove_from_list() invocation. Those are moved to
  bdi_unregister().

* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
  Initializations of the moved bdi->wb_lock and ->work_list are
  relocated from bdi_init() to wb_init().

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.

Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang --- fs/fs-writeback.c | 84 + include/linux/backing-dev.h | 12 +++ mm/backing-dev.c| 59 +++ 3 files changed, 74 insertions(+), 81 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 4fcf2385..7c2f0bd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -97,34 +97,33 @@ static inline struct inode *wb_inode(struct list_head *head) EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage); -static void bdi_wakeup_thread(struct backing_dev_info *bdi) +static void wb_wakeup(struct bdi_writeback *wb) { - spin_lock_bh(>wb_lock); - if (test_bit(WB_registered, >wb.state)) - mod_delayed_work(bdi_wq, >wb.dwork, 0); - spin_unlock_bh(>wb_lock); + spin_lock_bh(>work_lock); + if (test_bit(WB_registered, >state)) + mod_delayed_work(bdi_wq, >dwork, 0); + spin_unlock_bh(>work_lock); } -static void bdi_queue_work(struct backing_dev_info *bdi, - struct wb_writeback_work *work) +static void wb_queue_work(struct bdi_writeback *wb, + struct wb_writeback_work *work) { - trace_writeback_queue(bdi, work); + trace_writeback_queue(wb->bdi, work); - spin_lock_bh(>wb_lock); - if (!test_bit(WB_registered, >wb.state)) { + spin_lock_bh(>work_lock); + if (!test_bit(WB_registered, >state)) { if (work->done) complete(work->done); goto out_unlock; } - list_add_tail(>list, >work_list); - mod_delayed_work(bdi_wq, >wb.dwork, 0); + list_add_tail(>list, >work_list); + mod_delayed_work(bdi_wq, >dwork, 0); out_unlock: - spin_unlock_bh(>wb_lock); + spin_unlock_bh(>work_lock); } -static void -__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, - bool range_cyclic, enum wb_reason reason) +static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages, +bool range_cyclic, enum wb_reason reason) { struct wb_writeback_work *work; @@ -134,8 +133,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, */ work = kzalloc(sizeof(*work), GFP_ATOMIC); if (!work) { - trace_writeback_nowork(bdi); - 
bdi_wakeup_thread(bdi); + trace_writeback_nowork(wb->bdi); + wb_wakeup(wb); return; } @@ -144,7 +143,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, work->range_cyclic = range_cyclic; work->reason= reason; - bdi_queue_work(bdi, work); + wb_queue_work(wb, work); } /** @@ -162,7 +161,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, enum wb_reason reason) { - __bdi_start_writeback(bdi, nr_pages, true, reason); + __wb_start_writeback(>wb, nr_pages, true, reason); } /** @@ -182,7 +181,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi) * writeback as soon as there is
[PATCH 07/48] blkcg: implement task_get_blkcg_css()
Implement a wrapper around task_get_css() to acquire the blkcg css for
a given task. The wrapper is necessary for cgroup writeback support
as there will be places outside blkcg proper trying to acquire
blkcg_css and blkio_cgrp_id will be undefined when !CONFIG_BLK_CGROUP.

Signed-off-by: Tejun Heo
---
 include/linux/blk-cgroup.h | 12
 1 file changed, 12 insertions(+)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 65f0c17..4dc643f 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -195,6 +195,12 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
 	return task_blkcg(current);
 }
 
+static inline struct cgroup_subsys_state *
+task_get_blkcg_css(struct task_struct *task)
+{
+	return task_get_css(task, blkio_cgrp_id);
+}
+
 /**
  * blkcg_parent - get the parent of a blkcg
  * @blkcg: blkcg of interest
@@ -573,6 +579,12 @@ struct blkcg_policy {
 
 #define blkcg_root_css	((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
 
+static inline struct cgroup_subsys_state *
+task_get_blkcg_css(struct task_struct *task)
+{
+	return NULL;
+}
+
 #ifdef CONFIG_BLOCK
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
-- 
2.1.0
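The point of the wrapper is that callers outside blkcg never have to name blkio_cgrp_id themselves. A rough userspace sketch of that pattern follows, with loud assumptions: MODEL_BLK_CGROUP stands in for CONFIG_BLK_CGROUP, MODEL_BLKIO_ID for blkio_cgrp_id, and css_model / task_model are hypothetical types. Unlike the real task_get_css(), this model takes no reference on the returned css.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; MODEL_BLK_CGROUP plays the role of CONFIG_BLK_CGROUP. */
struct css_model { int subsys_id; };
struct task_model { struct css_model *css[2]; };

#ifdef MODEL_BLK_CGROUP
/* The subsystem id only exists when the option is enabled. */
enum { MODEL_BLKIO_ID = 1 };

static inline struct css_model *task_get_blkcg_css_model(struct task_model *task)
{
	assert(task != NULL);
	return task->css[MODEL_BLKIO_ID];
}
#else
/* No subsystem id in this configuration, so the stub hands back NULL. */
static inline struct css_model *task_get_blkcg_css_model(struct task_model *task)
{
	(void)task;
	return NULL;
}
#endif
```

Built without -DMODEL_BLK_CGROUP, every call compiles and yields NULL, so call sites need no #ifdef of their own; built with it, the same call returns the task's css slot.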
[PATCH 20/48] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK
cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

inode_cgwb_enabled(), which determines whether both the bdi and the fs
of a given inode support cgroup writeback, is added.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 block/blk-core.c            |  2 +-
 include/linux/backing-dev.h | 32 +++-
 include/linux/fs.h          |  1 +
 init/Kconfig                |  5 +
 4 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fa1314e..c44018a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -606,7 +606,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	q->backing_dev_info.capabilities = 0;
+	q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
 	q->backing_dev_info.name = "block";
 	q->node = node_id;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bfdaa18..6bb3123 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -134,12 +134,15 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
  * BDI_CAP_NO_WRITEBACK:   Don't write pages back
  * BDI_CAP_NO_ACCT_WB:     Don't automatically account writeback pages
  * BDI_CAP_STRICTLIMIT:    Keep number of dirty pages below bdi threshold.
+ *
+ * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
  */
 #define BDI_CAP_NO_ACCT_DIRTY	0x0001
 #define BDI_CAP_NO_WRITEBACK	0x0002
 #define BDI_CAP_NO_ACCT_WB	0x0004
 #define BDI_CAP_STABLE_WRITES	0x0008
 #define BDI_CAP_STRICTLIMIT	0x0010
+#define BDI_CAP_CGROUP_WRITEBACK 0x0020
 
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
 	(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -229,4 +232,31 @@ static inline int bdi_sched_wait(void *word)
 	return 0;
 }
 
-#endif /* _LINUX_BACKING_DEV_H */
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
+ * @inode: inode of interest
+ *
+ * cgroup writeback requires support from both the bdi and filesystem.
+ * Test whether @inode has both.
+ */
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	return bdi_cap_account_dirty(bdi) &&
+		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
+		(inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+	return false;
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
+#endif	/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ccf4b64..bc72737 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1862,6 +1862,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
+#define FS_CGROUP_WRITEBACK	32	/* Supports cgroup-aware writeback */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..9f17798 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1132,6 +1132,11 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config CGROUP_WRITEBACK
+	bool
+	depends on MEMCG && BLK_CGROUP
+	default y
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
-- 
2.1.0
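The three-way test in inode_cgwb_enabled() is easy to model in userspace. The sketch below reuses the flag values from the patch, but everything else is an assumption: bdi_model and inode_model are hypothetical structs, the bdi pointer stands in for whatever inode_to_bdi() would return, and bdi_cap_account_dirty() is assumed to mean "BDI_CAP_NO_ACCT_DIRTY is not set", which the patch itself does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values copied from the patch; the struct layouts are hypothetical. */
#define BDI_CAP_NO_ACCT_DIRTY		0x0001
#define BDI_CAP_CGROUP_WRITEBACK	0x0020
#define FS_CGROUP_WRITEBACK		32

struct bdi_model { unsigned int capabilities; };
struct inode_model {
	const struct bdi_model *bdi;	/* models the result of inode_to_bdi() */
	int fs_flags;			/* models i_sb->s_type->fs_flags */
};

static bool cgwb_enabled_model(const struct inode_model *inode)
{
	bool acct_dirty;

	assert(inode != NULL && inode->bdi != NULL);

	/* Assumed meaning of bdi_cap_account_dirty(): accounting not disabled. */
	acct_dirty = !(inode->bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);

	/* Both the bdi side and the filesystem side must opt in. */
	return acct_dirty &&
	       (inode->bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
	       (inode->fs_flags & FS_CGROUP_WRITEBACK);
}
```

With these definitions, a block-backed bdi (which gets BDI_CAP_CGROUP_WRITEBACK by default after this patch) still reports false until the filesystem type also sets FS_CGROUP_WRITEBACK.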
[PATCH 17/48] bdi: make inode_to_bdi() inline
Now that bdi definitions are moved to backing-dev-defs.h,
backing-dev.h can include blkdev.h and inline inode_to_bdi() without
worrying about introducing circular include dependency. The function
gets called from hot paths and is fairly trivial.

This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the
function calls inline. blockdev_superblock and noop_backing_dev_info
are EXPORT_GPL'd to allow the inline functions to be used from
modules.

While at it, make sb_is_blkdev_sb() return bool instead of int.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Christoph Hellwig
---
 fs/block_dev.c              |  8 ++--
 fs/fs-writeback.c           | 16
 include/linux/backing-dev.h | 18 --
 include/linux/fs.h          |  8 +++-
 mm/backing-dev.c            |  1 +
 5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e4f5f71..875d41a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -549,7 +549,8 @@ static struct file_system_type bd_type = {
 	.kill_sb	= kill_anon_super,
 };
 
-static struct super_block *blockdev_superblock __read_mostly;
+struct super_block *blockdev_superblock __read_mostly;
+EXPORT_SYMBOL_GPL(blockdev_superblock);
 
 void __init bdev_cache_init(void)
 {
@@ -690,11 +691,6 @@ static struct block_device *bd_acquire(struct inode *inode)
 	return bdev;
 }
 
-int sb_is_blkdev_sb(struct super_block *sb)
-{
-	return sb == blockdev_superblock;
-}
-
 /* Call when you free inode */
 
 void bd_forget(struct inode *inode)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7c2f0bd..4fd264d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -66,22 +66,6 @@ int writeback_in_progress(struct backing_dev_info *bdi)
 }
 EXPORT_SYMBOL(writeback_in_progress);
 
-struct backing_dev_info *inode_to_bdi(struct inode *inode)
-{
-	struct super_block *sb;
-
-	if (!inode)
-		return &noop_backing_dev_info;
-
-	sb = inode->i_sb;
-#ifdef CONFIG_BLOCK
-	if (sb_is_blkdev_sb(sb))
-		return blk_get_backing_dev_info(I_BDEV(inode));
-#endif
-	return sb->s_bdi;
-}
-EXPORT_SYMBOL_GPL(inode_to_bdi);
-
 static inline struct inode *wb_inode(struct list_head *head)
 {
 	return list_entry(head, struct inode, i_wb_list);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5e39f7a..7857820 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -11,11 +11,10 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
-struct backing_dev_info *inode_to_bdi(struct inode *inode);
-
 int __must_check bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
@@ -149,6 +148,21 @@ extern struct backing_dev_info noop_backing_dev_info;
 
 int writeback_in_progress(struct backing_dev_info *bdi);
 
+static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+{
+	struct super_block *sb;
+
+	if (!inode)
+		return &noop_backing_dev_info;
+
+	sb = inode->i_sb;
+#ifdef CONFIG_BLOCK
+	if (sb_is_blkdev_sb(sb))
+		return blk_get_backing_dev_info(I_BDEV(inode));
+#endif
+	return sb->s_bdi;
+}
+
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	if (bdi->congested_fn)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..ccf4b64 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2205,7 +2205,13 @@ extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
-extern int sb_is_blkdev_sb(struct super_block *sb);
+
+extern struct super_block *blockdev_superblock;
+
+static inline bool sb_is_blkdev_sb(struct super_block *sb)
+{
+	return sb == blockdev_superblock;
+}
 #else
 static inline void bd_forget(struct inode *inode) {}
 static inline int sync_blockdev(struct block_device *bdev) { return 0; }
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ff85ecb..b0707d1 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -18,6 +18,7 @@ struct backing_dev_info noop_backing_dev_info = {
 	.name		= "noop",
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
+EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 
 static struct class *bdi_class;
-- 
2.1.0
[PATCH 19/48] bdi: separate out congested state into a separate struct
Currently, a wb's (bdi_writeback) congestion state is carried in its ->state field; however, cgroup writeback support will require multiple wb's sharing the same congestion state. This patch separates out congestion state into its own struct - struct bdi_writeback_congested. A new wb field, congested, points to the associated congested struct. The default wb, bdi->wb, always points to bdi->wb_congested.

While this patch adds a layer of indirection, it doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo
---
 include/linux/backing-dev-defs.h | 14 --
 include/linux/backing-dev.h | 2 +-
 mm/backing-dev.c | 7 +--
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index aa18c4b..9e9eafa 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -16,12 +16,15 @@ struct dentry;
  * Bits in bdi_writeback.state
  */
 enum wb_state {
-	WB_async_congested,	/* The async (write) queue is getting full */
-	WB_sync_congested,	/* The sync queue is getting full */
 	WB_registered,		/* bdi_register() was done */
 	WB_writeback_running,	/* Writeback is in progress */
 };
 
+enum wb_congested_state {
+	WB_async_congested,	/* The async (write) queue is getting full */
+	WB_sync_congested,	/* The sync queue is getting full */
+};
+
 typedef int (congested_fn)(void *, int);
 
 enum wb_stat_item {
@@ -34,6 +37,10 @@ enum wb_stat_item {
 
 #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+struct bdi_writeback_congested {
+	unsigned long state;		/* WB_[a]sync_congested flags */
+};
+
 struct bdi_writeback {
 	struct backing_dev_info *bdi;	/* our parent bdi */
 
@@ -48,6 +55,8 @@ struct bdi_writeback {
 
 	struct percpu_counter stat[NR_WB_STAT_ITEMS];
 
+	struct bdi_writeback_congested *congested;
+
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
 	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
@@ -84,6 +93,7 @@ struct backing_dev_info {
 	unsigned int max_ratio, max_prop_frac;
 
 	struct bdi_writeback wb;  /* default writeback info for this bdi */
+	struct bdi_writeback_congested wb_congested;
 
 	struct device *dev;
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 7857820..bfdaa18 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,7 +167,7 @@ static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	if (bdi->congested_fn)
 		return bdi->congested_fn(bdi->congested_data, bdi_bits);
-	return (bdi->wb.state & bdi_bits);
+	return (bdi->wb.congested->state & bdi_bits);
 }
 
 static inline int bdi_read_congested(struct backing_dev_info *bdi)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 805b287..5ec7658 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -383,6 +383,9 @@ int bdi_init(struct backing_dev_info *bdi)
 	if (err)
 		return err;
 
+	bdi->wb_congested.state = 0;
+	bdi->wb.congested = &bdi->wb_congested;
+
 	return 0;
 }
 EXPORT_SYMBOL(bdi_init);
@@ -504,7 +507,7 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? WB_sync_congested : WB_async_congested;
-	if (test_and_clear_bit(bit, &bdi->wb.state))
+	if (test_and_clear_bit(bit, &bdi->wb.congested->state))
 		atomic_dec(&nr_bdi_congested[sync]);
 	smp_mb__after_atomic();
 	if (waitqueue_active(wqh))
@@ -517,7 +520,7 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	enum wb_state bit;
 
 	bit = sync ? WB_sync_congested : WB_async_congested;
-	if (!test_and_set_bit(bit, &bdi->wb.state))
+	if (!test_and_set_bit(bit, &bdi->wb.congested->state))
 		atomic_inc(&nr_bdi_congested[sync]);
 }
 EXPORT_SYMBOL(set_bdi_congested);
-- 
2.1.0
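The indirection introduced by the patch above can be modeled in userspace roughly as follows. This is only a sketch: the struct and enum names mirror the patch, but the three helper functions are hypothetical, use plain non-atomic bit operations instead of test_and_set_bit() and friends, and model only the new pointer of struct bdi_writeback.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers, mirroring enum wb_congested_state from the patch. */
enum wb_congested_state {
	WB_async_congested,	/* the async (write) queue is getting full */
	WB_sync_congested,	/* the sync queue is getting full */
};

/* Congestion flags live in their own struct so several wb's can share them. */
struct bdi_writeback_congested {
	unsigned long state;	/* WB_[a]sync_congested flags */
};

/* Userspace stand-in for bdi_writeback: only the new pointer is modeled. */
struct bdi_writeback {
	struct bdi_writeback_congested *congested;
};

/* Hypothetical helper (not kernel API): set a congestion bit via the wb. */
static void wb_set_congested(struct bdi_writeback *wb, int bit)
{
	wb->congested->state |= 1UL << bit;
}

/* Hypothetical helper (not kernel API): clear a congestion bit via the wb. */
static void wb_clear_congested(struct bdi_writeback *wb, int bit)
{
	wb->congested->state &= ~(1UL << bit);
}

/* Hypothetical helper (not kernel API): test a congestion bit via the wb. */
static bool wb_is_congested(const struct bdi_writeback *wb, int bit)
{
	return wb->congested->state & (1UL << bit);
}
```

With two wb's pointing at the same bdi_writeback_congested, a bit set through one is immediately visible through the other, which is the sharing the commit message says cgroup writeback will need.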
[PATCHSET 2/3 block/for-4.1/core] writeback: cgroup writeback backpressure propagation
Hello,

While the previous patchset[2] implemented cgroup writeback support, the IO back pressure propagation mechanism implemented in balance_dirty_pages() and its subroutines isn't yet aware of cgroup writeback.

Processes belonging to a memcg may have access to only a subset of the total memory available in the system. Not factoring this into dirty throttling rendered it completely ineffective for processes under memcg limits, and memcg ended up building a separate ad-hoc degenerate mechanism directly into vmscan code to limit page dirtying.

This patchset refactors the dirty throttling logic implemented in balance_dirty_pages() and its subroutines so that it can handle both global and memcg memory domains. The dirty throttling mechanism is applied against both the global and memcg constraints, and the more restricted of the two is used for actual throttling. This makes the dirty throttling mechanism operational for memcg domains, including writeback-bandwidth-proportional dirty page distribution inside them.

This patchset contains the following 18 patches.
 0001-memcg-make-mem_cgroup_read_-stat-event-iterate-possi.patch
 0002-writeback-reorganize-__-wb_update_bandwidth.patch
 0003-writeback-implement-wb_domain.patch
 0004-writeback-move-global_dirty_limit-into-wb_domain.patch
 0005-writeback-consolidate-dirty-throttle-parameters-into.patch
 0006-writeback-add-dirty_throttle_control-wb_bg_thresh.patch
 0007-writeback-make-__wb_dirty_limit-take-dirty_throttle_.patch
 0008-writeback-add-dirty_throttle_control-pos_ratio.patch
 0009-writeback-add-dirty_throttle_control-wb_completions.patch
 0010-writeback-add-dirty_throttle_control-dom.patch
 0011-writeback-make-__wb_writeout_inc-and-hard_dirty_limi.patch
 0012-writeback-separate-out-domain_dirty_limits.patch
 0013-writeback-move-over_bground_thresh-to-mm-page-writeb.patch
 0014-writeback-update-wb_over_bg_thresh-to-use-wb_domain-.patch
 0015-writeback-implement-memcg-wb_domain.patch
 0016-writeback-reset-wb_domain-dirty_limit-_tstmp-when-me.patch
 0017-writeback-implement-memcg-writeback-domain-based-thr.patch
 0018-mm-vmscan-remove-memcg-stalling-on-writeback-pages-d.patch

0001-0002 are prep patches.

0003-0014 refactor dirty throttling logic so that it operates on wb_domain.

0015-0018 implement memcg wb_domain.

This patchset is on top of

  block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
+ [1] [PATCH] writeback: fix possible underflow in write bandwidth calculation
+ [2] [PATCHSET 1/3 v2 block/for-4.1/core] writeback: cgroup writeback support

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-backpressure-20150322

diffstat follows. Thanks.
 fs/fs-writeback.c | 32 -
 include/linux/backing-dev-defs.h | 1
 include/linux/memcontrol.h | 21 +
 include/linux/writeback.h | 82 +++-
 include/trace/events/writeback.h | 7
 mm/backing-dev.c | 9
 mm/memcontrol.c | 145 +--
 mm/page-writeback.c | 722 +--
 mm/vmscan.c | 109 +
 9 files changed, 716 insertions(+), 412 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/1420579582-8516-1-git-send-email...@kernel.org
[1] http://lkml.kernel.org/g/20150323041848.ga8...@htj.duckdns.org
[2] http://lkml.kernel.org/g/1427086499-15657-1-git-send-email...@kernel.org
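One way to picture "the more restricted of the two is used" from the cover letter is sketched below. Everything here is an assumption made for illustration: struct domain_state and both functions are hypothetical and do not appear in the patchset, and "more restricted" is read simply as "leaves less room before the dirty threshold", which is not necessarily the criterion the patches use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-domain snapshot; not a structure from the patchset. */
struct domain_state {
	unsigned long dirty;	/* dirty pages currently charged to the domain */
	unsigned long thresh;	/* dirty threshold of the domain */
};

/* Pages that may still be dirtied before the domain hits its threshold. */
static unsigned long domain_headroom(const struct domain_state *d)
{
	return d->thresh > d->dirty ? d->thresh - d->dirty : 0;
}

/*
 * Pick the domain that leaves less headroom.  @memcg may be NULL for a
 * task that is only subject to the global domain.
 */
static const struct domain_state *
more_restricted(const struct domain_state *global,
		const struct domain_state *memcg)
{
	if (!memcg)
		return global;
	return domain_headroom(memcg) < domain_headroom(global) ? memcg : global;
}
```

A task in a small memcg that is close to its own threshold is then throttled by the memcg domain even when the system as a whole has plenty of headroom.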
[PATCH 21/48] writeback: make backing_dev_info host cgroup-specific bdi_writebacks
For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg.

bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb.

v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 block/blk-cgroup.c | 7 +-
 fs/fs-writeback.c | 8 +-
 fs/inode.c | 1 +
 include/linux/backing-dev-defs.h | 59 +-
 include/linux/backing-dev.h | 195 +++
 include/linux/blk-cgroup.h | 4 +
 include/linux/fs.h | 4 +
 include/linux/memcontrol.h | 4 +
 mm/backing-dev.c | 398 +++
 mm/memcontrol.c | 19 +-
 mm/page-writeback.c | 11 +-
 11 files changed, 699 insertions(+), 11 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9e0fe38..d2b1cbf 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -811,6 +812,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css)
 	}
 
 	spin_unlock_irq(&blkcg->lock);
+
+	wb_blkcg_offline(blkcg);
 }
 
 static void blkcg_css_free(struct cgroup_subsys_state *css)
@@ -841,7 +844,9 @@ done:
 	spin_lock_init(&blkcg->lock);
 	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+	INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
 	return &blkcg->css;
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4fd264d..48db5e6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -173,11 +173,11 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback *wb = inode_to_wb(inode);
 
-	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&wb->list_lock);
 	list_del_init(&inode->i_wb_list);
-	spin_unlock(&bdi->wb.list_lock);
+	spin_unlock(&wb->list_lock);
 }
 
 /*
@@ -1200,6 +1200,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
+		inode_attach_wb(inode, NULL);
+
 		if (flags & I_DIRTY_INODE)
 			inode->i_state &= ~I_DIRTY_TIME;
 		inode->i_state |= flags;
diff --git a/fs/inode.c b/fs/inode.c
index f00b16f..55cedf8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -223,6 +223,7 @@ EXPORT_SYMBOL(free_inode_nonrcu);
 void __destroy_inode(struct inode *inode)
 {
 	BUG_ON(inode_has_buffers(inode));
+	inode_detach_wb(inode);
 	security_inode_free(inode);
 	fsnotify_inode_delete(inode);
 	locks_free_lock_context(inode->i_flctx);
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 9e9eafa..a1e9c40 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -2,8 +2,11 @@
 #define __LINUX_BACKING_DEV_DEFS_H
 
 #include
+#include
+#include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -37,10 +40,43 @@ enum wb_stat_item {
 
 #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+/*
+ * For cgroup writeback, multiple wb's may map to the same blkcg. Those
+ * wb's can operate mostly independently but should share the congested
+ * state. To facilitate such sharing, the congested state is tracked using
+ * the following struct which is created on demand, indexed by blkcg ID on
+ * its bdi, and refcounted.
+ */
 struct bdi_writeback_congested {
 	unsigned long state;		/* WB_[a]sync_congested flags */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct backing_dev_info *bdi;	/* the associated bdi */
+
[PATCH 03/18] writeback: implement wb_domain
Dirtyable memory is distributed to a wb (bdi_writeback) according to the relative bandwidth the wb is writing out in the whole system. This distribution is global - each wb is measured against all other wb's and gets the proportionately sized portion of the memory in the whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by memcg and thus each wb would need to be measured and controlled in its memcg. IOW, a wb will belong to two writeback domains - the global and memcg domains.

Currently, what constitutes the global writeback domain is scattered across a number of global states. This patch starts collecting them into struct wb_domain.

* fprop_global which serves as the basis for proportional bandwidth measurement and its period timer are moved into struct wb_domain.

* global_wb_domain hosts the states for the global domain.

* While at it, flatten wb_writeout_fraction() into its callers. This thin wrapper doesn't provide any actual benefits while getting in the way.

This is pure reorganization and doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 include/linux/writeback.h | 32 +
 mm/page-writeback.c | 72 ++-
 2 files changed, 59 insertions(+), 45 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 82e0e39..5af0a57e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 
 DECLARE_PER_CPU(int, dirty_throttle_leaks);
 
@@ -87,6 +88,36 @@ struct writeback_control {
 };
 
 /*
+ * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
+ * and are measured against each other in. There always is one global
+ * domain, global_wb_domain, that every wb in the system is a member of.
+ * This allows measuring the relative bandwidth of each wb to distribute
+ * dirtyable memory accordingly.
+ */
+struct wb_domain {
+	/*
+	 * Scale the writeback cache size proportional to the relative
+	 * writeout speed.
+	 *
+	 * We do this by keeping a floating proportion between BDIs, based
+	 * on page writeback completions [end_page_writeback()]. Those
+	 * devices that write out pages fastest will get the larger share,
+	 * while the slower will get a smaller share.
+	 *
+	 * We use page writeout completions because we are interested in
+	 * getting rid of dirty pages. Having them written out is the
+	 * primary goal.
+	 *
+	 * We introduce a concept of time, a period over which we measure
+	 * these events, because demand can/will vary over time. The length
+	 * of this period itself is measured in page writeback completions.
+	 */
+	struct fprop_global completions;
+	struct timer_list period_timer;	/* timer for aging of completions */
+	unsigned long period_time;
+};
+
+/*
  * fs/fs-writeback.c
  */
 struct bdi_writeback;
@@ -120,6 +151,7 @@ static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
+int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
 
 extern unsigned long global_dirty_limit;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d9ebabe..3c6ccc7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,29 +124,7 @@ EXPORT_SYMBOL(laptop_mode);
 
 unsigned long global_dirty_limit;
 
-/*
- * Scale the writeback cache size proportional to the relative writeout speeds.
- *
- * We do this by keeping a floating proportion between BDIs, based on page
- * writeback completions [end_page_writeback()]. Those devices that write out
- * pages fastest will get the larger share, while the slower will get a smaller
- * share.
- *
- * We use page writeout completions because we are interested in getting rid of
- * dirty pages. Having them written out is the primary goal.
- *
- * We introduce a concept of time, a period over which we measure these events,
- * because demand can/will vary over time. The length of this period itself is
- * measured in page writeback completions.
- *
- */
-static struct fprop_global writeout_completions;
-
-static void writeout_period(unsigned long t);
-/* Timer for aging of writeout_completions */
-static struct timer_list writeout_period_timer =
-		TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0);
-static unsigned long writeout_period_time = 0;
+static struct wb_domain global_wb_domain;
 
 /*
  * Length of period for aging writeout fractions of bdis. This is an
@@ -433,24 +411,26 @@ static unsigned long wp_next_time(unsigned long cur_time)
 }
 
 /*
- * Increment the BDI's writeout completion count and the global writeout
+ * Increment the wb's writeout completion count and the global writeout
  * completion count. Called from
[PATCH 01/18] memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online
cpu_possible_mask represents the CPUs which are actually possible during that boot instance. For systems which don't support CPU hotplug, this will match cpu_online_mask exactly in most cases. Even for systems which support CPU hotplug, the number of possible CPU slots is highly unlikely to diverge greatly from the number of online CPUs. The only cases where the difference between possible and online caused problems were when the boot code failed to initialize the possible mask and left it fully set at NR_CPUS - 1.

As such, most per-cpu constructs allocate for all possible CPUs and often iterate over the possibles, which also has the benefit of avoiding the blocking CPU hotplug synchronization.

memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and mem_cgroup_read_events(), which iterates over online CPUs and handles CPU hotplug operations explicitly. This complexity doesn't actually buy anything. Switch to iterating over the possibles and drop the explicit CPU hotplug handling.

Eventually, we want to convert memcg to use percpu_counter instead of its own custom implementation which also benefits from quick access w/o summing for cases where larger error margin is acceptable.

This will allow mem_cgroup_read_stat() to be called from non-sleepable contexts which will be used by cgroup writeback.

Signed-off-by: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
---
 mm/memcontrol.c | 51 ++-
 1 file changed, 2 insertions(+), 49 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6fa6fe..ab483e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -323,11 +323,6 @@ struct mem_cgroup {
 	 * percpu counter.
 	 */
 	struct mem_cgroup_stat_cpu __percpu *stat;
-	/*
-	 * used when a cpu is offlined or other synchronizations
-	 * See mem_cgroup_read_stat().
-	 */
-	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@@ -808,15 +803,8 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	long val = 0;
 	int cpu;
 
-	get_online_cpus();
-	for_each_online_cpu(cpu)
+	for_each_possible_cpu(cpu)
 		val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.count[idx];
-	spin_unlock(&memcg->pcp_counter_lock);
-#endif
-	put_online_cpus();
 	return val;
 }
 
@@ -826,15 +814,8 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 	unsigned long val = 0;
 	int cpu;
 
-	get_online_cpus();
-	for_each_online_cpu(cpu)
+	for_each_possible_cpu(cpu)
 		val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.events[idx];
-	spin_unlock(&memcg->pcp_counter_lock);
-#endif
-	put_online_cpus();
 	return val;
 }
 
@@ -2182,37 +2163,12 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 	mutex_unlock(&percpu_charge_mutex);
 }
 
-/*
- * This function drains percpu counter value from DEAD cpu and
- * move it to local cpu. Note that this function can be preempted.
- */
-static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
-{
-	int i;
-
-	spin_lock(&memcg->pcp_counter_lock);
-	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-		long x = per_cpu(memcg->stat->count[i], cpu);
-
-		per_cpu(memcg->stat->count[i], cpu) = 0;
-		memcg->nocpu_base.count[i] += x;
-	}
-	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
-		unsigned long x = per_cpu(memcg->stat->events[i], cpu);
-
-		per_cpu(memcg->stat->events[i], cpu) = 0;
-		memcg->nocpu_base.events[i] += x;
-	}
-	spin_unlock(&memcg->pcp_counter_lock);
-}
-
 static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 					unsigned long action,
 					void *hcpu)
 {
 	int cpu = (unsigned long)hcpu;
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *iter;
 
 	if (action == CPU_ONLINE)
 		return NOTIFY_OK;
@@ -2220,9 +2176,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
 		return NOTIFY_OK;
 
-	for_each_mem_cgroup(iter)
-		mem_cgroup_drain_pcp_counter(iter, cpu);
-
 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);
 	return NOTIFY_OK;
-- 
2.1.0
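The difference between the two read schemes in the patch above can be illustrated with a small userspace model. This is a sketch under stated assumptions: NR_POSSIBLE_CPUS, struct pcpu_counter and both functions are hypothetical stand-ins for the kernel's per-cpu machinery, and no locking or hotplug synchronization is modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_POSSIBLE_CPUS 4	/* hypothetical: slots that can ever come online */

/* Userspace model of a per-cpu counter: one slot per possible CPU. */
struct pcpu_counter {
	long count[NR_POSSIBLE_CPUS];
};

/*
 * Old scheme: only online slots are summed, so the value left in an
 * offlined slot has to be drained into @nocpu_base or it is lost.
 */
static long read_online(const struct pcpu_counter *c,
			const bool online[NR_POSSIBLE_CPUS], long nocpu_base)
{
	long val = nocpu_base;

	for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
		if (online[cpu])
			val += c->count[cpu];
	return val;
}

/* New scheme: every possible slot is summed; hotplug needs no handling. */
static long read_possible(const struct pcpu_counter *c)
{
	long val = 0;

	for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
		val += c->count[cpu];
	return val;
}
```

In this model read_online() only agrees with read_possible() when the offline slots have been drained into nocpu_base, which is the bookkeeping the patch removes.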
[PATCH 05/18] writeback: consolidate dirty throttle parameters into dirty_throttle_control
Dirty throttling implemented in balance_dirty_pages() and its subroutines makes use of a number of parameters which are passed around individually. This renders these functions somewhat unwieldy and makes it difficult to add or change the involved parameters. Also, some functions use different or conflicting naming schemes for the same parameters, making the code confusing to follow.

This patch consolidates the main parameters into struct dirty_throttle_control so that they can be passed around easily and adding new parameters isn't painful. This also unifies how a given parameter is named and accessed. The drawback of using this type of control structure rather than explicit parameters is that it isn't immediately obvious which function accesses and modifies what; however, it's fairly clear that the benefits outweigh the drawback in this case.

GDTC_INIT() macro is provided to ease initializing dirty_throttle_control for the global_wb_domain, and balance_dirty_pages() uses a separate pointer to point to its global dirty_throttle_control. This is to make it uniform with memcg domain handling which will be added later.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 mm/page-writeback.c | 212 +---
 1 file changed, 101 insertions(+), 111 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 06c5d3a..b8e95a4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,6 +124,20 @@ EXPORT_SYMBOL(laptop_mode);
 
 struct wb_domain global_wb_domain;
 
+/* consolidated parameters for balance_dirty_pages() and its subroutines */
+struct dirty_throttle_control {
+	struct bdi_writeback	*wb;
+
+	unsigned long		dirty;		/* file_dirty + write + nfs */
+	unsigned long		thresh;		/* dirty threshold */
+	unsigned long		bg_thresh;	/* dirty background threshold */
+
+	unsigned long		wb_dirty;	/* per-wb counterparts */
+	unsigned long		wb_thresh;
+};
+
+#define GDTC_INIT(__wb)		.wb = (__wb)
+
 /*
  * Length of period for aging writeout fractions of bdis. This is an
  * arbitrarily chosen number. The longer the period, the slower fractions will
@@ -695,16 +709,13 @@ static long long pos_ratio_polynom(unsigned long setpoint,
  *   card's wb_dirty may rush to many times higher than wb_setpoint.
  * - the wb dirty thresh drops quickly due to change of JBOD workload
  */
-static unsigned long wb_position_ratio(struct bdi_writeback *wb,
-					unsigned long thresh,
-					unsigned long bg_thresh,
-					unsigned long dirty,
-					unsigned long wb_thresh,
-					unsigned long wb_dirty)
+static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
 {
+	struct bdi_writeback *wb = dtc->wb;
 	unsigned long write_bw = wb->avg_write_bandwidth;
-	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
-	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
+	unsigned long limit = hard_dirty_limit(dtc->thresh);
+	unsigned long wb_thresh = dtc->wb_thresh;
 	unsigned long x_intercept;
 	unsigned long setpoint;		/* dirty pages' target balance point */
 	unsigned long wb_setpoint;
@@ -712,7 +723,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
 	long long pos_ratio;		/* for scaling up/down the rate limit */
 	long x;
 
-	if (unlikely(dirty >= limit))
+	if (unlikely(dtc->dirty >= limit))
 		return 0;
 
 	/*
@@ -721,7 +732,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
 	 * See comment for pos_ratio_polynom().
 	 */
 	setpoint = (freerun + limit) / 2;
-	pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
+	pos_ratio = pos_ratio_polynom(setpoint, dtc->dirty, limit);
 
 	/*
 	 * The strictlimit feature is a tool preventing mistrusted filesystems
@@ -752,20 +763,21 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
 		long long wb_pos_ratio;
 		unsigned long wb_bg_thresh;
 
-		if (wb_dirty < 8)
+		if (dtc->wb_dirty < 8)
 			return min_t(long long, pos_ratio * 2,
 				     2 << RATELIMIT_CALC_SHIFT);
 
-		if (wb_dirty >= wb_thresh)
+		if (dtc->wb_dirty >= wb_thresh)
 			return 0;
 
-		wb_bg_thresh = div_u64((u64)wb_thresh * bg_thresh, thresh);
+		wb_bg_thresh = div_u64((u64)wb_thresh * dtc->bg_thresh,
+
[PATCH 06/18] writeback: add dirty_throttle_control->wb_bg_thresh
wb_bg_thresh is currently treated as a second-class citizen. It's only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages() doesn't calculate it unless the cap is set. When the cap is set, the calculated value is not passed around but instead recalculated whenever it's used.

wb_position_ratio() calculates it by scaling wb_thresh proportional to bg_thresh / thresh. wb_update_dirty_ratelimit() uses wb_dirty_limit() on bg_thresh, which should generally lead to a similar result as the proportional scaling but can also be way off in the presence of max/min_ratio settings.

Avoiding wb_bg_thresh calculation saves us one u64 multiplication and division when BDI_CAP_STRICTLIMIT is not set. Given that balance_dirty_pages() is already ratelimited, this doesn't justify the incurred extra complexity.

This patch adds wb_bg_thresh to dirty_throttle_control and makes wb_dirty_limits() always calculate it and updates the users to use the pre-calculated value.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 mm/page-writeback.c | 27 +++
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b8e95a4..00218e9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -134,6 +134,7 @@ struct dirty_throttle_control {
 
 	unsigned long		wb_dirty;	/* per-wb counterparts */
 	unsigned long		wb_thresh;
+	unsigned long		wb_bg_thresh;
 };
 
 #define GDTC_INIT(__wb)		.wb = (__wb)
@@ -761,7 +762,6 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
 	 */
 	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
 		long long wb_pos_ratio;
-		unsigned long wb_bg_thresh;
 
 		if (dtc->wb_dirty < 8)
 			return min_t(long long, pos_ratio * 2,
@@ -770,9 +770,8 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
 		if (dtc->wb_dirty >= wb_thresh)
 			return 0;
 
-		wb_bg_thresh = div_u64((u64)wb_thresh * dtc->bg_thresh,
-				       dtc->thresh);
-		wb_setpoint = dirty_freerun_ceiling(wb_thresh, wb_bg_thresh);
+		wb_setpoint = dirty_freerun_ceiling(wb_thresh,
+						    dtc->wb_bg_thresh);
 		if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
 			return 0;
 
@@ -1104,15 +1103,14 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
 	 *
 	 * We rampup dirty_ratelimit forcibly if wb_dirty is low because
 	 * it's possible that wb_thresh is close to zero due to inactivity
-	 * of backing device (see the implementation of wb_dirty_limit()).
+	 * of backing device.
 	 */
 	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
 		dirty = dtc->wb_dirty;
 		if (dtc->wb_dirty < 8)
 			setpoint = dtc->wb_dirty + 1;
 		else
-			setpoint = (dtc->wb_thresh +
-				    wb_dirty_limit(wb, dtc->bg_thresh)) / 2;
+			setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
 	}
 
 	if (dirty < setpoint) {
@@ -1307,8 +1305,7 @@ static long wb_min_pause(struct bdi_writeback *wb,
 	return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
 }
 
-static inline void wb_dirty_limits(struct dirty_throttle_control *dtc,
-				   unsigned long *wb_bg_thresh)
+static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 {
 	struct bdi_writeback *wb = dtc->wb;
 	unsigned long wb_reclaimable;
@@ -1327,11 +1324,8 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc,
 	 * at some rate <= (write_bw / 2) for bringing down wb_dirty.
 	 */
 	dtc->wb_thresh = wb_dirty_limit(dtc->wb, dtc->thresh);
-
-	if (wb_bg_thresh)
-		*wb_bg_thresh = dtc->thresh ? div_u64((u64)dtc->wb_thresh *
-							dtc->bg_thresh,
-							dtc->thresh) : 0;
+	dtc->wb_bg_thresh = dtc->thresh ?
+		div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
 
 	/*
 	 * In order to avoid the stacked BDI deadlock we need
@@ -1396,10 +1390,11 @@ static void balance_dirty_pages(struct address_space *mapping,
 		global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh);
 
 		if (unlikely(strictlimit)) {
-			wb_dirty_limits(gdtc, &bg_thresh);
+			wb_dirty_limits(gdtc);
 
 			dirty = gdtc->wb_dirty;
 			thresh = gdtc->wb_thresh;
+			bg_thresh = gdtc->wb_bg_thresh;
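The proportional scaling described in the commit message above, wb_bg_thresh = wb_thresh * bg_thresh / thresh, can be worked through in userspace as follows. This is a sketch: the two function names are hypothetical, and plain 64-bit division stands in for the kernel's div_u64().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale @wb_thresh by bg_thresh / thresh.  The multiplication is done in
 * 64 bits before dividing, and a zero @thresh yields 0, as in the patch.
 */
static unsigned long calc_wb_bg_thresh(unsigned long wb_thresh,
				       unsigned long bg_thresh,
				       unsigned long thresh)
{
	if (!thresh)
		return 0;
	return (unsigned long)((uint64_t)wb_thresh * bg_thresh / thresh);
}

/*
 * Strictlimit setpoint as used in wb_update_dirty_ratelimit() after the
 * patch: the midpoint between the wb's threshold and its background
 * threshold.
 */
static unsigned long strictlimit_setpoint(unsigned long wb_thresh,
					  unsigned long wb_bg_thresh)
{
	return (wb_thresh + wb_bg_thresh) / 2;
}
```

For example, with thresh = 1000 and bg_thresh = 500 pages, a wb whose wb_thresh is 2000 pages gets wb_bg_thresh = 2000 * 500 / 1000 = 1000 pages and a strictlimit setpoint of 1500 pages.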
[PATCH 07/18] writeback: make __wb_dirty_limit() take dirty_throttle_control
wb_dirty_limit() calculates wb_dirty by scaling thresh according to the wb's portion in the system-wide write bandwidth. cgroup writeback support would need to calculate wb_dirty against memcg domain too. This patch renames wb_dirty_limit() to __wb_dirty_limit() and makes it take dirty_throttle_control so that the function can later be updated to calculate against different domains according to dirty_throttle_control.

wb_dirty_limit() is now a thin wrapper around __wb_dirty_limit().

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 mm/page-writeback.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 00218e9..a4b6dab 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -557,9 +557,8 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
 }
 
 /**
- * wb_dirty_limit - @wb's share of dirty throttling threshold
- * @wb: bdi_writeback to query
- * @dirty: global dirty limit in pages
+ * __wb_dirty_limit - @wb's share of dirty throttling threshold
+ * @dtc: dirty_throttle_context of interest
  *
  * Returns @wb's dirty limit in pages. The term "dirty" in the context of
  * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
@@ -578,9 +577,10 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
  * The wb's share of dirty limit will be adapting to its throughput and
  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
  */
-unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
+static unsigned long __wb_dirty_limit(struct dirty_throttle_control *dtc)
 {
 	struct wb_domain *dom = &global_wb_domain;
+	unsigned long dirty = dtc->dirty;
 	u64 wb_dirty;
 	long numerator, denominator;
 	unsigned long wb_min_ratio, wb_max_ratio;
@@ -588,14 +588,14 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
 	/*
 	 * Calculate this BDI's share of the dirty ratio.
 	 */
-	fprop_fraction_percpu(&dom->completions, &wb->completions,
+	fprop_fraction_percpu(&dom->completions, &dtc->wb->completions,
 			      &numerator, &denominator);
 
 	wb_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
 	wb_dirty *= numerator;
 	do_div(wb_dirty, denominator);
 
-	wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
+	wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);
 
 	wb_dirty += (dirty * wb_min_ratio) / 100;
 	if (wb_dirty > (dirty * wb_max_ratio) / 100)
@@ -604,6 +604,13 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
 	return wb_dirty;
 }
 
+unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
+{
+	struct dirty_throttle_control gdtc = { GDTC_INIT(wb), .dirty = dirty };
+
+	return __wb_dirty_limit(&gdtc);
+}
+
 /*
  *                           setpoint - dirty 3
  *        f(dirty) := 1.0 + (----------------)
@@ -1323,7 +1330,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 	 * wb_position_ratio() will let the dirtier task progress
 	 * at some rate <= (write_bw / 2) for bringing down wb_dirty.
 	 */
-	dtc->wb_thresh = __wb_dirty_limit(dtc);
+	dtc->wb_thresh = __wb_dirty_limit(dtc);
 	dtc->wb_bg_thresh = dtc->thresh ?
 		div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
 
-- 
2.1.0
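The share calculation visible in the __wb_dirty_limit() hunks above can be traced with concrete numbers using a userspace sketch. Several things here are assumptions: wb_share() is a hypothetical name, numerator / denominator stand for the fraction that fprop_fraction_percpu() would report, total_min_ratio stands for the global bdi_min_ratio, and the final clamp assignment is filled in from the surrounding condition because that line falls outside the quoted diff context.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute a wb's share of @dirty pages.
 *
 * @numerator / @denominator: the wb's fraction of recent writeout
 * completions.  @total_min_ratio: sum of all reserved minimum
 * percentages.  @wb_min_ratio / @wb_max_ratio: this wb's own bounds,
 * in percent.
 */
static unsigned long wb_share(unsigned long dirty,
			      long numerator, long denominator,
			      unsigned int total_min_ratio,
			      unsigned int wb_min_ratio,
			      unsigned int wb_max_ratio)
{
	uint64_t wb_dirty;

	assert(numerator >= 0 && denominator > 0);

	/* distribute what is left after the reserved minimums ... */
	wb_dirty = (uint64_t)dirty * (100 - total_min_ratio) / 100;
	/* ... according to the wb's bandwidth fraction ... */
	wb_dirty *= numerator;
	wb_dirty /= denominator;
	/* ... then add this wb's guaranteed minimum ... */
	wb_dirty += (uint64_t)dirty * wb_min_ratio / 100;
	/* ... and clamp to its maximum (assumed clamp, see lead-in). */
	if (wb_dirty > (uint64_t)dirty * wb_max_ratio / 100)
		wb_dirty = (uint64_t)dirty * wb_max_ratio / 100;

	return (unsigned long)wb_dirty;
}
```

With dirty = 10000 pages and a wb responsible for a quarter of recent completions, the share is 2500 pages without ratio settings, 1000 pages with a max_ratio of 10, and 4000 pages when 20 percent is reserved and that same wb holds the whole reservation.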
[PATCH 09/18] writeback: add dirty_throttle_control->wb_completions
wb->completions measures the wb's proportional write bandwidth in global_wb_domain and is thus naturally tied to the wb_domain. This patch adds dirty_throttle_control->wb_completions which is initialized to wb->completions by GDTC_INIT() and updates __wb_dirty_limit() to use it instead of dereferencing wb->completions directly. This will allow dirty_throttle_control to represent different wb_domains and the matching wb completions.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 mm/page-writeback.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ac2d7b1..1f216cf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -127,6 +127,7 @@ struct wb_domain global_wb_domain;
 /* consolidated parameters for balance_dirty_pages() and its subroutines */
 struct dirty_throttle_control {
 	struct bdi_writeback	*wb;
+	struct fprop_local_percpu *wb_completions;
 
 	unsigned long		dirty;		/* file_dirty + write + nfs */
 	unsigned long		thresh;		/* dirty threshold */
@@ -139,7 +140,8 @@ struct dirty_throttle_control {
 	unsigned long		pos_ratio;
 };
 
-#define GDTC_INIT(__wb)		.wb = (__wb)
+#define GDTC_INIT(__wb)		.wb = (__wb),				\
+				.wb_completions = &(__wb)->completions
 
 /*
  * Length of period for aging writeout fractions of bdis. This is an
@@ -590,7 +592,7 @@ static unsigned long __wb_dirty_limit(struct dirty_throttle_control *dtc)
 	/*
 	 * Calculate this BDI's share of the dirty ratio.
 	 */
-	fprop_fraction_percpu(&dom->completions, &dtc->wb->completions,
+	fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
 			      &numerator, &denominator);
 
 	wb_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
-- 
2.1.0
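As an aside, the GDTC_INIT() trick relies on a macro expanding to a comma-separated run of designated initializers. A minimal userspace sketch of that idiom follows; `completions`, `writer`, `throttle_ctl`, `CTL_INIT()` and `writer_events()` are hypothetical stand-ins chosen for the illustration, not the kernel's types.

```c
/*
 * Hypothetical stand-ins for fprop_local_percpu, bdi_writeback and
 * dirty_throttle_control; only the fields this sketch needs.
 */
struct completions { long events; };
struct writer { struct completions completions; };

struct throttle_ctl {
	struct writer *wb;
	struct completions *wb_completions;
	unsigned long dirty;
};

/*
 * Expands to a comma-separated run of designated initializers, so it can
 * be mixed with further ".field = value" entries inside the braces.  The
 * parameter must not be spelled like a field name, or the preprocessor
 * would rewrite the field designators too.
 */
#define CTL_INIT(writer_ptr)	.wb = (writer_ptr), \
				.wb_completions = &(writer_ptr)->completions

long writer_events(struct writer *w, unsigned long dirty)
{
	struct throttle_ctl tc = { CTL_INIT(w), .dirty = dirty };

	/* Consumers read through the cached pointer, not tc.wb->completions. */
	return tc.wb_completions->events;
}
```

Caching the pointer in the context is what lets a different initializer point wb_completions somewhere else without touching the consumers.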
[PATCH 04/18] writeback: move global_dirty_limit into wb_domain
This patch is a part of the series to define wb_domain which represents a domain that wb's (bdi_writeback's) belong to and are measured against each other in. This will enable IO backpressure propagation for cgroup writeback. global_dirty_limit exists to regulate the global dirty threshold which is a property of the wb_domain. This patch moves hard_dirty_limit, dirty_lock, and update_time into wb_domain. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen --- fs/fs-writeback.c| 2 +- include/linux/writeback.h| 17 ++- include/trace/events/writeback.h | 7 +++--- mm/page-writeback.c | 46 4 files changed, 44 insertions(+), 28 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 3d9b360..6232ae9 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -878,7 +878,7 @@ static long writeback_chunk_size(struct bdi_writeback *wb, pages = LONG_MAX; else { pages = min(wb->avg_write_bandwidth / 2, - global_dirty_limit / DIRTY_SCOPE); + global_wb_domain.dirty_limit / DIRTY_SCOPE); pages = min(pages, work->nr_pages); pages = round_down(pages + MIN_WRITEBACK_PAGES, MIN_WRITEBACK_PAGES); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 5af0a57e..ff627d6 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -95,6 +95,8 @@ struct writeback_control { * dirtyable memory accordingly. */ struct wb_domain { + spinlock_t lock; + /* * Scale the writeback cache size proportional to the relative * writeout speed. @@ -115,6 +117,19 @@ struct wb_domain { struct fprop_global completions; struct timer_list period_timer; /* timer for aging of completions */ unsigned long period_time; + + /* +* The dirtyable memory and dirty threshold could be suddenly +* knocked down by a large amount (eg. on the startup of KVM in a +* swapless system). 
This may throw the system into deep dirty +* exceeded state and throttle heavy/light dirtiers alike. To +* retain good responsiveness, maintain global_dirty_limit for +* tracking slowly down to the knocked down dirty threshold. +* +* Both fields are protected by ->lock. +*/ + unsigned long dirty_limit_tstamp; + unsigned long dirty_limit; }; /* @@ -153,7 +168,7 @@ void throttle_vm_writeout(gfp_t gfp_mask); bool zone_dirty_ok(struct zone *zone); int wb_domain_init(struct wb_domain *dom, gfp_t gfp); -extern unsigned long global_dirty_limit; +extern struct wb_domain global_wb_domain; /* These are exported to sysctl. */ extern int dirty_background_ratio; diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 5c9a68c..d5ac3dd 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -344,7 +344,7 @@ TRACE_EVENT(global_dirty_state, __entry->nr_written = global_page_state(NR_WRITTEN); __entry->background_thresh = background_thresh; __entry->dirty_thresh = dirty_thresh; - __entry->dirty_limit = global_dirty_limit; + __entry->dirty_limit= global_wb_domain.dirty_limit; ), TP_printk("dirty=%lu writeback=%lu unstable=%lu " @@ -446,8 +446,9 @@ TRACE_EVENT(balance_dirty_pages, unsigned long freerun = (thresh + bg_thresh) / 2; strlcpy(__entry->bdi, dev_name(bdi->dev), 32); - __entry->limit = global_dirty_limit; - __entry->setpoint = (global_dirty_limit + freerun) / 2; + __entry->limit = global_wb_domain.dirty_limit; + __entry->setpoint = (global_wb_domain.dirty_limit + + freerun) / 2; __entry->dirty = dirty; __entry->bdi_setpoint = __entry->setpoint * bdi_thresh / (thresh + 1); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 3c6ccc7..06c5d3a 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -122,9 +122,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -unsigned long global_dirty_limit; - -static struct wb_domain global_wb_domain; +struct wb_domain global_wb_domain; /* * 
Length of period for aging writeout fractions of bdis. This is an @@ -470,9 +468,15 @@ static void writeout_period(unsigned long t) int wb_domain_init(struct wb_domain *dom, gfp_t gfp) { memset(dom, 0, sizeof(*dom)); + + spin_lock_init(&dom->lock); +
[PATCH] ubifs: return err value than 0
Currently, ubifs_readpage() always returns 0, even if ubifs_bulk_read() fails. As in other file systems, the error value should be propagated instead of 0. The -ENOMEM check for a failed kmalloc() is also missing.

Signed-off-by: Sanidhya Kashyap
---
 fs/ubifs/file.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index e627c0a..28fe892 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -863,8 +863,10 @@ static int ubifs_bulk_read(struct page *page)
 		bu = &c->bu;
 	else {
 		bu = kmalloc(sizeof(struct bu_info), GFP_NOFS | __GFP_NOWARN);
-		if (!bu)
+		if (!bu) {
+			err = -ENOMEM;
 			goto out_unlock;
+		}
 
 		bu->buf = NULL;
 		allocated = 1;
@@ -887,11 +889,14 @@ out_unlock:
 
 static int ubifs_readpage(struct file *file, struct page *page)
 {
-	if (ubifs_bulk_read(page))
-		return 0;
+	int err = 0;
+
+	err = ubifs_bulk_read(page);
+	if (err)
+		return err;
 	do_readpage(page);
 	unlock_page(page);
-	return 0;
+	return err;
 }
 
 static int do_writepage(struct page *page, int len)
-- 
2.1.0
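For illustration only, the general pattern the patch is aiming at (record -ENOMEM before jumping to the cleanup label, and have the caller pass the error on) can be sketched in userspace C as below. `fill_buffer()` and `read_one()` are hypothetical names, malloc() stands in for kmalloc(), and the sketch does not model ubifs_bulk_read()'s own return convention.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns 0 on success or a negative errno value.
 * len must be at least 1.
 */
static int fill_buffer(size_t len, int simulate_alloc_failure)
{
	int err = 0;
	char *buf = simulate_alloc_failure ? NULL : malloc(len);

	if (!buf) {
		err = -ENOMEM;	/* record the reason before jumping out */
		goto out;
	}

	buf[0] = 0;	/* placeholder for the real work */
	free(buf);
out:
	return err;
}

/* Caller hands the helper's error back instead of reporting success. */
int read_one(size_t len, int simulate_alloc_failure)
{
	int err = fill_buffer(len, simulate_alloc_failure);

	if (err)
		return err;
	return 0;
}
```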
[PATCH 10/18] writeback: add dirty_throttle_control->dom
Currently all dirty throttle operations use global_wb_domain; however, cgroup writeback support requires considering per-memcg wb_domain too. This patch adds dirty_throttle_control->dom and updates functions which are directly using global_wb_domain to use it instead.

As this makes global_update_bandwidth() a misnomer, the function is renamed to domain_update_bandwidth().

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 mm/page-writeback.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1f216cf..840b8f2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -126,6 +126,9 @@ struct wb_domain global_wb_domain;
 
 /* consolidated parameters for balance_dirty_pages() and its subroutines */
 struct dirty_throttle_control {
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct wb_domain	*dom;
+#endif
 	struct bdi_writeback	*wb;
 	struct fprop_local_percpu *wb_completions;
 
@@ -140,7 +143,7 @@ struct dirty_throttle_control {
 	unsigned long		pos_ratio;
 };
 
-#define GDTC_INIT(__wb)		.wb = (__wb),				\
+#define DTC_INIT_COMMON(__wb)	.wb = (__wb),				\
 				.wb_completions = &(__wb)->completions
 
 /*
@@ -152,6 +155,14 @@ struct dirty_throttle_control {
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
+#define GDTC_INIT(__wb)		.dom = &global_wb_domain,		\
+				DTC_INIT_COMMON(__wb)
+
+static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
+{
+	return dtc->dom;
+}
+
 static void wb_min_max_ratio(struct bdi_writeback *wb,
 			     unsigned long *minp, unsigned long *maxp)
 {
@@ -181,6 +192,13 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
+#define GDTC_INIT(__wb)		DTC_INIT_COMMON(__wb)
+
+static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
+{
+	return &global_wb_domain;
+}
+
 static void wb_min_max_ratio(struct bdi_writeback *wb,
 			     unsigned long *minp, unsigned long *maxp)
 {
@@ -583,7 +601,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
  */
 static unsigned long __wb_dirty_limit(struct dirty_throttle_control *dtc)
 {
-	struct wb_domain *dom = &global_wb_domain;
+	struct wb_domain *dom = dtc_dom(dtc);
 	unsigned long dirty = dtc->dirty;
 	u64 wb_dirty;
 	long numerator, denominator;
@@ -952,7 +970,7 @@ out:
 
 static void update_dirty_limit(struct dirty_throttle_control *dtc)
 {
-	struct wb_domain *dom = &global_wb_domain;
+	struct wb_domain *dom = dtc_dom(dtc);
 	unsigned long thresh = dtc->thresh;
 	unsigned long limit = dom->dirty_limit;
 
@@ -979,10 +997,10 @@ update:
 	dom->dirty_limit = limit;
 }
 
-static void global_update_bandwidth(struct dirty_throttle_control *dtc,
+static void domain_update_bandwidth(struct dirty_throttle_control *dtc,
 				    unsigned long now)
 {
-	struct wb_domain *dom = &global_wb_domain;
+	struct wb_domain *dom = dtc_dom(dtc);
 
 	/*
 	 * check locklessly first to optimize away locking for the most time
@@ -1190,7 +1208,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *dtc,
 		goto snapshot;
 
 	if (update_ratelimit) {
-		global_update_bandwidth(dtc, now);
+		domain_update_bandwidth(dtc, now);
 		wb_update_dirty_ratelimit(dtc, dirtied, elapsed);
 	}
 	wb_update_write_bandwidth(wb, elapsed, written);
-- 
2.1.0
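The accessor pattern used by dtc_dom() above can be sketched in a few lines of userspace C: callers go through one helper, and a build-time switch decides whether the helper reads a per-context pointer or falls back to the single global object. The switch `SKETCH_PER_GROUP_DOMAINS` and the names `domain`, `throttle_ctl`, `ctl_dom()` and `limit_for()` are hypothetical; this is an illustration under those assumptions, not the kernel implementation.

```c
/*
 * Define SKETCH_PER_GROUP_DOMAINS to get the per-context domain pointer;
 * the macro name and all types here are hypothetical.
 */
struct domain { unsigned long dirty_limit; };

static struct domain global_domain = { .dirty_limit = 1000 };

struct throttle_ctl {
#ifdef SKETCH_PER_GROUP_DOMAINS
	struct domain *dom;
#endif
	int id;
};

#define CTL_INIT_COMMON(ctl_id)	.id = (ctl_id)

#ifdef SKETCH_PER_GROUP_DOMAINS

#define GLOBAL_CTL_INIT(ctl_id)	.dom = &global_domain, \
				CTL_INIT_COMMON(ctl_id)

static struct domain *ctl_dom(struct throttle_ctl *tc)
{
	return tc->dom;
}

#else

#define GLOBAL_CTL_INIT(ctl_id)	CTL_INIT_COMMON(ctl_id)

static struct domain *ctl_dom(struct throttle_ctl *tc)
{
	(void)tc;	/* only one domain exists in this configuration */
	return &global_domain;
}

#endif

/* Callers never name global_domain directly; they go through ctl_dom(). */
unsigned long limit_for(int id)
{
	struct throttle_ctl tc = { GLOBAL_CTL_INIT(id) };

	return ctl_dom(&tc)->dirty_limit;
}
```

With the switch off, the struct carries no extra pointer and the accessor compiles down to the address of the global, so the indirection costs nothing in that configuration.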
[PATCH 12/18] writeback: separate out domain_dirty_limits()
global_dirty_limits() calculates thresh and bg_thresh (confusingly called *pdirty and *pbackground in the function) assuming global_wb_domain; however, cgroup writeback support requires considering per-memcg wb_domain too.

This patch separates domain_dirty_limits(), which takes dirty_throttle_control, out of global_dirty_limits(). As thresh and bg_thresh calculation needs the amount of dirtyable memory in the domain, dirty_throttle_control->avail is added. The new function calculates the two thresholds and stores them directly in the dirty_throttle_control.

Also, memcg domains can't follow vm_dirty_bytes and dirty_background_bytes settings directly. If those are set and domain_dirty_limits() is invoked for a !global domain, the settings are translated to ratios by scaling them against globally available memory. dirty_throttle_control->gdtc is added to enable this when CONFIG_CGROUP_WRITEBACK is enabled.

global_dirty_limits() is now a thin wrapper around domain_dirty_limits() and balance_dirty_pages() is updated to use the new function too.

This patch doesn't introduce any behavioral changes.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen --- mm/page-writeback.c | 111 1 file changed, 86 insertions(+), 25 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index e6c7572..7e9922f 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -128,10 +128,12 @@ struct wb_domain global_wb_domain; struct dirty_throttle_control { #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain*dom; + struct dirty_throttle_control *gdtc;/* only set in memcg dtc's */ #endif struct bdi_writeback*wb; struct fprop_local_percpu *wb_completions; + unsigned long avail; /* dirtyable */ unsigned long dirty; /* file_dirty + write + nfs */ unsigned long thresh; /* dirty threshold */ unsigned long bg_thresh; /* dirty background threshold */ @@ -157,12 +159,18 @@ struct dirty_throttle_control { #define GDTC_INIT(__wb).dom = _wb_domain, \ DTC_INIT_COMMON(__wb) +#define GDTC_INIT_NO_WB.dom = _wb_domain static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) { return dtc->dom; } +static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc) +{ + return mdtc->gdtc; +} + static void wb_min_max_ratio(struct bdi_writeback *wb, unsigned long *minp, unsigned long *maxp) { @@ -193,12 +201,18 @@ static void wb_min_max_ratio(struct bdi_writeback *wb, #else /* CONFIG_CGROUP_WRITEBACK */ #define GDTC_INIT(__wb)DTC_INIT_COMMON(__wb) +#define GDTC_INIT_NO_WB static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) { return _wb_domain; } +static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc) +{ + return NULL; +} + static void wb_min_max_ratio(struct bdi_writeback *wb, unsigned long *minp, unsigned long *maxp) { @@ -303,42 +317,88 @@ static unsigned long global_dirtyable_memory(void) return x + 1; /* Ensure that we never return 0 */ } -/* - * global_dirty_limits - background-writeback and dirty-throttling thresholds +/** + * domain_dirty_limits - calculate thresh and bg_thresh 
for a wb_domain + * @dtc: dirty_throttle_control of interest * - * Calculate the dirty thresholds based on sysctl parameters - * - vm.dirty_background_ratio or vm.dirty_background_bytes - * - vm.dirty_ratio or vm.dirty_bytes - * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and + * Calculate @dtc->thresh and ->bg_thresh considering + * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller + * must ensure that @dtc->avail is set before calling this function. The + * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and * real-time tasks. */ -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) -{ - const unsigned long available_memory = global_dirtyable_memory(); - unsigned long background; - unsigned long dirty; +static void domain_dirty_limits(struct dirty_throttle_control *dtc) +{ + const unsigned long available_memory = dtc->avail; + struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc); + unsigned long bytes = vm_dirty_bytes; + unsigned long bg_bytes = dirty_background_bytes; + unsigned long ratio = vm_dirty_ratio; + unsigned long bg_ratio = dirty_background_ratio; + unsigned long thresh; + unsigned long bg_thresh; struct task_struct *tsk; - if (vm_dirty_bytes) - dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE); + /* gdtc is
[PATCH 08/18] writeback: add dirty_throttle_control->pos_ratio
wb_position_ratio() is used to calculate pos_ratio, which is used for two purposes. wb_update_dirty_ratelimit() uses it to adjust wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to immediately adjust dirty_ratelimit right before applying it to determine pause duration. While wb_update_dirty_ratelimit() is separately rate limited from balance_dirty_pages(), on the run where the ratelimit is updated, we end up calculating pos_ratio twice with the same parameters. This patch adds dirty_throttle_control->pos_ratio. balance_dirty_pages() calculates it once per run and wb_update_dirty_ratelimit() uses the value stored in dirty_throttle_control. This removes the duplicate calculation and also will help implementing memcg wb_domain. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen --- mm/page-writeback.c | 36 +--- 1 file changed, 21 insertions(+), 15 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index a4b6dab..ac2d7b1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -135,6 +135,8 @@ struct dirty_throttle_control { unsigned long wb_dirty; /* per-wb counterparts */ unsigned long wb_thresh; unsigned long wb_bg_thresh; + + unsigned long pos_ratio; }; #define GDTC_INIT(__wb).wb = (__wb) @@ -717,7 +719,7 @@ static long long pos_ratio_polynom(unsigned long setpoint, * card's wb_dirty may rush to many times higher than wb_setpoint. 
* - the wb dirty thresh drops quickly due to change of JBOD workload */ -static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) +static void wb_position_ratio(struct dirty_throttle_control *dtc) { struct bdi_writeback *wb = dtc->wb; unsigned long write_bw = wb->avg_write_bandwidth; @@ -731,8 +733,10 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) long long pos_ratio;/* for scaling up/down the rate limit */ long x; + dtc->pos_ratio = 0; + if (unlikely(dtc->dirty >= limit)) - return 0; + return; /* * global setpoint @@ -770,18 +774,20 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { long long wb_pos_ratio; - if (dtc->wb_dirty < 8) - return min_t(long long, pos_ratio * 2, -2 << RATELIMIT_CALC_SHIFT); + if (dtc->wb_dirty < 8) { + dtc->pos_ratio = min_t(long long, pos_ratio * 2, + 2 << RATELIMIT_CALC_SHIFT); + return; + } if (dtc->wb_dirty >= wb_thresh) - return 0; + return; wb_setpoint = dirty_freerun_ceiling(wb_thresh, dtc->wb_bg_thresh); if (wb_setpoint == 0 || wb_setpoint == wb_thresh) - return 0; + return; wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty, wb_thresh); @@ -807,7 +813,8 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) * is 2. We might want to tweak this if we observe the control * system is too slow to adapt. 
*/ - return min(pos_ratio, wb_pos_ratio); + dtc->pos_ratio = min(pos_ratio, wb_pos_ratio); + return; } /* @@ -888,7 +895,7 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc) pos_ratio *= 8; } - return pos_ratio; + dtc->pos_ratio = pos_ratio; } static void wb_update_write_bandwidth(struct bdi_writeback *wb, @@ -1009,7 +1016,6 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, unsigned long dirty_rate; unsigned long task_ratelimit; unsigned long balanced_dirty_ratelimit; - unsigned long pos_ratio; unsigned long step; unsigned long x; @@ -1019,12 +1025,11 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, */ dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed; - pos_ratio = wb_position_ratio(dtc); /* * task_ratelimit reflects each dd's dirty rate for the past 200ms. */ task_ratelimit = (u64)dirty_ratelimit * - pos_ratio >> RATELIMIT_CALC_SHIFT; + dtc->pos_ratio >> RATELIMIT_CALC_SHIFT; task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */
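A small userspace sketch of the "compute once, store in the context, let both consumers read it" idea from this patch is given below. It is only an illustration: the linear ramp is a placeholder and NOT the kernel's polynomial, `SKETCH_SHIFT` is an assumed fixed-point scale, and `throttle_ctl`, `compute_pos_ratio()`, `smoothed_rate()`, `immediate_rate()` and `throttle_step()` are hypothetical names.

```c
#include <stdint.h>

#define SKETCH_SHIFT 10	/* fixed-point scale: 1.0 == 1 << SKETCH_SHIFT (assumed) */

struct throttle_ctl {
	unsigned long dirty;
	unsigned long limit;
	unsigned long pos_ratio;	/* filled in once per run */
};

static int pos_ratio_calls;	/* counts how often the ratio is computed */

/* Placeholder linear ramp, NOT the kernel's polynomial. */
static void compute_pos_ratio(struct throttle_ctl *tc)
{
	pos_ratio_calls++;
	tc->pos_ratio = 0;
	if (tc->dirty >= tc->limit)
		return;
	tc->pos_ratio = (unsigned long)(((uint64_t)(tc->limit - tc->dirty)
					 << SKETCH_SHIFT) / tc->limit);
}

/* Both consumers read the stored value instead of recomputing it. */
static unsigned long smoothed_rate(const struct throttle_ctl *tc,
				   unsigned long rate)
{
	/* the +1 mirrors the task_ratelimit++ ramp-up nudge in the diff */
	return (unsigned long)(((uint64_t)rate * tc->pos_ratio)
			       >> SKETCH_SHIFT) + 1;
}

static unsigned long immediate_rate(const struct throttle_ctl *tc,
				    unsigned long rate)
{
	return (unsigned long)(((uint64_t)rate * tc->pos_ratio)
			       >> SKETCH_SHIFT);
}

unsigned long throttle_step(unsigned long dirty, unsigned long limit,
			    unsigned long rate, unsigned long *smoothed)
{
	struct throttle_ctl tc = { .dirty = dirty, .limit = limit };

	compute_pos_ratio(&tc);		/* once per run */
	*smoothed = smoothed_rate(&tc, rate);
	return immediate_rate(&tc, rate);
}
```

Turning the calculation into a void function that writes into the context, as the diff does, is what makes the early-return paths safe: the field is zeroed first, so every exit leaves a defined value behind.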
[PATCH 17/18] writeback: implement memcg writeback domain based throttling
While cgroup writeback support now connects memcg and blkcg so that writeback IOs are properly attributed and controlled, the IO back pressure propagation mechanism implemented in balance_dirty_pages() and its subroutines wasn't aware of cgroup writeback.

Processes belonging to a memcg may have access to only a subset of the total memory available in the system. Not factoring this into dirty throttling rendered it completely ineffective for processes under memcg limits, and memcg ended up building a separate ad-hoc degenerate mechanism directly into vmscan code to limit page dirtying.

The previous patches updated balance_dirty_pages() and its subroutines so that they can deal with multiple wb_domain's (writeback domains) and defined per-memcg wb_domain. Processes belonging to a non-root memcg are bound to two wb_domains, global wb_domain and memcg wb_domain, and should be throttled according to IO pressures from both domains. This patch updates dirty throttling code so that it repeats similar calculations for the two domains - the differences between the two are few and minor - and applies the lower of the two sets of resulting constraints.

wb_over_bg_thresh(), which controls when background writeback terminates, is also updated to consider both global and memcg wb_domains. It returns true if dirty is over bg_thresh for either domain.

This makes the dirty throttling mechanism operational for memcg domains, including writeback-bandwidth-proportional dirty page distribution inside them, but the ad-hoc memcg throttling mechanism in vmscan is still in place. The next patch will rip it out.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Wu Fengguang Cc: Greg Thelen --- include/linux/memcontrol.h | 9 +++ mm/memcontrol.c| 43 mm/page-writeback.c| 158 ++--- 3 files changed, 188 insertions(+), 22 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e3177be..c3eb19e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -392,6 +392,8 @@ enum { struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg); struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); +void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail, +unsigned long *pdirty, unsigned long *pwriteback); #else /* CONFIG_CGROUP_WRITEBACK */ @@ -400,6 +402,13 @@ static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) return NULL; } +static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, + unsigned long *pavail, + unsigned long *pdirty, + unsigned long *pwriteback) +{ +} + #endif /* CONFIG_CGROUP_WRITEBACK */ struct sock; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 108acfc..d76f61c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4111,6 +4111,49 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) return >cgwb_domain; } +/** + * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg + * @wb: bdi_writeback in question + * @pavail: out parameter for number of available pages + * @pdirty: out parameter for number of dirty pages + * @pwriteback: out parameter for number of pages under writeback + * + * Determine the numbers of available, dirty, and writeback pages in @wb's + * memcg. Dirty and writeback are self-explanatory. Available is a bit + * more involved. + * + * A memcg's headroom is "min(max, high) - used". The available memory is + * calculated as the lowest headroom of itself and the ancestors plus the + * number of pages already being used for file pages. 
Note that this + * doesn't consider the actual amount of available memory in the system. + * The caller should further cap *@pavail accordingly. + */ +void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail, +unsigned long *pdirty, unsigned long *pwriteback) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); + struct mem_cgroup *parent; + unsigned long head_room = PAGE_COUNTER_MAX; + unsigned long file_pages; + + *pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY); + + /* this should eventually include NR_UNSTABLE_NFS */ + *pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK); + + file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) | + (1 << LRU_ACTIVE_FILE)); + while ((parent = parent_mem_cgroup(memcg))) { + unsigned long ceiling = min(memcg->memory.limit, memcg->high); + unsigned long used = page_counter_read(>memory); + + head_room = min(head_room, ceiling - min(ceiling, used)); + memcg = parent; + } +
[PATCH 13/18] writeback: move over_bground_thresh() to mm/page-writeback.c
and rename it to wb_over_bg_thresh(). The function is closely tied to the dirty throttling mechanism implemented in page-writeback.c. This relocation will allow future updates necessary for cgroup writeback support.

While at it, add function comment.

This is pure reorganization and doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 fs/fs-writeback.c         | 20 ++------------------
 include/linux/writeback.h |  1 +
 mm/page-writeback.c       | 23 +++++++++++++++++++++++
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6232ae9..683bd92 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1062,22 +1062,6 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 	return nr_pages - work.nr_pages;
 }
 
-static bool over_bground_thresh(struct bdi_writeback *wb)
-{
-	unsigned long background_thresh, dirty_thresh;
-
-	global_dirty_limits(&background_thresh, &dirty_thresh);
-
-	if (global_page_state(NR_FILE_DIRTY) +
-	    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
-		return true;
-
-	if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
-		return true;
-
-	return false;
-}
-
 /*
  * Explicit flushing or periodic writeback of "old" data.
  *
@@ -1127,7 +1111,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-		if (work->for_background && !over_bground_thresh(wb))
+		if (work->for_background && !wb_over_bg_thresh(wb))
 			break;
 
 		/*
@@ -1218,7 +1202,7 @@ static unsigned long get_nr_dirty_pages(void)
 
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
-	if (over_bground_thresh(wb)) {
+	if (wb_over_bg_thresh(wb)) {
 
 		struct wb_writeback_work work = {
 			.nr_pages	= LONG_MAX,
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index ff627d6..fa6c3b4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -204,6 +204,7 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty);
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
+bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 				void *data);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e9922f..99f8d02 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1740,6 +1740,29 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
 
+/**
+ * wb_over_bg_thresh - does @wb need to be written back?
+ * @wb: bdi_writeback of interest
+ *
+ * Determines whether background writeback should keep writing @wb or it's
+ * clean enough. Returns %true if writeback should continue.
+ */
+bool wb_over_bg_thresh(struct bdi_writeback *wb)
+{
+	unsigned long background_thresh, dirty_thresh;
+
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+
+	if (global_page_state(NR_FILE_DIRTY) +
+	    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+		return true;
+
+	if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
+		return true;
+
+	return false;
+}
+
 void throttle_vm_writeout(gfp_t gfp_mask)
 {
 	unsigned long background_thresh;
-- 
2.1.0
[PATCH 18/18] mm: vmscan: remove memcg stalling on writeback pages during direct reclaim
Because writeback wasn't cgroup aware before, the usual dirty throttling mechanism in balance_dirty_pages() didn't work for processes under memcg limit. The writeback path didn't know how much memory is available or how fast the dirty pages are being written out for a given memcg and balance_dirty_pages() didn't have any measure of IO back pressure for the memcg.

To work around the issue, memcg implemented an ad-hoc dirty throttling mechanism in the direct reclaim path by stalling on pages under writeback which are encountered during direct reclaim scan. This is rather ugly and crude - it has none of the configurability, fairness, or bandwidth-proportional distribution of the normal path.

The previous patches implemented proper memcg aware dirty throttling and the ad-hoc mechanism is no longer necessary. Remove it.

Note: I removed the parts which seemed obvious and it behaves fine while testing but my understanding of this code path is rudimentary and it's quite possible that I got something wrong. Please let me know if I got something wrong or more global_reclaim() sites should be updated.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
Cc: Vladimir Davydov
---
 mm/vmscan.c | 109 ++--
 1 file changed, 33 insertions(+), 76 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f8d3c0..d084c95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -929,53 +929,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			nr_congested++;
 
 		/*
-		 * If a page at the tail of the LRU is under writeback, there
-		 * are three cases to consider.
-		 *
-		 * 1) If reclaim is encountering an excessive number of pages
-		 *    under writeback and this page is both under writeback and
-		 *    PageReclaim then it indicates that pages are being queued
-		 *    for IO but are being recycled through the LRU before the
-		 *    IO can complete. Waiting on the page itself risks an
-		 *    indefinite stall if it is impossible to writeback the
-		 *    page due to IO error or disconnected storage so instead
-		 *    note that the LRU is being scanned too quickly and the
-		 *    caller can stall after page list has been processed.
-		 *
-		 * 2) Global reclaim encounters a page, memcg encounters a
-		 *    page that is not marked for immediate reclaim or
-		 *    the caller does not have __GFP_IO. In this case mark
-		 *    the page for immediate reclaim and continue scanning.
-		 *
-		 *    __GFP_IO is checked because a loop driver thread might
-		 *    enter reclaim, and deadlock if it waits on a page for
-		 *    which it is needed to do the write (loop masks off
-		 *    __GFP_IO|__GFP_FS for this reason); but more thought
-		 *    would probably show more reasons.
-		 *
-		 *    Don't require __GFP_FS, since we're not going into the
-		 *    FS, just waiting on its writeback completion. Worryingly,
-		 *    ext4 gfs2 and xfs allocate pages with
-		 *    grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
-		 *    may_enter_fs here is liable to OOM on them.
-		 *
-		 * 3) memcg encounters a page that is not already marked
-		 *    PageReclaim. memcg does not have any dirty pages
-		 *    throttling so we could easily OOM just because too many
-		 *    pages are in writeback and there is nothing else to
-		 *    reclaim. Wait for the writeback to complete.
+		 * A page at the tail of the LRU is under writeback. If
+		 * reclaim is encountering an excessive number of pages
+		 * under writeback and this page is both under writeback
+		 * and PageReclaim then it indicates that pages are being
+		 * queued for IO but are being recycled through the LRU
+		 * before the IO can complete. Waiting on the page itself
+		 * risks an indefinite stall if it is impossible to
+		 * writeback the page due to IO error or disconnected
+		 * storage so instead note that the LRU is being scanned
+		 * too quickly and the caller can stall after page list has
+		 * been processed.
 		 */
 		if (PageWriteback(page)) {
-			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
 			    test_bit(ZONE_WRITEBACK, &zone->flags)) {
[PATCH 14/18] writeback: update wb_over_bg_thresh() to use wb_domain aware operations
wb_over_bg_thresh() currently uses global_dirty_limits() and
wb_dirty_limit() both of which are wrappers around operations which
take dirty_throttle_control. For cgroup writeback support, the
function will be updated to also consider memcg wb_domains which
requires the context information carried in dirty_throttle_control.

This patch updates wb_over_bg_thresh() so that it uses the underlying
wb_domain aware operations directly and builds the global
dirty_throttle_control in the process.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 mm/page-writeback.c | 17
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 99f8d02..2626e6c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1749,15 +1749,22 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
  */
 bool wb_over_bg_thresh(struct bdi_writeback *wb)
 {
-	unsigned long background_thresh, dirty_thresh;
+	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
+	struct dirty_throttle_control * const gdtc = &gdtc_stor;
 
-	global_dirty_limits(&background_thresh, &dirty_thresh);
+	/*
+	 * Similar to balance_dirty_pages() but ignores pages being written
+	 * as we're trying to decide whether to put more under writeback.
+	 */
+	gdtc->avail = global_dirtyable_memory();
+	gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
+		      global_page_state(NR_UNSTABLE_NFS);
+	domain_dirty_limits(gdtc);
 
-	if (global_page_state(NR_FILE_DIRTY) +
-	    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+	if (gdtc->dirty > gdtc->bg_thresh)
 		return true;
 
-	if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
+	if (wb_stat(wb, WB_RECLAIMABLE) > __wb_dirty_limit(gdtc))
 		return true;
 
 	return false;
--
2.1.0
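As a rough illustration of the two checks wb_over_bg_thresh() makes after this change, here is a user-space sketch. It is not kernel code: struct dtc_model, wb_share() and over_bg_thresh() are hypothetical names, and the plain percentage split in wb_share() is only an assumed stand-in for whatever __wb_dirty_limit() computes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct dirty_throttle_control. */
struct dtc_model {
	unsigned long dirty;      /* dirty pages in the whole domain */
	unsigned long bg_thresh;  /* background threshold of the domain */
	unsigned long wb_dirty;   /* reclaimable pages owned by this wb */
	unsigned int wb_frac_pct; /* assumed share of the domain, 0..100 */
};

/* Assumption: the wb's own limit is a plain percentage of bg_thresh. */
static unsigned long wb_share(const struct dtc_model *dtc)
{
	return dtc->bg_thresh * dtc->wb_frac_pct / 100;
}

/* Mirrors the shape of the patched function: domain check, then wb check. */
static bool over_bg_thresh(const struct dtc_model *dtc)
{
	if (dtc->dirty > dtc->bg_thresh)
		return true;

	if (dtc->wb_dirty > wb_share(dtc))
		return true;

	return false;
}
```

The point of carrying everything in one control structure is that the same two checks could later be repeated for a second (memcg) domain without changing their shape.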
[PATCH 02/18] writeback: reorganize [__]wb_update_bandwidth()
__wb_update_bandwidth() is called from two places -
mm/page-writeback.c::balance_dirty_pages() and
fs/fs-writeback.c::wb_writeback(). The latter updates only the write
bandwidth while the former also deals with the dirty ratelimit. The
two callsites are distinguished by whether @thresh parameter is zero
or not, which is cryptic. In addition, the two files define their own
different versions of wb_update_bandwidth() on top of
__wb_update_bandwidth(), which is confusing to say the least. This
patch cleans up [__]wb_update_bandwidth() in the following ways.

* __wb_update_bandwidth() now takes explicit @update_ratelimit
  parameter to gate dirty ratelimit handling.

* mm/page-writeback.c::wb_update_bandwidth() is flattened into its
  caller - balance_dirty_pages().

* fs/fs-writeback.c::wb_update_bandwidth() is moved to
  mm/page-writeback.c and __wb_update_bandwidth() is made static.

* While at it, add a lockdep assertion to __wb_update_bandwidth().

Except for the lockdep addition, this is pure reorganization and
doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 fs/fs-writeback.c         | 10
 include/linux/writeback.h |  9
 mm/page-writeback.c       | 45
 3 files changed, 23 insertions(+), 41 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 890cff1..3d9b360 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1079,16 +1079,6 @@ static bool over_bground_thresh(struct bdi_writeback *wb)
 }
 
 /*
- * Called under wb->list_lock. If there are multiple wb per bdi,
- * only the flusher working on the first wb should do it.
- */
-static void wb_update_bandwidth(struct bdi_writeback *wb,
-				unsigned long start_time)
-{
-	__wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time);
-}
-
-/*
  * Explicit flushing or periodic writeback of "old" data.
  *
  * Define "old": the first time one of an inode's pages is dirtied, we mark the
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 75349bb..82e0e39 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -154,14 +154,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int,
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty);
 
-void __wb_update_bandwidth(struct bdi_writeback *wb,
-			   unsigned long thresh,
-			   unsigned long bg_thresh,
-			   unsigned long dirty,
-			   unsigned long bdi_thresh,
-			   unsigned long bdi_dirty,
-			   unsigned long start_time);
-
+void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fd441ea..d9ebabe 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1160,19 +1160,22 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
 	trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
 }
 
-void __wb_update_bandwidth(struct bdi_writeback *wb,
-			   unsigned long thresh,
-			   unsigned long bg_thresh,
-			   unsigned long dirty,
-			   unsigned long wb_thresh,
-			   unsigned long wb_dirty,
-			   unsigned long start_time)
+static void __wb_update_bandwidth(struct bdi_writeback *wb,
+				  unsigned long thresh,
+				  unsigned long bg_thresh,
+				  unsigned long dirty,
+				  unsigned long wb_thresh,
+				  unsigned long wb_dirty,
+				  unsigned long start_time,
+				  bool update_ratelimit)
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - wb->bw_time_stamp;
 	unsigned long dirtied;
 	unsigned long written;
 
+	lockdep_assert_held(&wb->list_lock);
+
 	/*
 	 * rate-limit, only update once every 200ms.
 	 */
@@ -1189,7 +1192,7 @@ void __wb_update_bandwidth(struct bdi_writeback *wb,
 	if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh) {
+	if (update_ratelimit) {
 		global_update_bandwidth(thresh, dirty, now);
 		wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty,
 					  wb_thresh, wb_dirty,
@@ -1203,20 +1206,9 @@ snapshot:
 	wb->bw_time_stamp = now;
 }
 
-static void wb_update_bandwidth(struct bdi_writeback *wb,
-
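The core of the cleanup described above is replacing a "zero means skip" sentinel with an explicit flag. A minimal user-space sketch of that idea follows; struct bw_model and both function names are hypothetical, and the counters merely stand in for the real bandwidth and ratelimit updates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: counts how often each kind of update ran. */
struct bw_model {
	unsigned long write_bw_updates;  /* write bandwidth estimations */
	unsigned long ratelimit_updates; /* dirty ratelimit recalculations */
};

/*
 * Core routine. Ratelimit handling is gated by an explicit flag rather
 * than by "thresh != 0", so a caller with a legitimately zero threshold
 * is no longer misread as "skip the ratelimit".
 */
static void update_bandwidth_core(struct bw_model *m, unsigned long thresh,
				  bool update_ratelimit)
{
	(void)thresh; /* the model does not compute with the threshold */

	if (update_ratelimit)
		m->ratelimit_updates++;

	m->write_bw_updates++;
}

/* Flusher-side wrapper: only the write bandwidth is of interest. */
static void update_bandwidth(struct bw_model *m)
{
	update_bandwidth_core(m, 0, false);
}
```

With this shape, the throttling path passes true and its real thresholds, while the flusher path goes through the wrapper and never touches the ratelimit.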
[PATCH 11/18] writeback: make __wb_writeout_inc() and hard_dirty_limit() take wb_domain as a parameter
Currently __wb_writeout_inc() and hard_dirty_limit() assume
global_wb_domain; however, cgroup writeback support requires
considering per-memcg wb_domain too.

This patch separates out domain-specific part of __wb_writeout_inc()
into wb_domain_writeout_inc() which takes wb_domain as a parameter and
adds the parameter to hard_dirty_limit(). This will allow these two
functions to handle per-memcg wb_domains.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 mm/page-writeback.c | 37
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 840b8f2..e6c7572 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -445,17 +445,12 @@ static unsigned long wp_next_time(unsigned long cur_time)
 	return cur_time;
 }
 
-/*
- * Increment the wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
- */
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static void wb_domain_writeout_inc(struct wb_domain *dom,
+				   struct fprop_local_percpu *completions,
+				   unsigned int max_prop_frac)
 {
-	struct wb_domain *dom = &global_wb_domain;
-
-	__inc_wb_stat(wb, WB_WRITTEN);
-	__fprop_inc_percpu_max(&dom->completions, &wb->completions,
-			       wb->bdi->max_prop_frac);
+	__fprop_inc_percpu_max(&dom->completions, completions,
+			       max_prop_frac);
 	/* First event after period switching was turned off? */
 	if (!unlikely(dom->period_time)) {
 		/*
@@ -469,6 +464,17 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
 	}
 }
 
+/*
+ * Increment @wb's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ */
+static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+{
+	__inc_wb_stat(wb, WB_WRITTEN);
+	wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
+			       wb->bdi->max_prop_frac);
+}
+
 void wb_writeout_inc(struct bdi_writeback *wb)
 {
 	unsigned long flags;
@@ -571,10 +577,9 @@ static unsigned long dirty_freerun_ceiling(unsigned long thresh,
 	return (thresh + bg_thresh) / 2;
 }
 
-static unsigned long hard_dirty_limit(unsigned long thresh)
+static unsigned long hard_dirty_limit(struct wb_domain *dom,
+				      unsigned long thresh)
 {
-	struct wb_domain *dom = &global_wb_domain;
-
 	return max(thresh, dom->dirty_limit);
 }
 
@@ -744,7 +749,7 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
 	struct bdi_writeback *wb = dtc->wb;
 	unsigned long write_bw = wb->avg_write_bandwidth;
 	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
-	unsigned long limit = hard_dirty_limit(dtc->thresh);
+	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
 	unsigned long wb_thresh = dtc->wb_thresh;
 	unsigned long x_intercept;
 	unsigned long setpoint;		/* dirty pages' target balance point */
@@ -1029,7 +1034,7 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
 	struct bdi_writeback *wb = dtc->wb;
 	unsigned long dirty = dtc->dirty;
 	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
-	unsigned long limit = hard_dirty_limit(dtc->thresh);
+	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
 	unsigned long setpoint = (freerun + limit) / 2;
 	unsigned long write_bw = wb->avg_write_bandwidth;
 	unsigned long dirty_ratelimit = wb->dirty_ratelimit;
@@ -1681,7 +1686,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
 	for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
-		dirty_thresh = hard_dirty_limit(dirty_thresh);
+		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page
--
2.1.0
[PATCH 16/18] writeback: reset wb_domain->dirty_limit[_tstamp] when memcg domain size changes
The amount of available memory to a memcg wb_domain can change as
memcg configuration changes. A domain's ->dirty_limit exists to
smooth out sudden drops in dirty threshold; however, when a domain's
size actually drops significantly, it hinders the dirty throttling
from adjusting to the new configuration leading to unexpected
behaviors including unnecessary OOM kills.

This patch resolves the issue by adding wb_domain_size_changed() which
resets ->dirty_limit[_tstamp] and making memcg call it on
configuration changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 include/linux/writeback.h | 20
 mm/memcontrol.c           | 12
 2 files changed, 32 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e421625..9ae0648 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -132,6 +132,26 @@ struct wb_domain {
 	unsigned long dirty_limit;
 };
 
+/**
+ * wb_domain_size_changed - memory available to a wb_domain has changed
+ * @dom: wb_domain of interest
+ *
+ * This function should be called when the amount of memory available to
+ * @dom has changed. It resets @dom's dirty limit parameters to prevent
+ * the past values which don't match the current configuration from skewing
+ * dirty throttling. Without this, when memory size of a wb_domain is
+ * greatly reduced, the dirty throttling logic may allow too many pages to
+ * be dirtied leading to consecutive unnecessary OOMs and may get stuck in
+ * that situation.
+ */
+static inline void wb_domain_size_changed(struct wb_domain *dom)
+{
+	spin_lock(&dom->lock);
+	dom->dirty_limit_tstamp = jiffies;
+	dom->dirty_limit = 0;
+	spin_unlock(&dom->lock);
+}
+
 /*
  * fs/fs-writeback.c
  */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2a74cf3..108acfc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4096,6 +4096,11 @@ static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
 	wb_domain_exit(&memcg->cgwb_domain);
 }
 
+static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
+{
+	wb_domain_size_changed(&memcg->cgwb_domain);
+}
+
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
@@ -4117,6 +4122,10 @@ static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
 {
 }
 
+static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
+{
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /*
@@ -4715,6 +4724,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	memcg->low = 0;
 	memcg->high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	memcg_wb_domain_size_changed(memcg);
 }
 
 #ifdef CONFIG_MMU
@@ -5347,6 +5357,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 
 	memcg->high = high;
 
+	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
 
@@ -5379,6 +5390,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
+	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
 
--
2.1.0
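To see why a stale ->dirty_limit hurts after a domain shrinks, here is a small user-space model. It relies only on what the series shows elsewhere, namely that hard_dirty_limit() returns max(thresh, dom->dirty_limit). The struct and function names are hypothetical and the spinlock is left out.

```c
#include <assert.h>

/* Hypothetical, lock-free model of the two fields being reset. */
struct wb_domain_model {
	unsigned long dirty_limit;        /* smoothed limit, lags behind thresh */
	unsigned long dirty_limit_tstamp; /* time of the last limit update */
};

/* Same rule as hard_dirty_limit(): the larger of thresh and dirty_limit. */
static unsigned long hard_limit(const struct wb_domain_model *dom,
				unsigned long thresh)
{
	return thresh > dom->dirty_limit ? thresh : dom->dirty_limit;
}

/* Model of wb_domain_size_changed(): forget the smoothed history. */
static void domain_size_changed(struct wb_domain_model *dom,
				unsigned long now)
{
	dom->dirty_limit_tstamp = now;
	dom->dirty_limit = 0;
}
```

If the domain used to justify a limit of 1000 pages and is then cut down so that the threshold is 100, hard_limit() keeps answering 1000 until the smoothed value decays. After domain_size_changed() it answers 100 immediately.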
[PATCH 15/18] writeback: implement memcg wb_domain
Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportionately sized portion of the memory in the
whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg. IOW, a wb will belong to two writeback domains - the global
and memcg domains.

The previous patches laid the groundwork to support the two wb_domains
and this patch implements memcg wb_domain. memcg->cgwb_domain is
initialized on css online and destroyed on css release,
wb->memcg_completions is added, and __wb_writeout_inc() is updated to
increment completions against both global and memcg wb_domains.

The following patches will update balance_dirty_pages() and its
subroutines to actually consider memcg wb_domain for throttling.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 include/linux/backing-dev-defs.h |  1
 include/linux/memcontrol.h       | 12
 include/linux/writeback.h        |  3
 mm/backing-dev.c                 |  9
 mm/memcontrol.c                  | 39
 mm/page-writeback.c              | 25
 6 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 97a92fa..8d470b7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -118,6 +118,7 @@ struct bdi_writeback {
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct percpu_ref refcnt;	/* used only for !root wb's */
+	struct fprop_local_percpu memcg_completions;
 	struct cgroup_subsys_state *memcg_css; /* the associated memcg */
 	struct cgroup_subsys_state *blkcg_css; /* and blkcg */
 	struct list_head memcg_node;	/* anchored at memcg->cgwb_list */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 662a953..e3177be 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -389,8 +389,18 @@ enum {
 };
 
 #ifdef CONFIG_CGROUP_WRITEBACK
+
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
-#endif
+struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
+{
+	return NULL;
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fa6c3b4..e421625 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -167,6 +167,9 @@ static inline void laptop_sync_completion(void) { }
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
+#ifdef CONFIG_CGROUP_WRITEBACK
+void wb_domain_exit(struct wb_domain *dom);
+#endif
 
 extern struct wb_domain global_wb_domain;
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 331e4d7..8828edf 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -483,6 +483,7 @@ static void cgwb_release_workfn(struct work_struct *work)
 
 	css_put(wb->blkcg_css);
 	wb_congested_put(wb->congested);
+	fprop_local_destroy_percpu(&wb->memcg_completions);
 	percpu_ref_exit(&wb->refcnt);
 	wb_exit(wb);
 	kfree_rcu(wb, rcu);
@@ -549,9 +550,13 @@ static int cgwb_create(struct backing_dev_info *bdi,
 	if (ret)
 		goto err_wb_exit;
 
+	ret = fprop_local_init_percpu(&wb->memcg_completions, gfp);
+	if (ret)
+		goto err_ref_exit;
+
 	wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp);
 	if (!wb->congested)
-		goto err_ref_exit;
+		goto err_fprop_exit;
 
 	wb->memcg_css = memcg_css;
 	wb->blkcg_css = blkcg_css;
@@ -588,6 +593,8 @@ static int cgwb_create(struct backing_dev_info *bdi,
 
 err_put_congested:
 	wb_congested_put(wb->congested);
+err_fprop_exit:
+	fprop_local_destroy_percpu(&wb->memcg_completions);
 err_ref_exit:
 	percpu_ref_exit(&wb->refcnt);
 err_wb_exit:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab483e9..2a74cf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -344,6 +344,7 @@ struct mem_cgroup {
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct list_head cgwb_list;
+	struct wb_domain cgwb_domain;
 #endif
 
 	/* List of events which userspace want to receive */
@@ -4085,6 +4086,37 @@ struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg)
 	return &memcg->cgwb_list;
 }
 
+static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
+{
+	return wb_domain_init(&memcg->cgwb_domain, gfp);
+}
+
+static void memcg_wb_domain_exit(struct
[PATCH 25/48] writeback: make congestion functions per bdi_writeback
Currently, all congestion functions take bdi (backing_dev_info) and
always operate on the root wb (bdi->wb) and the congestion state from
the block layer is propagated only for the root blkcg. This patch
introduces {set|clear}_wb_congested() and wb_congested() which take a
bdi_writeback_congested and bdi_writeback respectively. The bdi
counterparts are now wrappers invoking the wb based functions on
@bdi->wb.

While converting clear_bdi_congested() to clear_wb_congested(), the
local variable declaration order between @wqh and @bit is swapped for
cosmetic reasons.

This patch just adds the new wb based functions. The following
patches will apply them.

v2: Updated for bdi_writeback_congested.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 include/linux/backing-dev-defs.h | 14
 include/linux/backing-dev.h      | 45
 mm/backing-dev.c                 | 22
 3 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index a1e9c40..eb38676 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -163,7 +163,17 @@ enum {
 	BLK_RW_SYNC	= 1,
 };
 
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
+void clear_wb_congested(struct bdi_writeback_congested *congested, int sync);
+void set_wb_congested(struct bdi_writeback_congested *congested, int sync);
+
+static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+	clear_wb_congested(bdi->wb.congested, sync);
+}
+
+static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+	set_wb_congested(bdi->wb.congested, sync);
+}
 
 #endif	/* __LINUX_BACKING_DEV_DEFS_H */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 8ae59df..2c498a2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,27 +167,13 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 	return sb->s_bdi;
 }
 
-static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
+static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 {
-	if (bdi->congested_fn)
-		return bdi->congested_fn(bdi->congested_data, bdi_bits);
-	return (bdi->wb.congested->state & bdi_bits);
-}
-
-static inline int bdi_read_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, 1 << WB_sync_congested);
-}
-
-static inline int bdi_write_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, 1 << WB_async_congested);
-}
+	struct backing_dev_info *bdi = wb->bdi;
 
-static inline int bdi_rw_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, (1 << WB_sync_congested) |
-				  (1 << WB_async_congested));
+	if (bdi->congested_fn)
+		return bdi->congested_fn(bdi->congested_data, cong_bits);
+	return wb->congested->state & cong_bits;
 }
 
 long congestion_wait(int sync, long timeout);
@@ -454,4 +440,25 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
+static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
+{
+	return wb_congested(&bdi->wb, cong_bits);
+}
+
+static inline int bdi_read_congested(struct backing_dev_info *bdi)
+{
+	return bdi_congested(bdi, 1 << WB_sync_congested);
+}
+
+static inline int bdi_write_congested(struct backing_dev_info *bdi)
+{
+	return bdi_congested(bdi, 1 << WB_async_congested);
+}
+
+static inline int bdi_rw_congested(struct backing_dev_info *bdi)
+{
+	return bdi_congested(bdi, (1 << WB_sync_congested) |
+				  (1 << WB_async_congested));
+}
+
 #endif	/* _LINUX_BACKING_DEV_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9d5a75e..7721e7a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -897,31 +897,31 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
-static atomic_t nr_bdi_congested[2];
+static atomic_t nr_wb_congested[2];
 
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+void clear_wb_congested(struct bdi_writeback_congested *congested, int sync)
 {
-	enum wb_state bit;
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
+	enum wb_state bit;
 
 	bit = sync ? WB_sync_congested : WB_async_congested;
-	if (test_and_clear_bit(bit, &bdi->wb.congested->state))
-		atomic_dec(&nr_bdi_congested[sync]);
+	if (test_and_clear_bit(bit, &congested->state))
+		atomic_dec(&nr_wb_congested[sync]);
 	smp_mb__after_atomic();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
 }
[PATCH 24/48] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback
Currently, balance_dirty_pages() always works on bdi->wb. This patch
updates it to work on the wb (bdi_writeback) matching memcg and blkcg
of the current task as that's what the inode is being dirtied against.

balance_dirty_pages_ratelimited() now pins the current wb and passes
it to balance_dirty_pages().

As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
visible behavior differences.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 mm/page-writeback.c | 18
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d5635cf..bfbd8d2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1337,6 +1337,7 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb,
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
+				struct bdi_writeback *wb,
 				unsigned long pages_dirtied)
 {
 	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
@@ -1352,8 +1353,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long task_ratelimit;
 	unsigned long dirty_ratelimit;
 	unsigned long pos_ratio;
-	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	struct bdi_writeback *wb = &bdi->wb;
+	struct backing_dev_info *bdi = wb->bdi;
 	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
 	unsigned long start_time = jiffies;
 
@@ -1575,14 +1575,20 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
  */
 void balance_dirty_pages_ratelimited(struct address_space *mapping)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	struct bdi_writeback *wb = &bdi->wb;
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
+	if (inode_cgwb_enabled(inode))
+		wb = wb_get_create_current(bdi, GFP_KERNEL);
+	if (!wb)
+		wb = &bdi->wb;
+
 	ratelimit = current->nr_dirtied_pause;
 	if (wb->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
@@ -1616,7 +1622,9 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(mapping, current->nr_dirtied);
+		balance_dirty_pages(mapping, wb, current->nr_dirtied);
+
+	wb_put(wb);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
 
--
2.1.0
[PATCH 22/48] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested
A blkg (blkcg_gq) can be congested and decongested independently from
other blkgs on the same request_queue. Accordingly, for cgroup
writeback support, the congestion status at bdi (backing_dev_info)
should be split and updated separately from matching blkg's.

This patch prepares by adding blkg->wb_congested and associating a
blkg with its matching per-blkcg bdi_writeback_congested on creation.

v2: Updated to associate bdi_writeback_congested instead of
    bdi_writeback.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Vivek Goyal
---
 block/blk-cgroup.c         | 17
 include/linux/blk-cgroup.h |  6
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d2b1cbf..8b6372b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -182,6 +182,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 				    struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
+	struct bdi_writeback_congested *wb_congested;
 	int i, ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
@@ -193,22 +194,30 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 		goto err_free_blkg;
 	}
 
+	wb_congested = wb_congested_get_create(&q->backing_dev_info,
+					       blkcg->css.id, GFP_ATOMIC);
+	if (!wb_congested) {
+		ret = -ENOMEM;
+		goto err_put_css;
+	}
+
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto err_put_css;
+			goto err_put_congested;
 		}
 	}
 	blkg = new_blkg;
+	blkg->wb_congested = wb_congested;
 
 	/* link parent */
 	if (blkcg_parent(blkcg)) {
 		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
 		if (WARN_ON_ONCE(!blkg->parent)) {
 			ret = -EINVAL;
-			goto err_put_css;
+			goto err_put_congested;
 		}
 		blkg_get(blkg->parent);
 	}
@@ -250,6 +259,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	blkg_put(blkg);
 	return ERR_PTR(ret);
 
+err_put_congested:
+	wb_congested_put(wb_congested);
 err_put_css:
 	css_put(&blkcg->css);
 err_free_blkg:
@@ -405,6 +416,8 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
 	if (blkg->parent)
 		blkg_put(blkg->parent);
 
+	wb_congested_put(blkg->wb_congested);
+
 	blkg_free(blkg);
 }
 EXPORT_SYMBOL_GPL(__blkg_release_rcu);
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 3033eb1..07a32b8 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -99,6 +99,12 @@ struct blkcg_gq {
 	struct hlist_node		blkcg_node;
 	struct blkcg			*blkcg;
 
+	/*
+	 * Each blkg gets congested separately and the congestion state is
+	 * propagated to the matching bdi_writeback_congested.
+	 */
+	struct bdi_writeback_congested	*wb_congested;
+
 	/* all non-root blkcg_gq's are guaranteed to have access to parent */
 	struct blkcg_gq			*parent;
 
--
2.1.0
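The blkg_create() hunk slots a new acquisition between two existing ones and retargets the later failure paths at a new unwind label. The pattern can be modelled in user space as below. struct res_model and create_model() are hypothetical, and the counters stand in for the css reference, the congested reference and the allocated blkg.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical resource counters standing in for real references. */
struct res_model {
	int css_refs;       /* step 1: reference on the css */
	int congested_refs; /* step 2: reference on the congested state */
	int blkgs;          /* step 3: allocated group objects */
};

/*
 * Acquire in order 1, 2, 3. On failure, jump to the label that undoes
 * exactly the steps already completed, in reverse order.
 */
static int create_model(struct res_model *r, bool fail_congested,
			bool fail_alloc)
{
	int ret;

	r->css_refs++;

	if (fail_congested) {
		ret = -1;
		goto err_put_css;
	}
	r->congested_refs++;

	if (fail_alloc) {
		ret = -1;
		goto err_put_congested; /* was err_put_css before step 2 existed */
	}
	r->blkgs++;
	return 0;

err_put_congested:
	r->congested_refs--;
err_put_css:
	r->css_refs--;
	return ret;
}
```

Forgetting to retarget a later goto would leak the step 2 reference, which is why both existing err_put_css jumps after the new acquisition are changed in the patch.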
[PATCH 28/48] writeback: implement and use mapping_congested()
In several places, bdi_congested() and its wrappers are used to
determine whether more IOs should be issued. With cgroup writeback
support, this question can't be answered solely based on the bdi
(backing_dev_info). It's dependent on whether the filesystem and bdi
support cgroup writeback and the blkcg the asking task belongs to.

This patch implements mapping_congested() and its wrappers which take
@mapping and @task and determine the congestion state considering
cgroup writeback for the combination. The new functions replace
bdi_*congested() calls in places where the query is about specific
mapping and task.

There are several filesystem users which also fit these criteria but
they should be updated when each filesystem implements cgroup
writeback support.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Vivek Goyal
---
 fs/fs-writeback.c           | 39
 include/linux/backing-dev.h | 27
 mm/fadvise.c                |  2
 mm/readahead.c              |  2
 mm/vmscan.c                 | 12
 5 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 48db5e6..015f359 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -130,6 +130,45 @@ static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	wb_queue_work(wb, work);
 }
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * mapping_congested - test whether a mapping is congested for a task
+ * @mapping: address space to test for congestion
+ * @task: task to test congestion for
+ * @cong_bits: mask of WB_[a]sync_congested bits to test
+ *
+ * Tests whether @mapping is congested for @task. @cong_bits is the mask
+ * of congestion bits to test and the return value is the mask of set bits.
+ *
+ * If cgroup writeback is enabled for @mapping, its congestion state for
+ * @task is determined by whether the cgwb (cgroup bdi_writeback) for the
+ * blkcg of %current on @mapping->backing_dev_info is congested; otherwise,
+ * the root's congestion state is used.
+ */
+int mapping_congested(struct address_space *mapping,
+		      struct task_struct *task, int cong_bits)
+{
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback *wb;
+	int ret = 0;
+
+	if (!inode || !inode_cgwb_enabled(inode))
+		return wb_congested(&bdi->wb, cong_bits);
+
+	rcu_read_lock();
+	wb = wb_find_current(bdi);
+	if (wb)
+		ret = wb_congested(wb, cong_bits);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mapping_congested);
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
 /**
  * bdi_start_writeback - start writeback
  * @bdi: the backing device to write from
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2c498a2..cfa23ab 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -230,6 +230,8 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 void __inode_attach_wb(struct inode *inode, struct page *page);
 void wb_memcg_offline(struct mem_cgroup *memcg);
 void wb_blkcg_offline(struct blkcg *blkcg);
+int mapping_congested(struct address_space *mapping, struct task_struct *task,
+		      int cong_bits);
 
 /**
  * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
@@ -438,8 +440,33 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 {
 }
 
+static inline int mapping_congested(struct address_space *mapping,
+				    struct task_struct *task, int cong_bits)
+{
+	return wb_congested(&inode_to_bdi(mapping->host)->wb, cong_bits);
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
+static inline int mapping_read_congested(struct address_space *mapping,
+					 struct task_struct *task)
+{
+	return mapping_congested(mapping, task, 1 << WB_sync_congested);
+}
+
+static inline int mapping_write_congested(struct address_space *mapping,
+					  struct task_struct *task)
+{
+	return mapping_congested(mapping, task, 1 << WB_async_congested);
+}
+
+static inline int mapping_rw_congested(struct address_space *mapping,
+				       struct task_struct *task)
+{
+	return mapping_congested(mapping, task, (1 << WB_sync_congested) |
+						(1 << WB_async_congested));
+}
+
 static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
 {
 	return wb_congested(&bdi->wb, cong_bits);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4a3907c..174727c 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -115,7 +115,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-
[PATCH 30/48] writeback: implement backing_dev_info->tot_write_bandwidth
cgroup writeback support needs to keep track of the sum of avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to distribute write workload. This patch adds bdi->tot_write_bandwidth and updates inode_wb_list_move_locked(), inode_wb_list_del_locked() and wb_update_write_bandwidth() to adjust it as wb's gain and lose dirty inodes and its avg_write_bandwidth gets updated.

As the update events are not synchronized with each other, bdi->tot_write_bandwidth is an atomic_long_t.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 fs/fs-writeback.c                | 7 ++-
 include/linux/backing-dev-defs.h | 2 ++
 mm/page-writeback.c              | 3 +++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dc4e399..9d85f59 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -87,6 +87,8 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
 		return false;
 	} else {
 		set_bit(WB_has_dirty_io, &wb->state);
+		atomic_long_add(wb->avg_write_bandwidth,
+				&wb->bdi->tot_write_bandwidth);
 		return true;
 	}
 }
@@ -94,8 +96,11 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
 static void wb_io_lists_depopulated(struct bdi_writeback *wb)
 {
 	if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
-	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
+	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
 		clear_bit(WB_has_dirty_io, &wb->state);
+		atomic_long_sub(wb->avg_write_bandwidth,
+				&wb->bdi->tot_write_bandwidth);
+	}
 }
 /**
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 7a94b78..d631a61 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -142,6 +142,8 @@ struct backing_dev_info {
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
+	atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
+
 	struct bdi_writeback wb; /* the root writeback info for this bdi */
 	struct bdi_writeback_congested wb_congested; /* its congested state */
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bfbd8d2..813e820 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -881,6 +881,9 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
 	avg += (old - avg) >> 3;
 out:
+	if (wb_has_dirty_io(wb))
+		atomic_long_add(avg - wb->avg_write_bandwidth,
+				&wb->bdi->tot_write_bandwidth);
 	wb->write_bandwidth = bw;
 	wb->avg_write_bandwidth = avg;
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
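[Editorial sketch] The accounting described in this patch can be modelled in userspace roughly as follows. This is only an illustration under stated assumptions, not the kernel code: struct bdi_model, struct wb_model and the wb_model_*() helpers are hypothetical names, C11 atomics stand in for atomic_long_t, and a plain bool stands in for the WB_has_dirty_io bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for backing_dev_info and bdi_writeback. */
struct bdi_model {
	atomic_long tot_write_bandwidth;	/* sum over wbs that have dirty IO */
};

struct wb_model {
	struct bdi_model *bdi;
	bool has_dirty_io;			/* stands in for WB_has_dirty_io */
	long avg_write_bandwidth;
};

/* The wb gained its first dirty inode: start counting its bandwidth. */
static void wb_model_populated(struct wb_model *wb)
{
	if (wb->has_dirty_io)
		return;
	wb->has_dirty_io = true;
	atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
			 wb->avg_write_bandwidth);
}

/* The wb lost its last dirty inode: stop counting its bandwidth. */
static void wb_model_depopulated(struct wb_model *wb)
{
	long old;

	if (!wb->has_dirty_io)
		return;
	wb->has_dirty_io = false;
	old = atomic_fetch_sub(&wb->bdi->tot_write_bandwidth,
			       wb->avg_write_bandwidth);
	/* The sum must never go negative in this model. */
	assert(old - wb->avg_write_bandwidth >= 0);
}

/* Bandwidth re-estimated: apply only the delta, and only if counted. */
static void wb_model_update_bandwidth(struct wb_model *wb, long avg)
{
	if (wb->has_dirty_io)
		atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
				 avg - wb->avg_write_bandwidth);
	wb->avg_write_bandwidth = avg;
}

static long bdi_model_total(struct bdi_model *bdi)
{
	return atomic_load(&bdi->tot_write_bandwidth);
}
```

The point of applying a delta in the update path is that the three events (populate, depopulate, bandwidth update) each touch the shared sum with a single atomic operation, so they do not need a common lock.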
[PATCH 29/48] writeback: implement WB_has_dirty_io wb_state flag
Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback) has any dirty inode by testing all three IO lists on each invocation without actively keeping track. For cgroup writeback support, a single bdi will host multiple wb's each of which will host dirty inodes separately and we'll need to make bdi_has_dirty_io(), which currently only represents the root wb, aggregate has_dirty_io from all member wb's, which requires tracking transitions in has_dirty_io state on each wb.

This patch introduces inode_wb_list_{move|del}_locked() to consolidate IO list operations leaving queue_io() the only other function which directly manipulates IO lists (via move_expired_inodes()). All three functions are updated to call wb_io_lists_[de]populated() which keep track of whether the wb has dirty inodes or not and record it using the new WB_has_dirty_io flag. inode_wb_list_move_locked()'s return value indicates whether the wb had no dirty inodes before.

mark_inode_dirty() is restructured so that the return value of inode_wb_list_move_locked() can be used for deciding whether to wake up the wb.

While at it, change {bdi|wb}_has_dirty_io()'s return values to bool. These functions were returning 0 and 1 before. Also, add a comment explaining the synchronization of wb_state flags.

v2: Updated to accommodate b_dirty_time.
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c| 104 ++- include/linux/backing-dev-defs.h | 1 + include/linux/backing-dev.h | 8 ++- mm/backing-dev.c | 2 +- 4 files changed, 86 insertions(+), 29 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 015f359..dc4e399 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -81,6 +81,66 @@ static inline struct inode *wb_inode(struct list_head *head) EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage); +static bool wb_io_lists_populated(struct bdi_writeback *wb) +{ + if (wb_has_dirty_io(wb)) { + return false; + } else { + set_bit(WB_has_dirty_io, >state); + return true; + } +} + +static void wb_io_lists_depopulated(struct bdi_writeback *wb) +{ + if (wb_has_dirty_io(wb) && list_empty(>b_dirty) && + list_empty(>b_io) && list_empty(>b_more_io)) + clear_bit(WB_has_dirty_io, >state); +} + +/** + * inode_wb_list_move_locked - move an inode onto a bdi_writeback IO list + * @inode: inode to be moved + * @wb: target bdi_writeback + * @head: one of @wb->b_{dirty|io|more_io} + * + * Move @inode->i_wb_list to @list of @wb and set %WB_has_dirty_io. + * Returns %true if @inode is the first occupant of the !dirty_time IO + * lists; otherwise, %false. + */ +static bool inode_wb_list_move_locked(struct inode *inode, + struct bdi_writeback *wb, + struct list_head *head) +{ + assert_spin_locked(>list_lock); + + list_move(>i_wb_list, head); + + /* dirty_time doesn't count as dirty_io until expiration */ + if (head != >b_dirty_time) + return wb_io_lists_populated(wb); + + wb_io_lists_depopulated(wb); + return false; +} + +/** + * inode_wb_list_del_locked - remove an inode from its bdi_writeback IO list + * @inode: inode to be removed + * @wb: bdi_writeback @inode is being removed from + * + * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and + * clear %WB_has_dirty_io if all are empty afterwards. 
+ */ +static void inode_wb_list_del_locked(struct inode *inode, +struct bdi_writeback *wb) +{ + assert_spin_locked(>list_lock); + + list_del_init(>i_wb_list); + wb_io_lists_depopulated(wb); +} + static void wb_wakeup(struct bdi_writeback *wb) { spin_lock_bh(>work_lock); @@ -215,7 +275,7 @@ void inode_wb_list_del(struct inode *inode) struct bdi_writeback *wb = inode_to_wb(inode); spin_lock(>list_lock); - list_del_init(>i_wb_list); + inode_wb_list_del_locked(inode, wb); spin_unlock(>list_lock); } @@ -230,7 +290,6 @@ void inode_wb_list_del(struct inode *inode) */ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb) { - assert_spin_locked(>list_lock); if (!list_empty(>b_dirty)) { struct inode *tail; @@ -238,7 +297,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb) if (time_before(inode->dirtied_when, tail->dirtied_when)) inode->dirtied_when = jiffies; } - list_move(>i_wb_list, >b_dirty); + inode_wb_list_move_locked(inode, wb, >b_dirty); } /* @@ -246,8 +305,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb) */ static void requeue_io(struct inode *inode, struct bdi_writeback *wb) { - assert_spin_locked(>list_lock); -
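[Editorial sketch] The transition tracking in this patch can be illustrated with a small userspace model. Everything here is an assumption made for illustration: per-list counters replace the kernel's list_heads, locking is omitted, and the names (struct wb_model, struct inode_model, inode_model_move(), inode_model_del()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical list identifiers; counters stand in for list_heads. */
enum wb_list {
	WB_LIST_NONE,
	WB_LIST_DIRTY,
	WB_LIST_IO,
	WB_LIST_MORE_IO,
	WB_LIST_DIRTY_TIME,
	WB_LIST_NR,
};

struct wb_model {
	unsigned int nr[WB_LIST_NR];	/* inodes on each list */
	bool has_dirty_io;		/* stands in for WB_has_dirty_io */
};

struct inode_model {
	enum wb_list on;		/* list the inode currently sits on */
};

/* Returns true only on the clean -> dirty transition. */
static bool wb_model_populated(struct wb_model *wb)
{
	if (wb->has_dirty_io)
		return false;
	wb->has_dirty_io = true;
	return true;
}

/* Clears the flag once all three IO lists are empty. */
static void wb_model_depopulated(struct wb_model *wb)
{
	if (wb->has_dirty_io && !wb->nr[WB_LIST_DIRTY] &&
	    !wb->nr[WB_LIST_IO] && !wb->nr[WB_LIST_MORE_IO])
		wb->has_dirty_io = false;
}

/* Move an inode; the return value says whether the wb just became dirty. */
static bool inode_model_move(struct inode_model *inode, struct wb_model *wb,
			     enum wb_list to)
{
	assert(to > WB_LIST_NONE && to < WB_LIST_NR);

	if (inode->on != WB_LIST_NONE)
		wb->nr[inode->on]--;
	wb->nr[to]++;
	inode->on = to;

	/* dirty_time does not count as dirty IO until expiration */
	if (to != WB_LIST_DIRTY_TIME)
		return wb_model_populated(wb);

	wb_model_depopulated(wb);
	return false;
}

/* Remove an inode from whatever list it is on. */
static void inode_model_del(struct inode_model *inode, struct wb_model *wb)
{
	if (inode->on != WB_LIST_NONE)
		wb->nr[inode->on]--;
	inode->on = WB_LIST_NONE;
	wb_model_depopulated(wb);
}
```

A caller in the position of mark_inode_dirty() would wake the wb only when inode_model_move() returns true, that is, only for the first dirty inode.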
[PATCH] xfs: use GFP_NOFS argument in radix_tree_preload
From: Byoungyoung Lee Following the convention of other file systems, GFP_NOFS should be used as an argument for radix_tree_preload() instead of GFP_KERNEL. Signed-off-by: Byoungyoung Lee Signed-off-by: Sanidhya Kashyap --- fs/xfs/xfs_mru_cache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_mru_cache.c b/fs/xfs/xfs_mru_cache.c index 30ecca3..f8a674d 100644 --- a/fs/xfs/xfs_mru_cache.c +++ b/fs/xfs/xfs_mru_cache.c @@ -437,7 +437,7 @@ xfs_mru_cache_insert( if (!mru || !mru->lists) return -EINVAL; - if (radix_tree_preload(GFP_KERNEL)) + if (radix_tree_preload(GFP_NOFS)) return -ENOMEM; INIT_LIST_HEAD(>list_node); -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 26/48] writeback, blkcg: restructure blk_{set|clear}_queue_congested()
blk_{set|clear}_queue_congested() take @q and set or clear, respectively, the congestion state of its bdi's root wb. Because bdi used to be able to handle congestion state only on the root wb, the callers of those functions tested whether the congestion is on the root blkcg and skipped if not. This is cumbersome and makes implementation of per cgroup bdi_writeback congestion state propagation difficult. This patch renames blk_{set|clear}_queue_congested() to blk_{set|clear}_congested(), and makes them take request_list instead of request_queue and test whether the specified request_list is the root one before updating bdi_writeback congestion state. This makes the tests in the callers unnecessary and simplifies them. As there are no external users of these functions, the definitions are moved from include/linux/blkdev.h to block/blk-core.c. This patch doesn't introduce any noticeable behavior difference. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara Cc: Vivek Goyal --- block/blk-core.c | 62 ++ include/linux/blkdev.h | 19 2 files changed, 37 insertions(+), 44 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index c44018a..cad26e3 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -63,6 +63,28 @@ struct kmem_cache *blk_requestq_cachep; */ static struct workqueue_struct *kblockd_workqueue; +static void blk_clear_congested(struct request_list *rl, int sync) +{ + if (rl != >q->root_rl) + return; +#ifdef CONFIG_CGROUP_WRITEBACK + clear_wb_congested(rl->blkg->wb_congested, sync); +#else + clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync); +#endif +} + +static void blk_set_congested(struct request_list *rl, int sync) +{ + if (rl != >q->root_rl) + return; +#ifdef CONFIG_CGROUP_WRITEBACK + set_wb_congested(rl->blkg->wb_congested, sync); +#else + set_wb_congested(rl->q->backing_dev_info.wb.congested, sync); +#endif +} + void blk_queue_congestion_threshold(struct request_queue *q) { int nr; @@ -827,13 +849,8 @@ static void 
__freed_request(struct request_list *rl, int sync) { struct request_queue *q = rl->q; - /* -* bdi isn't aware of blkcg yet. As all async IOs end up root -* blkcg anyway, just use root blkcg state. -*/ - if (rl == >root_rl && - rl->count[sync] < queue_congestion_off_threshold(q)) - blk_clear_queue_congested(q, sync); + if (rl->count[sync] < queue_congestion_off_threshold(q)) + blk_clear_congested(rl, sync); if (rl->count[sync] + 1 <= q->nr_requests) { if (waitqueue_active(>wait[sync])) @@ -866,25 +883,25 @@ static void freed_request(struct request_list *rl, unsigned int flags) int blk_update_nr_requests(struct request_queue *q, unsigned int nr) { struct request_list *rl; + int on_thresh, off_thresh; spin_lock_irq(q->queue_lock); q->nr_requests = nr; blk_queue_congestion_threshold(q); + on_thresh = queue_congestion_on_threshold(q); + off_thresh = queue_congestion_off_threshold(q); - /* congestion isn't cgroup aware and follows root blkcg for now */ - rl = >root_rl; - - if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q)) - blk_set_queue_congested(q, BLK_RW_SYNC); - else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q)) - blk_clear_queue_congested(q, BLK_RW_SYNC); + blk_queue_for_each_rl(rl, q) { + if (rl->count[BLK_RW_SYNC] >= on_thresh) + blk_set_congested(rl, BLK_RW_SYNC); + else if (rl->count[BLK_RW_SYNC] < off_thresh) + blk_clear_congested(rl, BLK_RW_SYNC); - if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q)) - blk_set_queue_congested(q, BLK_RW_ASYNC); - else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q)) - blk_clear_queue_congested(q, BLK_RW_ASYNC); + if (rl->count[BLK_RW_ASYNC] >= on_thresh) + blk_set_congested(rl, BLK_RW_ASYNC); + else if (rl->count[BLK_RW_ASYNC] < off_thresh) + blk_clear_congested(rl, BLK_RW_ASYNC); - blk_queue_for_each_rl(rl, q) { if (rl->count[BLK_RW_SYNC] >= q->nr_requests) { blk_set_rl_full(rl, BLK_RW_SYNC); } else { @@ -994,12 +1011,7 @@ static struct request *__get_request(struct 
request_list *rl, int rw_flags, } } } - /* -* bdi isn't aware of blkcg yet. As all async IOs end up -* root blkcg anyway, just use root blkcg state. -*/ - if (rl == >root_rl) -
[PATCH 27/48] writeback, blkcg: propagate non-root blkcg congestion state
Now that bdi layer can handle per-blkcg bdi_writeback_congested state, blk_{set|clear}_congested() can propagate non-root blkcg congestion state to them. This can be easily achieved by disabling the root_rl tests in blk_{set|clear}_congested(). Note that we still need those tests when !CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg wb's congestion state for events happening on other blkcgs.

v2: Updated for bdi_writeback_congested.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Vivek Goyal
---
 block/blk-core.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cad26e3..95488fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -65,23 +65,26 @@ static struct workqueue_struct *kblockd_workqueue;
 static void blk_clear_congested(struct request_list *rl, int sync)
 {
-	if (rl != &rl->q->root_rl)
-		return;
 #ifdef CONFIG_CGROUP_WRITEBACK
 	clear_wb_congested(rl->blkg->wb_congested, sync);
 #else
-	clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+	/*
+	 * If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
+	 * flip its congestion state for events on other blkcgs.
+	 */
+	if (rl == &rl->q->root_rl)
+		clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
 #endif
 }
 static void blk_set_congested(struct request_list *rl, int sync)
 {
-	if (rl != &rl->q->root_rl)
-		return;
 #ifdef CONFIG_CGROUP_WRITEBACK
 	set_wb_congested(rl->blkg->wb_congested, sync);
 #else
-	set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+	/* see blk_clear_congested() */
+	if (rl == &rl->q->root_rl)
+		set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
 #endif
 }
-- 
2.1.0
[PATCH 32/48] writeback: don't issue wb_writeback_work if clean
There are several places in fs/fs-writeback.c which queues wb_writeback_work without checking whether the target wb (bdi_writeback) has dirty inodes or not. The only thing wb_writeback_work does is writing back the dirty inodes for the target wb and queueing a work item for a clean wb is essentially noop. There are some side effects such as bandwidth stats being updated and triggering tracepoints but these don't affect the operation in any meaningful way. This patch makes all writeback_inodes_sb_nr() and sync_inodes_sb() skip wb_queue_work() if the target bdi is clean. Also, it moves dirtiness check from wakeup_flusher_threads() to __wb_start_writeback() so that all its callers benefit from the check. While the overhead incurred by scheduling a noop work isn't currently significant, the overhead may be higher with cgroup writeback support as we may end up issuing noop work items to a lot of clean wb's. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 7f44c02..3ceacbb 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -177,6 +177,9 @@ static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages, { struct wb_writeback_work *work; + if (!wb_has_dirty_io(wb)) + return; + /* * This is WB_SYNC_NONE writeback, so if allocation fails just * wakeup the thread for old dirty data writeback @@ -1207,11 +1210,8 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason) nr_pages = get_nr_dirty_pages(); rcu_read_lock(); - list_for_each_entry_rcu(bdi, _list, bdi_list) { - if (!bdi_has_dirty_io(bdi)) - continue; + list_for_each_entry_rcu(bdi, _list, bdi_list) __wb_start_writeback(>wb, nr_pages, false, reason); - } rcu_read_unlock(); } @@ -1445,11 +1445,12 @@ void writeback_inodes_sb_nr(struct super_block *sb, .nr_pages = nr, .reason = reason, }; + struct backing_dev_info *bdi = sb->s_bdi; - if (sb->s_bdi == 
_backing_dev_info) + if (!bdi_has_dirty_io(bdi) || bdi == _backing_dev_info) return; WARN_ON(!rwsem_is_locked(>s_umount)); - wb_queue_work(>s_bdi->wb, ); + wb_queue_work(>wb, ); wait_for_completion(); } EXPORT_SYMBOL(writeback_inodes_sb_nr); @@ -1527,13 +1528,14 @@ void sync_inodes_sb(struct super_block *sb) .reason = WB_REASON_SYNC, .for_sync = 1, }; + struct backing_dev_info *bdi = sb->s_bdi; /* Nothing to do? */ - if (sb->s_bdi == _backing_dev_info) + if (!bdi_has_dirty_io(bdi) || bdi == _backing_dev_info) return; WARN_ON(!rwsem_is_locked(>s_umount)); - wb_queue_work(>s_bdi->wb, ); + wb_queue_work(>wb, ); wait_for_completion(); wait_sb_inodes(sb); -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 31/48] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account
bdi_has_dirty_io() used to only reflect whether the root wb (bdi_writeback) has dirty inodes. For cgroup writeback support, it needs to take all active wb's into account. If any wb on the bdi has dirty inodes, bdi_has_dirty_io() should return true. To achieve that, as inode_wb_list_{move|del}_locked() now keep track of the dirty state transition of each wb, the number of dirty wbs can be counted in the bdi; however, bdi is already aggregating wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when there are any dirty inodes by ensuring wb->avg_write_bandwidth can't dip below 1. bdi_has_dirty_io() can simply test whether bdi->tot_write_bandwidth is zero or not. While this bumps the value of wb->avg_write_bandwidth to one when it used to be zero, this shouldn't cause any meaningful behavior difference. bdi_has_dirty_io() is made an inline function which tests whether ->tot_write_bandwidth is non-zero. Also, WARN_ON_ONCE()'s on its value are added to inode_wb_list_{move|del}_locked(). 
Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c| 5 +++-- include/linux/backing-dev-defs.h | 8 ++-- include/linux/backing-dev.h | 10 +- mm/backing-dev.c | 5 - mm/page-writeback.c | 10 +++--- 5 files changed, 25 insertions(+), 13 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 9d85f59..7f44c02 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -87,6 +87,7 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb) return false; } else { set_bit(WB_has_dirty_io, >state); + WARN_ON_ONCE(!wb->avg_write_bandwidth); atomic_long_add(wb->avg_write_bandwidth, >bdi->tot_write_bandwidth); return true; @@ -98,8 +99,8 @@ static void wb_io_lists_depopulated(struct bdi_writeback *wb) if (wb_has_dirty_io(wb) && list_empty(>b_dirty) && list_empty(>b_io) && list_empty(>b_more_io)) { clear_bit(WB_has_dirty_io, >state); - atomic_long_sub(wb->avg_write_bandwidth, - >bdi->tot_write_bandwidth); + WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth, + >bdi->tot_write_bandwidth) < 0); } } diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index d631a61..8c857d7 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -98,7 +98,7 @@ struct bdi_writeback { unsigned long dirtied_stamp; unsigned long written_stamp;/* pages written at bw_time_stamp */ unsigned long write_bandwidth; /* the estimated write bandwidth */ - unsigned long avg_write_bandwidth; /* further smoothed write bw */ + unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */ /* * The base dirty throttle rate, re-calculated on every 200ms. @@ -142,7 +142,11 @@ struct backing_dev_info { unsigned int min_ratio; unsigned int max_ratio, max_prop_frac; - atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */ + /* +* Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are +* any dirty wbs, which is depended upon by bdi_has_dirty(). 
+*/ + atomic_long_t tot_write_bandwidth; struct bdi_writeback wb; /* the root writeback info for this bdi */ struct bdi_writeback_congested wb_congested; /* its congested state */ diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index bab5927..433b308 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -29,7 +29,6 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, enum wb_reason reason); void bdi_start_background_writeback(struct backing_dev_info *bdi); void wb_workfn(struct work_struct *work); -bool bdi_has_dirty_io(struct backing_dev_info *bdi); void wb_wakeup_delayed(struct bdi_writeback *wb); extern spinlock_t bdi_lock; @@ -42,6 +41,15 @@ static inline bool wb_has_dirty_io(struct bdi_writeback *wb) return test_bit(WB_has_dirty_io, >state); } +static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi) +{ + /* +* @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are +* any dirty wbs. See wb_update_write_bandwidth(). +*/ + return atomic_long_read(>tot_write_bandwidth); +} + static inline void __add_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item, s64 amount) { diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 56d7622..eab5181 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -256,11 +256,6 @@ static int __init default_bdi_init(void) } subsys_initcall(default_bdi_init); -bool
Re: [PATCH 1/2] spi: Add SPI driver for Mikrotik RB4xx series boards
On Sun, Mar 22, 2015 at 12:25:24PM +0100, Bert Vermeulen wrote:

> It is bitbanging, at least on write. The hardware has a shift register that
> it uses for reads. The generic spi for this board's architecture (ath79)
> indeed uses spi-bitbang.

> This "fast SPI" thing is what makes this one different: the boot flash and
> MMC use regular SPI on the same bus as the CPLD. This CPLD needs this fast
> SPI: a mode where it shifts in two bits per clock. The second bit is
> apparently sent via the CS2 pin.

Please make sure that all this is visible to someone looking at the driver, it's really not at all clear what's going on just from reading the code.

> So I don't think spi-bitbang will do. I need to see about reworking things
> to use less custom queueing -- I'm not that familiar with this yet.

Mostly it's just a case of deleting the loops and using the core ones instead.
Re: [PATCH 6/7 linux-next] ASoC: fsi: constify of_device_id array
On Wed, Mar 18, 2015 at 05:49:01PM +0100, Fabian Frederick wrote: > of_device_id is always used as const. > (See driver.of_match_table and open firmware functions) Applied, thanks.
[PATCH 36/48] writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's
For cgroup writeback support, all bdi-wide operations should be distributed to all its wb's (bdi_writeback's). This patch updates laptop_mode_timer_fn() so that it invokes wb_start_writeback() on all wb's rather than just the root one. As the intent is writing out all dirty data, there's no reason to split the number of pages to write.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 mm/page-writeback.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7c3a555..fa37e73 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1723,14 +1723,20 @@ void laptop_mode_timer_fn(unsigned long data)
 	struct request_queue *q = (struct request_queue *)data;
 	int nr_pages = global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS);
+	struct bdi_writeback *wb;
+	struct wb_iter iter;
 	/*
 	 * We want to write everything out, not just down to the dirty
 	 * threshold
 	 */
-	if (bdi_has_dirty_io(&q->backing_dev_info))
-		wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
-					WB_REASON_LAPTOP_TIMER);
+	if (!bdi_has_dirty_io(&q->backing_dev_info))
+		return;
+
+	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
+		if (wb_has_dirty_io(wb))
+			wb_start_writeback(wb, nr_pages, true,
+					   WB_REASON_LAPTOP_TIMER);
 }
 /*
-- 
2.1.0
Re: [PATCH 3/7 linux-next] ASoC: kirkwood: constify of_device_id array
On Wed, Mar 18, 2015 at 05:48:58PM +0100, Fabian Frederick wrote: > of_device_id is always used as const. > (See driver.of_match_table and open firmware functions) Applied, thanks.
Re: [PATCH] ASoC:pcm512x: Fix divide by zero issue.
On Fri, Mar 20, 2015 at 09:13:45PM +, Howard Mitchell wrote: > If den=1 and pllin_rate>20MHz then den and num are adjusted to 0 > causing a divide by zero error a few lines further on. Therefore > this patch correctly scales num and den such that > pllin_rate/den < 20MHz as required in the device data sheet. Applied, thanks.
[PATCH 33/48] writeback: make bdi->min/max_ratio handling cgroup writeback aware
bdi->min/max_ratio are user-configurable per-bdi knobs which regulate dirty limit of each bdi. For cgroup writeback, they need to be further distributed across wb's (bdi_writeback's) belonging to the configured bdi.

This patch introduces wb_min_max_ratio() which distributes bdi->min/max_ratio according to a wb's proportion in the total active bandwidth of its bdi.

v2: Update wb_min_max_ratio() to fix a bug where both min and max were assigned the min value and avoid calculations when possible.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 mm/page-writeback.c | 50 ++
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8480a45..349e32b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -155,6 +155,46 @@ static unsigned long writeout_period_time = 0;
  */
 #define VM_COMPLETIONS_PERIOD_LEN (3*HZ)
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+			     unsigned long *minp, unsigned long *maxp)
+{
+	unsigned long this_bw = wb->avg_write_bandwidth;
+	unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
+	unsigned long long min = wb->bdi->min_ratio;
+	unsigned long long max = wb->bdi->max_ratio;
+
+	/*
+	 * @wb may already be clean by the time control reaches here and
+	 * the total may not include its bw.
+	 */
+	if (this_bw < tot_bw) {
+		if (min) {
+			min *= this_bw;
+			do_div(min, tot_bw);
+		}
+		if (max < 100) {
+			max *= this_bw;
+			do_div(max, tot_bw);
+		}
+	}
+
+	*minp = min;
+	*maxp = max;
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+			     unsigned long *minp, unsigned long *maxp)
+{
+	*minp = wb->bdi->min_ratio;
+	*maxp = wb->bdi->max_ratio;
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
 /*
  * In a memory zone, there is a certain amount of pages we consider
  * available for the page cache, which is essentially the number of
@@ -539,9 +579,9 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
  */
 unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
 {
-	struct backing_dev_info *bdi = wb->bdi;
 	u64 wb_dirty;
 	long numerator, denominator;
+	unsigned long wb_min_ratio, wb_max_ratio;
 	/*
 	 * Calculate this BDI's share of the dirty ratio.
@@ -552,9 +592,11 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
 	wb_dirty *= numerator;
 	do_div(wb_dirty, denominator);
-	wb_dirty += (dirty * bdi->min_ratio) / 100;
-	if (wb_dirty > (dirty * bdi->max_ratio) / 100)
-		wb_dirty = dirty * bdi->max_ratio / 100;
+	wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
+
+	wb_dirty += (dirty * wb_min_ratio) / 100;
+	if (wb_dirty > (dirty * wb_max_ratio) / 100)
+		wb_dirty = dirty * wb_max_ratio / 100;
 	return wb_dirty;
 }
-- 
2.1.0
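[Editorial sketch] The proportional split performed by wb_min_max_ratio() can be tried out in userspace with plain integer arithmetic. The helper below is a hypothetical stand-in (wb_ratio_share() is not a kernel function); it takes the bandwidths and bdi ratios as direct arguments and uses 64-bit division in place of do_div().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the ratio split: scale the bdi-wide
 * min/max percentages by this wb's share of the total bandwidth.
 */
static void wb_ratio_share(unsigned long this_bw, unsigned long tot_bw,
			   unsigned int bdi_min, unsigned int bdi_max,
			   unsigned long *minp, unsigned long *maxp)
{
	uint64_t min = bdi_min;
	uint64_t max = bdi_max;

	/* Ratios are percentages in this model. */
	assert(bdi_min <= bdi_max && bdi_max <= 100);

	/*
	 * Only scale when the total really is larger than our own share;
	 * a stale or zero total leaves the bdi-wide values untouched and
	 * also avoids a division by zero.
	 */
	if (this_bw < tot_bw) {
		if (min)		/* 0 stays 0, skip the math */
			min = min * this_bw / tot_bw;
		if (max < 100)		/* 100 means "no cap", keep it */
			max = max * this_bw / tot_bw;
	}

	*minp = (unsigned long)min;
	*maxp = (unsigned long)max;
}
```

For instance, a wb carrying 25 out of a total of 100 bandwidth units on a bdi configured with min 20 and max 80 ends up with min 5 and max 20 in this model, while an uncapped max of 100 is passed through unchanged.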
Re: [PATCH][ASoC]Add ability to remove rate constraints from generic ASoC AC'97 CODEC driver
On Wed, Mar 11, 2015 at 12:28:19AM +0100, Maciej S. Szmigiero wrote:

> Add ability to remove rate constraints from generic ASoC AC'97 CODEC
> driver via passed platform data, make it selectable in config.

Please use subject lines matching the style for the subsystem. This is helpful for identifying relevant patches and not getting your messages deleted unread...

> This way this driver can be used for platforms which don't need
> specialized AC'97 CODEC drivers while at the same time avoiding
> code duplication from implementing equivalent functionality in
> a controller driver.

I'm sorry but this just doesn't explain what this patch is intended to accomplish. If we can talk to the AC'97 CODEC at all we can already identify whatever constraints it has by looking at the ID registers so it's not clear when or why a platform would need to use this. It feels like there is some underlying problem that you're trying to address.
[PATCH 38/48] writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info
bdi_start_background_writeback() currently takes @bdi and kicks the root wb (bdi_writeback). In preparation for cgroup writeback support, make it take wb instead.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 fs/fs-writeback.c           | 12 ++--
 include/linux/backing-dev.h |  2 +-
 mm/page-writeback.c         |  4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c342c05..c9bda4d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -226,23 +226,23 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 }
 /**
- * bdi_start_background_writeback - start background writeback
- * @bdi: the backing device to write from
+ * wb_start_background_writeback - start background writeback
+ * @wb: bdi_writback to write from
  *
  * Description:
  *   This makes sure WB_SYNC_NONE background writeback happens. When
- *   this function returns, it is only guaranteed that for given BDI
+ *   this function returns, it is only guaranteed that for given wb
  *   some IO is happening if we are over background dirty threshold.
  *   Caller need not hold sb s_umount semaphore.
  */
-void bdi_start_background_writeback(struct backing_dev_info *bdi)
+void wb_start_background_writeback(struct bdi_writeback *wb)
 {
 	/*
 	 * We just wake up the flusher thread. It will perform background
 	 * writeback as soon as there is no other work to do.
 	 */
-	trace_writeback_wake_background(bdi);
-	wb_wakeup(&bdi->wb);
+	trace_writeback_wake_background(wb->bdi);
+	wb_wakeup(wb);
 }
 /*
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 216a016..fee39cd 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -27,7 +27,7 @@ void bdi_unregister(struct backing_dev_info *bdi);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 			bool range_cyclic, enum wb_reason reason);
-void bdi_start_background_writeback(struct backing_dev_info *bdi);
+void wb_start_background_writeback(struct bdi_writeback *wb);
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3e698f4..fd441ea 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1456,7 +1456,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 		if (unlikely(!writeback_in_progress(wb)))
-			bdi_start_background_writeback(bdi);
+			wb_start_background_writeback(wb);
 		if (!strictlimit)
 			wb_dirty_limits(wb, dirty_thresh, background_thresh,
@@ -1588,7 +1588,7 @@ pause:
 		return;
 	if (nr_reclaimable > background_thresh)
-		bdi_start_background_writeback(bdi);
+		wb_start_background_writeback(wb);
 }
 static DEFINE_PER_CPU(int, bdp_ratelimits);
-- 
2.1.0
Re: [PATCH] ARM: S3C64XX: Use fixed IRQ bases to avoid conflicts on Cragganmore
On Sun, Mar 22, 2015 at 10:40:41AM +, Charles Keepax wrote:

> There are two PMICs on Cragganmore, currently one dynamically assigns
> its IRQ base and the other uses a fixed base. It is possible for the
> statically assigned PMIC to fail if its IRQ is taken by the dynamically
> assigned one. Fix this by statically assigning both the IRQ bases.
>
> Signed-off-by: Charles Keepax

Reviewed-by: Mark Brown

This probably wants to go to stable as well.
Re: [PATCH] ASoC:pcm512x: Make PLL lock output selectable via device tree.
On Fri, Mar 20, 2015 at 09:22:43PM +, Howard Mitchell wrote:

> +	if (pcm512x->pll_lock) {
> +		if (of_property_read_u32(np, "pll-lock", &val) >= 0) {
> +			if (val > 6) {
> +				dev_err(dev, "Invalid pll-lock\n");
> +				ret = -EINVAL;
> +				goto err_clk;
> +			}
> +			pcm512x->pll_lock = val;
> +		}

This breaks existing boards which rely on GPIO 4 being set as the lock output. This is very unfortunate since it's a silly thing for the driver to default to but nonetheless we should really continue to support them - at a guess Peter's board is relying on this, and even if it isn't someone else's might.
[PATCH 40/48] writeback: add wb_writeback_work->auto_free
Currently, a wb_writeback_work is freed automatically on completion if
it doesn't have ->done set.  Add wb_writeback_work->auto_free to make
the switch explicit.  This will help cgroup writeback support where
waiting for completion and whether to free automatically don't
necessarily move together.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 fs/fs-writeback.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 75d5e5c..25504be 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -47,6 +47,7 @@ struct wb_writeback_work {
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	unsigned int auto_free:1;	/* free on completion */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -256,6 +257,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
 	work->reason	= reason;
+	work->auto_free	= 1;
 
 	wb_queue_work(wb, work);
 }
@@ -1133,19 +1135,16 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
 	set_bit(WB_writeback_running, &wb->state);
 	while ((work = get_next_work_item(wb)) != NULL) {
+		struct completion *done = work->done;
 
 		trace_writeback_exec(wb->bdi, work);
 
 		wrote += wb_writeback(wb, work);
 
-		/*
-		 * Notify the caller of completion if this is a synchronous
-		 * work item, otherwise just free it.
-		 */
-		if (work->done)
-			complete(work->done);
-		else
+		if (work->auto_free)
 			kfree(work);
+		if (done)
+			complete(done);
 	}
 
 	/*
-- 
2.1.0
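For readers following along, the ordering in the last hunk can be modelled in userspace. This is only a sketch under stated assumptions: struct work, finish_work() and the int counter standing in for struct completion are hypothetical names, not kernel API. It shows why ->done is read into a local before the item may be freed.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct wb_writeback_work. */
struct work {
	bool auto_free;	/* free on completion */
	int *done;	/* stand-in for struct completion *: bumped when done */
};

/* Models the tail of the wb_do_writeback() loop body after this patch. */
static void finish_work(struct work *work)
{
	int *done = work->done;	/* read first: work may be gone below */

	if (work->auto_free)
		free(work);
	if (done)
		(*done)++;	/* plays the role of complete(done) */
}
```

With the two flags decoupled, an item can be both auto-freed and waited upon, so touching work->done after the free would be a use-after-free in this model.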
[PATCH 41/48] writeback: implement bdi_wait_for_completion()
If the completion of a wb_writeback_work can be waited upon by setting its ->done to a struct completion and waiting on it; however, for cgroup writeback support, it's necessary to issue multiple work items to multiple bdi_writebacks and wait for the completion of all. This patch implements wb_completion which can wait for multiple work items and replaces the struct completion with it. It can be defined using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and waited for by wb_wait_for_completion(). Nobody currently issues multiple work items and this patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c| 57 +++- include/linux/backing-dev-defs.h | 2 ++ mm/backing-dev.c | 1 + 3 files changed, 48 insertions(+), 12 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 25504be..944e53d 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -34,6 +34,10 @@ */ #define MIN_WRITEBACK_PAGES(4096UL >> (PAGE_CACHE_SHIFT - 10)) +struct wb_completion { + atomic_tcnt; +}; + /* * Passed into wb_writeback(), essentially a subset of writeback_control */ @@ -51,9 +55,21 @@ struct wb_writeback_work { enum wb_reason reason; /* why was writeback initiated? */ struct list_head list; /* pending work list */ - struct completion *done;/* set if the caller waits */ + struct wb_completion *done; /* set if the caller waits */ }; +/* + * If one wants to wait for one or more wb_writeback_works, each work's + * ->done should be set to a wb_completion defined using the following + * macro. Once all work items are issued with wb_queue_work(), the caller + * can wait for the completion of all using wb_wait_for_completion(). Work + * items which are waited upon aren't freed automatically on completion. 
+ */ +#define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \ + struct wb_completion cmpl = { \ + .cnt= ATOMIC_INIT(1), \ + } + static inline struct inode *wb_inode(struct list_head *head) { return list_entry(head, struct inode, i_wb_list); @@ -149,17 +165,34 @@ static void wb_queue_work(struct bdi_writeback *wb, trace_writeback_queue(wb->bdi, work); spin_lock_bh(>work_lock); - if (!test_bit(WB_registered, >state)) { - if (work->done) - complete(work->done); + if (!test_bit(WB_registered, >state)) goto out_unlock; - } + if (work->done) + atomic_inc(>done->cnt); list_add_tail(>list, >work_list); mod_delayed_work(bdi_wq, >dwork, 0); out_unlock: spin_unlock_bh(>work_lock); } +/** + * wb_wait_for_completion - wait for completion of bdi_writeback_works + * @bdi: bdi work items were issued to + * @done: target wb_completion + * + * Wait for one or more work items issued to @bdi with their ->done field + * set to @done, which should have been defined with + * DEFINE_WB_COMPLETION_ONSTACK(). This function returns after all such + * work items are completed. Work items which are waited upon aren't freed + * automatically on completion. 
+ */ +static void wb_wait_for_completion(struct backing_dev_info *bdi, + struct wb_completion *done) +{ + atomic_dec(>cnt); /* put down the initial count */ + wait_event(bdi->wb_waitq, !atomic_read(>cnt)); +} + #ifdef CONFIG_CGROUP_WRITEBACK /** @@ -1135,7 +1168,7 @@ static long wb_do_writeback(struct bdi_writeback *wb) set_bit(WB_writeback_running, >state); while ((work = get_next_work_item(wb)) != NULL) { - struct completion *done = work->done; + struct wb_completion *done = work->done; trace_writeback_exec(wb->bdi, work); @@ -1143,8 +1176,8 @@ static long wb_do_writeback(struct bdi_writeback *wb) if (work->auto_free) kfree(work); - if (done) - complete(done); + if (done && atomic_dec_and_test(>cnt)) + wake_up_all(>bdi->wb_waitq); } /* @@ -1448,7 +1481,7 @@ void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, enum wb_reason reason) { - DECLARE_COMPLETION_ONSTACK(done); + DEFINE_WB_COMPLETION_ONSTACK(done); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_NONE, @@ -1463,7 +1496,7 @@ void writeback_inodes_sb_nr(struct super_block *sb, return; WARN_ON(!rwsem_is_locked(>s_umount)); wb_queue_work(>wb, ); - wait_for_completion(); +
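The counting scheme described in the changelog above can be sketched in userspace with C11 atomics. This is a minimal model, not the kernel code: struct completion_count and the completion_*() helpers are hypothetical names, and the sleeping/wakeup side (wait_event() and wake_up_all() in the patch) is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of struct wb_completion. */
struct completion_count {
	atomic_int cnt;
};

/* Mirrors DEFINE_WB_COMPLETION_ONSTACK(): the issuer holds one reference. */
static void completion_init(struct completion_count *c)
{
	atomic_init(&c->cnt, 1);
}

/* One reference per queued work item, like atomic_inc() in wb_queue_work(). */
static void completion_get(struct completion_count *c)
{
	atomic_fetch_add(&c->cnt, 1);
}

/* Drop one reference; true means the count hit zero and waiters may proceed. */
static bool completion_put(struct completion_count *c)
{
	return atomic_fetch_sub(&c->cnt, 1) == 1;
}

/* The condition the waiter checks, like !atomic_read(&done->cnt). */
static bool completion_done(struct completion_count *c)
{
	return atomic_load(&c->cnt) == 0;
}
```

Starting the count at 1 keeps it from reaching zero while the issuer is still queueing items; the issuer drops that initial reference only when it starts waiting.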
Re: [PATCH 2/7 linux-next] ASoC: fsl: constify of_device_id array
On Wed, Mar 18, 2015 at 05:48:57PM +0100, Fabian Frederick wrote:

> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.
[PATCH 44/48] writeback: make writeback initiation functions handle multiple bdi_writeback's
[try_]writeback_inodes_sb[_nr]() and sync_inodes_sb() currently only handle dirty inodes on the root wb (bdi_writeback) of the target bdi. This patch implements bdi_split_work_to_wbs() and use it to make these functions handle multiple wb's. bdi_split_work_to_wbs() takes a base wb_writeback_work and create clones of it and issue them to the wb's of the target bdi. The base work's nr_pages is distributed using wb_split_bdi_pages() - ie. according to each wb's write bandwidth's proportion in the bdi. Cloning a bdi involves memory allocation which may fail. In such cases, bdi_split_work_to_wbs() issues the base work directly and waits for its completion before proceeding to the next wb to guarantee forward progress and correctness under memory pressure. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c | 96 --- 1 file changed, 91 insertions(+), 5 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index be374ae..57b2282 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -289,6 +289,80 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw); } +/** + * wb_clone_and_queue_work - clone a wb_writeback_work and issue it to a wb + * @wb: target bdi_writeback + * @base_work: source wb_writeback_work + * + * Try to make a clone of @base_work and issue it to @wb. If cloning + * succeeds, %true is returned; otherwise, @base_work is issued directly + * and %false is returned. In the latter case, the caller is required to + * wait for @base_work's completion using wb_wait_for_single_work(). + * + * A clone is auto-freed on completion. @base_work never is. 
+ */ +static bool wb_clone_and_queue_work(struct bdi_writeback *wb, + struct wb_writeback_work *base_work) +{ + struct wb_writeback_work *work; + + work = kmalloc(sizeof(*work), GFP_ATOMIC); + if (work) { + *work = *base_work; + work->auto_free = 1; + work->single_wait = 0; + } else { + work = base_work; + work->auto_free = 0; + work->single_wait = 1; + } + work->single_done = 0; + wb_queue_work(wb, work); + return work != base_work; +} + +/** + * bdi_split_work_to_wbs - split a wb_writeback_work to all wb's of a bdi + * @bdi: target backing_dev_info + * @base_work: wb_writeback_work to issue + * @skip_if_busy: skip wb's which already have writeback in progress + * + * Split and issue @base_work to all wb's (bdi_writeback's) of @bdi which + * have dirty inodes. If @base_work->nr_page isn't %LONG_MAX, it's + * distributed to the busy wbs according to each wb's proportion in the + * total active write bandwidth of @bdi. + */ +static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, + struct wb_writeback_work *base_work, + bool skip_if_busy) +{ + long nr_pages = base_work->nr_pages; + int next_blkcg_id = 0; + struct bdi_writeback *wb; + struct wb_iter iter; + + might_sleep(); + + if (!bdi_has_dirty_io(bdi)) + return; +restart: + rcu_read_lock(); + bdi_for_each_wb(wb, bdi, , next_blkcg_id) { + if (!wb_has_dirty_io(wb) || + (skip_if_busy && writeback_in_progress(wb))) + continue; + + base_work->nr_pages = wb_split_bdi_pages(wb, nr_pages); + if (!wb_clone_and_queue_work(wb, base_work)) { + next_blkcg_id = wb->blkcg_css->id + 1; + rcu_read_unlock(); + wb_wait_for_single_work(bdi, base_work); + goto restart; + } + } + rcu_read_unlock(); +} + #else /* CONFIG_CGROUP_WRITEBACK */ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) @@ -296,6 +370,21 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) return nr_pages; } +static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, + struct wb_writeback_work *base_work, + bool 
skip_if_busy) +{ + might_sleep(); + + if (bdi_has_dirty_io(bdi) && + (!skip_if_busy || !writeback_in_progress(>wb))) { + base_work->auto_free = 0; + base_work->single_wait = 0; + base_work->single_done = 0; + wb_queue_work(>wb, base_work); + } +} + #endif /* CONFIG_CGROUP_WRITEBACK */ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, @@ -1528,10 +1617,7 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, return; WARN_ON(!rwsem_is_locked(>s_umount)); - if (skip_if_busy &&
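The clone-or-fall-back step in wb_clone_and_queue_work() can be illustrated with a small userspace sketch. The names here (struct work, clone_or_base(), failing_alloc()) are hypothetical and the allocator is passed in only so the failure path can be exercised; queueing and waiting are omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct wb_writeback_work. */
struct work {
	long nr_pages;
	bool auto_free;
	bool single_wait;
};

/*
 * Try to clone @base.  On success the clone is marked auto-free; on
 * allocation failure @base itself is returned and marked so that the
 * caller knows it must wait for it before reusing it.
 */
static struct work *clone_or_base(struct work *base, void *(*alloc)(size_t))
{
	struct work *work = alloc(sizeof(*work));

	if (work) {
		*work = *base;
		work->auto_free = true;
		work->single_wait = false;
	} else {
		work = base;
		work->auto_free = false;
		work->single_wait = true;
	}
	return work;
}

/* Allocator that always fails, to simulate memory pressure. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

The fallback trades parallelism for forward progress: under memory pressure the single base item is issued to one wb at a time, which matches the restart loop in bdi_split_work_to_wbs().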
[PATCH 34/48] writeback: implement bdi_for_each_wb()
This will be used to implement bdi-wide operations which should be distributed across all its cgroup bdi_writebacks. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- include/linux/backing-dev.h | 63 + 1 file changed, 63 insertions(+) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 433b308..9dc4eea 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -384,6 +384,61 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode) return inode->i_wb; } +struct wb_iter { + int start_blkcg_id; + struct radix_tree_iter tree_iter; + void**slot; +}; + +static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter, + struct backing_dev_info *bdi) +{ + struct radix_tree_iter *titer = >tree_iter; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + if (iter->start_blkcg_id >= 0) { + iter->slot = radix_tree_iter_init(titer, iter->start_blkcg_id); + iter->start_blkcg_id = -1; + } else { + iter->slot = radix_tree_next_slot(iter->slot, titer, 0); + } + + if (!iter->slot) + iter->slot = radix_tree_next_chunk(>cgwb_tree, titer, 0); + if (iter->slot) + return *iter->slot; + return NULL; +} + +static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter, + struct backing_dev_info *bdi, + int start_blkcg_id) +{ + iter->start_blkcg_id = start_blkcg_id; + + if (start_blkcg_id) + return __wb_iter_next(iter, bdi); + else + return >wb; +} + +/** + * bdi_for_each_wb - walk all wb's of a bdi in ascending blkcg ID order + * @wb_cur: cursor struct bdi_writeback pointer + * @bdi: bdi to walk wb's of + * @iter: pointer to struct wb_iter to be used as iteration buffer + * @start_blkcg_id: blkcg ID to start iteration from + * + * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending + * blkcg ID order starting from @start_blkcg_id. @iter is struct wb_iter + * to be used as temp storage during iteration. rcu_read_lock() must be + * held throughout iteration. 
+ */ +#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \ + for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id); \ +(wb_cur); (wb_cur) = __wb_iter_next(iter, bdi)) + #else /* CONFIG_CGROUP_WRITEBACK */ static inline bool inode_cgwb_enabled(struct inode *inode) @@ -446,6 +501,14 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg) { } +struct wb_iter { + int next_id; +}; + +#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \ + for ((iter)->next_id = (start_blkcg_id);\ +({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); ) + static inline int mapping_congested(struct address_space *mapping, struct task_struct *task, int cong_bits) { -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
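The !CONFIG_CGROUP_WRITEBACK variant of bdi_for_each_wb() is easy to misread, so here is a userspace sketch of the same idea. The struct definitions are hypothetical minimal stand-ins, and the loop condition uses a plain parenthesised assignment rather than the GCC statement expression in the patch; the intended behaviour is the same: visit the embedded root wb once if iteration starts at ID 0, otherwise visit nothing.

```c
#include <stddef.h>

/* Hypothetical minimal stand-ins for the kernel structures. */
struct wb { int id; };
struct bdi { struct wb wb; };
struct wb_iter { int next_id; };

/* Yields &bdi->wb exactly once when start_id is 0, never otherwise. */
#define bdi_for_each_wb(wb_cur, bdi, iter, start_id)			\
	for ((iter)->next_id = (start_id);				\
	     ((wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL) != NULL; )

/* Counts how many times the loop body sees the root wb. */
static int count_wbs(struct bdi *bdi, int start_id)
{
	struct wb_iter iter;
	struct wb *wb;
	int n = 0;

	bdi_for_each_wb(wb, bdi, &iter, start_id)
		n += (wb == &bdi->wb);
	return n;
}
```

Starting from a non-zero ID yields no iterations, which is what lets a caller restart from "last ID + 1" without special-casing the non-cgroup build.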
Re: [PATCH 7/7 linux-next] ASoC: rsnd: constify of_device_id array
On Wed, Mar 18, 2015 at 05:49:02PM +0100, Fabian Frederick wrote:

> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.
[PATCH 35/48] writeback: remove bdi_start_writeback()
bdi_start_writeback() is a thin wrapper on top of __wb_start_writeback() which is used only by laptop_mode_timer_fn(). This patches removes bdi_start_writeback(), renames __wb_start_writeback() to wb_start_writeback() and makes laptop_mode_timer_fn() use it instead. This doesn't cause any functional difference and will ease making laptop_mode_timer_fn() cgroup writeback aware. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c | 68 + include/linux/backing-dev.h | 4 +-- mm/page-writeback.c | 4 +-- 3 files changed, 29 insertions(+), 47 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 3ceacbb..c24d6fd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -172,33 +172,6 @@ out_unlock: spin_unlock_bh(>work_lock); } -static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages, -bool range_cyclic, enum wb_reason reason) -{ - struct wb_writeback_work *work; - - if (!wb_has_dirty_io(wb)) - return; - - /* -* This is WB_SYNC_NONE writeback, so if allocation fails just -* wakeup the thread for old dirty data writeback -*/ - work = kzalloc(sizeof(*work), GFP_ATOMIC); - if (!work) { - trace_writeback_nowork(wb->bdi); - wb_wakeup(wb); - return; - } - - work->sync_mode = WB_SYNC_NONE; - work->nr_pages = nr_pages; - work->range_cyclic = range_cyclic; - work->reason= reason; - - wb_queue_work(wb, work); -} - #ifdef CONFIG_CGROUP_WRITEBACK /** @@ -238,22 +211,31 @@ EXPORT_SYMBOL_GPL(mapping_congested); #endif /* CONFIG_CGROUP_WRITEBACK */ -/** - * bdi_start_writeback - start writeback - * @bdi: the backing device to write from - * @nr_pages: the number of pages to write - * @reason: reason why some writeback work was initiated - * - * Description: - * This does WB_SYNC_NONE opportunistic writeback. The IO is only - * started when this function returns, we make no guarantees on - * completion. Caller need not hold sb s_umount semaphore. 
- * - */ -void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, - enum wb_reason reason) +void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, + bool range_cyclic, enum wb_reason reason) { - __wb_start_writeback(>wb, nr_pages, true, reason); + struct wb_writeback_work *work; + + if (!wb_has_dirty_io(wb)) + return; + + /* +* This is WB_SYNC_NONE writeback, so if allocation fails just +* wakeup the thread for old dirty data writeback +*/ + work = kzalloc(sizeof(*work), GFP_ATOMIC); + if (!work) { + trace_writeback_nowork(wb->bdi); + wb_wakeup(wb); + return; + } + + work->sync_mode = WB_SYNC_NONE; + work->nr_pages = nr_pages; + work->range_cyclic = range_cyclic; + work->reason= reason; + + wb_queue_work(wb, work); } /** @@ -1211,7 +1193,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason) rcu_read_lock(); list_for_each_entry_rcu(bdi, _list, bdi_list) - __wb_start_writeback(>wb, nr_pages, false, reason); + wb_start_writeback(>wb, nr_pages, false, reason); rcu_read_unlock(); } diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 9dc4eea..81e39ff 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -25,8 +25,8 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent, int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev); void bdi_unregister(struct backing_dev_info *bdi); int __must_check bdi_setup_and_register(struct backing_dev_info *, char *); -void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, - enum wb_reason reason); +void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, + bool range_cyclic, enum wb_reason reason); void bdi_start_background_writeback(struct backing_dev_info *bdi); void wb_workfn(struct work_struct *work); void wb_wakeup_delayed(struct bdi_writeback *wb); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 349e32b..7c3a555 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1729,8 
+1729,8 @@ void laptop_mode_timer_fn(unsigned long data) * threshold */ if (bdi_has_dirty_io(>backing_dev_info)) - bdi_start_writeback(>backing_dev_info, nr_pages, - WB_REASON_LAPTOP_TIMER); + wb_start_writeback(>backing_dev_info.wb, nr_pages, true, + WB_REASON_LAPTOP_TIMER); } /* -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe
Re: [PATCH 4/7 linux-next] ASoC: rt5631: constify of_device_id array
On Wed, Mar 18, 2015 at 05:48:59PM +0100, Fabian Frederick wrote:

> of_device_id is always used as const.
> (See driver.of_match_table and open firmware functions)

Applied, thanks.
[PATCH 45/48] writeback: dirty inodes against their matching cgroup bdi_writeback's
__mark_inode_dirty() always dirtied the inode against the root wb (bdi_writeback). The previous patches added all the infrastructure necessary to attribute an inode against the wb of the dirtying cgroup. This patch updates __mark_inode_dirty() so that it uses the wb associated with the inode instead of unconditionally using the root one. Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being dirtied against the root wb. v2: Updated for per-inode wb association. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 57b2282..890cff1 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1442,7 +1442,6 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode) void __mark_inode_dirty(struct inode *inode, int flags) { struct super_block *sb = inode->i_sb; - struct backing_dev_info *bdi = NULL; int dirtytime; trace_writeback_mark_inode_dirty(inode, flags); @@ -1512,21 +1511,21 @@ void __mark_inode_dirty(struct inode *inode, int flags) * reposition it (that would break b_dirty time-ordering). */ if (!was_dirty) { + struct bdi_writeback *wb = inode_to_wb(inode); bool wakeup_bdi = false; - bdi = inode_to_bdi(inode); spin_unlock(>i_lock); - spin_lock(>wb.list_lock); + spin_lock(>list_lock); - WARN(bdi_cap_writeback_dirty(bdi) && -!test_bit(WB_registered, >wb.state), -"bdi-%s not registered\n", bdi->name); + WARN(bdi_cap_writeback_dirty(wb->bdi) && +!test_bit(WB_registered, >state), +"bdi-%s not registered\n", wb->bdi->name); inode->dirtied_when = jiffies; - wakeup_bdi = inode_wb_list_move_locked(inode, >wb, - dirtytime ? >wb.b_dirty_time : - >wb.b_dirty); - spin_unlock(>wb.list_lock); + wakeup_bdi = inode_wb_list_move_locked(inode, wb, + dirtytime ? 
>b_dirty_time : + >b_dirty); + spin_unlock(>list_lock); trace_writeback_dirty_inode_enqueue(inode); /* @@ -1535,8 +1534,8 @@ void __mark_inode_dirty(struct inode *inode, int flags) * to make sure background write-back happens * later. */ - if (bdi_cap_writeback_dirty(bdi) && wakeup_bdi) - wb_wakeup_delayed(>wb); + if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi) + wb_wakeup_delayed(wb); return; } } -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 47/48] mpage: make __mpage_writepage() honor cgroup writeback
__mpage_writepage() is used to implement mpage_writepages() which in
turn is used for ->writepages() of various filesystems.  All writeback
logic is now updated to handle cgroup writeback and the block cgroup to
issue IOs for is encoded in writeback_control and can be retrieved from
the inode; however, __mpage_writepage() currently ignores the blkcg
indicated by the inode and issues all bio's without explicit blkcg
association.

This patch updates __mpage_writepage() so that the issued bio's are
associated with inode_to_writeback_blkcg_css(inode).

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Andrew Morton
Cc: Alexander Viro
---
 fs/mpage.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..a3ccb0b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -605,6 +605,8 @@ alloc_new:
 				bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
 			goto confused;
+
+		bio_associate_blkcg(bio, inode_to_wb_blkcg_css(inode));
 	}
 
 	/*
-- 
2.1.0
[PATCH 39/48] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's
wakeup_flusher_threads() currently only starts writeback on the root wb (bdi_writeback). For cgroup writeback support, update the function to wake up all wbs and distribute the number of pages to write according to the proportion of each wb's write bandwidth, which is implemented in wb_split_bdi_pages(). Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c | 48 ++-- 1 file changed, 46 insertions(+), 2 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index c9bda4d..75d5e5c 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -196,6 +196,41 @@ int mapping_congested(struct address_space *mapping, } EXPORT_SYMBOL_GPL(mapping_congested); +/** + * wb_split_bdi_pages - split nr_pages to write according to bandwidth + * @wb: target bdi_writeback to split @nr_pages to + * @nr_pages: number of pages to write for the whole bdi + * + * Split @wb's portion of @nr_pages according to @wb's write bandwidth in + * relation to the total write bandwidth of all wb's w/ dirty inodes on + * @wb->bdi. + */ +static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) +{ + unsigned long this_bw = wb->avg_write_bandwidth; + unsigned long tot_bw = atomic_long_read(>bdi->tot_write_bandwidth); + + if (nr_pages == LONG_MAX) + return LONG_MAX; + + /* +* This may be called on clean wb's and proportional distribution +* may not make sense, just use the original @nr_pages in those +* cases. In general, we wanna err on the side of writing more. 
+*/ + if (!tot_bw || this_bw >= tot_bw) + return nr_pages; + else + return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw); +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) +{ + return nr_pages; +} + #endif /* CONFIG_CGROUP_WRITEBACK */ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, @@ -1179,8 +1214,17 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason) nr_pages = get_nr_dirty_pages(); rcu_read_lock(); - list_for_each_entry_rcu(bdi, _list, bdi_list) - wb_start_writeback(>wb, nr_pages, false, reason); + list_for_each_entry_rcu(bdi, _list, bdi_list) { + struct bdi_writeback *wb; + struct wb_iter iter; + + if (!bdi_has_dirty_io(bdi)) + continue; + + bdi_for_each_wb(wb, bdi, , 0) + wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages), + false, reason); + } rcu_read_unlock(); } -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
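To make the arithmetic in wb_split_bdi_pages() concrete, here is a userspace sketch of the same calculation. It is a model under assumptions: split_pages() and div_round_up_u64() are hypothetical names, and the two bandwidth values are passed in directly instead of being read from a bdi_writeback and its bdi.

```c
#include <limits.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's DIV_ROUND_UP_ULL(): ceil(n / d). */
static uint64_t div_round_up_u64(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Hypothetical model of wb_split_bdi_pages(): returns this wb's share of
 * nr_pages in proportion to this_bw / tot_bw, rounded up.
 */
static long split_pages(long nr_pages, unsigned long this_bw,
			unsigned long tot_bw)
{
	/* LONG_MAX means "write everything" and is never scaled down. */
	if (nr_pages == LONG_MAX)
		return LONG_MAX;

	/* No usable proportion: err on the side of writing more. */
	if (!tot_bw || this_bw >= tot_bw)
		return nr_pages;

	return (long)div_round_up_u64((uint64_t)nr_pages * this_bw, tot_bw);
}
```

Because each share is rounded up, the shares can add up to slightly more than nr_pages (three equal wbs splitting 10 pages get 4 each), which is consistent with the comment about preferring to write more rather than less.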
bpf+tracing next steps. Was: [PATCH v9 tip 3/9] tracing: attach BPF programs to kprobes
On 3/22/15 7:17 PM, Masami Hiramatsu wrote:
> (2015/03/23 3:03), Alexei Starovoitov wrote:
>
>> User space tools that will compile ktap/dtrace scripts into bpf might
>> use build-id for their own purpose, but that's a different discussion.
>
> Agreed.
> I'd like to discuss it since kprobe event interface may also have same
> issue.

I'm not sure what 'issue' you're seeing. My understanding is that
build-ids are used by perf to associate binaries with their debug info
and by systemtap to make sure that probes actually match the kernel
they were compiled for. In bpf case it probably will be perf way only.

Are you interested in doing something with bpf ? ;)

I know that Jovi is working on clang-based front-end, He Kuang is doing
something fancy and I'm going to focus on 'tcp instrumentation' once
bpf+kprobes is in. I think these efforts will help us make it concrete
and will establish a path towards bpf+tracepoints (debug tracepoints or
trace markers) and eventual integration with perf.

Here is the wish-list (for kernel and userspace) inspired by Brendan:
- access to pid, uid, tid, comm, etc
- access to kernel stack trace
- access to user-level stack trace
- kernel debuginfo for walking kernel structs, and accessing kprobe
  entry args as variables
- tracing of uprobes
- tracing of user markers
- user debuginfo for user structs and args
- easy to use language
- library of scripting features
- nice one-liner syntax

I think there is a lot of interest in bpf+tracing and would be good to
align the efforts.
[PATCH 48/48] ext2: enable cgroup writeback support
Writeback now supports cgroup writeback and the generic writeback,
buffer, libfs, and mpage helpers that ext2 uses are all updated to work
with cgroup writeback.

This patch enables cgroup writeback for ext2 by adding
FS_CGROUP_WRITEBACK to its ->fs_flags.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: linux-e...@vger.kernel.org
---
 fs/ext2/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index d0e746e..549219d 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1543,7 +1543,7 @@ static struct file_system_type ext2_fs_type = {
 	.name		= "ext2",
 	.mount		= ext2_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
 };
 MODULE_ALIAS_FS("ext2");
 
-- 
2.1.0
[PATCH 37/48] writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info
writeback_in_progress() currently takes @bdi and returns whether writeback is in progress on its root wb (bdi_writeback). In preparation for cgroup writeback support, make it take wb instead. While at it, make it an inline function. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo Cc: Jens Axboe Cc: Jan Kara --- fs/fs-writeback.c | 15 +-- include/linux/backing-dev.h | 12 +++- mm/page-writeback.c | 4 ++-- 3 files changed, 14 insertions(+), 17 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index c24d6fd..c342c05 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -53,19 +53,6 @@ struct wb_writeback_work { struct completion *done;/* set if the caller waits */ }; -/** - * writeback_in_progress - determine whether there is writeback in progress - * @bdi: the device's backing_dev_info structure. - * - * Determine whether there is writeback waiting to be handled against a - * backing device. - */ -int writeback_in_progress(struct backing_dev_info *bdi) -{ - return test_bit(WB_writeback_running, >wb.state); -} -EXPORT_SYMBOL(writeback_in_progress); - static inline struct inode *wb_inode(struct list_head *head) { return list_entry(head, struct inode, i_wb_list); @@ -1465,7 +1452,7 @@ int try_to_writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, enum wb_reason reason) { - if (writeback_in_progress(sb->s_bdi)) + if (writeback_in_progress(>s_bdi->wb)) return 1; if (!down_read_trylock(>s_umount)) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 81e39ff..216a016 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -156,7 +156,17 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); extern struct backing_dev_info noop_backing_dev_info; -int writeback_in_progress(struct backing_dev_info *bdi); +/** + * writeback_in_progress - determine whether there is writeback in progress + * @wb: bdi_writeback of interest + * + * Determine whether there 
is writeback waiting to be handled against a + * bdi_writeback. + */ +static inline bool writeback_in_progress(struct bdi_writeback *wb) +{ + return test_bit(WB_writeback_running, >state); +} static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) { diff --git a/mm/page-writeback.c b/mm/page-writeback.c index fa37e73..3e698f4 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1455,7 +1455,7 @@ static void balance_dirty_pages(struct address_space *mapping, break; } - if (unlikely(!writeback_in_progress(bdi))) + if (unlikely(!writeback_in_progress(wb))) bdi_start_background_writeback(bdi); if (!strictlimit) @@ -1573,7 +1573,7 @@ pause: if (!dirty_exceeded && wb->dirty_exceeded) wb->dirty_exceeded = 0; - if (writeback_in_progress(bdi)) + if (writeback_in_progress(wb)) return; /* -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 43/48] writeback: restructure try_writeback_inodes_sb[_nr]()
try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it
handles s_umount locking and skips if writeback is already in
progress.  The in progress test is performed on the root wb
(bdi_writeback) which isn't sufficient for cgroup writeback support.
The test must be done per-wb.

To prepare for the change, this patch factors out
__writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds
@skip_if_busy and moves the in progress test right before queueing the
wb_writeback_work.  try_writeback_inodes_sb_nr() now just grabs
s_umount and invokes __writeback_inodes_sb_nr() with asserted
@skip_if_busy.  This way, later addition of multiple wb handling can
skip only the wb's which already have writeback in progress.

This swaps the order between in progress test and s_umount test which
can flip the return value when writeback is in progress and s_umount
is being held by someone else but this shouldn't cause any meaningful
difference.  It's a fringe condition and the return value is an
unsynchronized hint anyway.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 fs/fs-writeback.c         | 52 ++-
 include/linux/writeback.h |  6 +++---
 2 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f565635..be374ae 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1510,19 +1510,8 @@ static void wait_sb_inodes(struct super_block *sb)
 	iput(old_inode);
 }
 
-/**
- * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
- * @sb: the superblock
- * @nr: the number of pages to write
- * @reason: reason why some writeback work initiated
- *
- * Start writeback on some inodes on this super_block. No guarantees are made
- * on how many (if any) will be written, and this function does not wait
- * for IO completion of submitted IO.
- */
-void writeback_inodes_sb_nr(struct super_block *sb,
-			    unsigned long nr,
-			    enum wb_reason reason)
+static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+				     enum wb_reason reason, bool skip_if_busy)
 {
 	DEFINE_WB_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
@@ -1538,9 +1527,30 @@ void writeback_inodes_sb_nr(struct super_block *sb,
 	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
+
+	if (skip_if_busy && writeback_in_progress(&bdi->wb))
+		return;
+
 	wb_queue_work(&bdi->wb, &work);
 	wb_wait_for_completion(bdi, &done);
 }
+
+/**
+ * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
+ * @sb: the superblock
+ * @nr: the number of pages to write
+ * @reason: reason why some writeback work initiated
+ *
+ * Start writeback on some inodes on this super_block. No guarantees are made
+ * on how many (if any) will be written, and this function does not wait
+ * for IO completion of submitted IO.
+ */
+void writeback_inodes_sb_nr(struct super_block *sb,
+			    unsigned long nr,
+			    enum wb_reason reason)
+{
+	__writeback_inodes_sb_nr(sb, nr, reason, false);
+}
 EXPORT_SYMBOL(writeback_inodes_sb_nr);
 
 /**
@@ -1567,19 +1577,15 @@ EXPORT_SYMBOL(writeback_inodes_sb);
  * Invoke writeback_inodes_sb_nr if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
  */
-int try_to_writeback_inodes_sb_nr(struct super_block *sb,
-				  unsigned long nr,
-				  enum wb_reason reason)
+bool try_to_writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+				   enum wb_reason reason)
 {
-	if (writeback_in_progress(&sb->s_bdi->wb))
-		return 1;
-
 	if (!down_read_trylock(&sb->s_umount))
-		return 0;
+		return false;
 
-	writeback_inodes_sb_nr(sb, nr, reason);
+	__writeback_inodes_sb_nr(sb, nr, reason, true);
 	up_read(&sb->s_umount);
-	return 1;
+	return true;
 }
 EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
 
@@ -1591,7 +1597,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
  * Implement by try_to_writeback_inodes_sb_nr()
  * Returns 1 if writeback was started, 0 if not.
  */
-int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
+bool try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
 {
 	return try_to_writeback_inodes_sb_nr(sb, get_nr_dirty_pages(), reason);
 }
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8e4485f..75349bb 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -93,9 +93,9 @@ struct bdi_writeback;
 void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
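
The return-value flip mentioned in the commit message can be reproduced with a small userspace model. This is only a sketch under stated assumptions: `struct sb_model`, `model_writeback()`, `try_old()` and `try_new()` are made-up names, and plain booleans stand in for the s_umount trylock and the WB_writeback_running bit. Nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a super_block plus its root wb. */
struct sb_model {
	bool umount_contended;	/* models down_read_trylock() failing */
	bool wb_running;	/* models WB_writeback_running being set */
	int queued;		/* number of work items queued so far */
};

/* Models the factored-out helper: the busy test sits right before queueing. */
static void model_writeback(struct sb_model *sb, bool skip_if_busy)
{
	assert(sb != NULL);
	if (skip_if_busy && sb->wb_running)
		return;
	sb->queued++;
}

/* Old ordering: in-progress test first, then the s_umount trylock. */
static bool try_old(struct sb_model *sb)
{
	if (sb->wb_running)
		return true;
	if (sb->umount_contended)
		return false;
	model_writeback(sb, false);
	return true;
}

/* New ordering: trylock first, busy test deferred to the helper. */
static bool try_new(struct sb_model *sb)
{
	if (sb->umount_contended)
		return false;
	model_writeback(sb, true);
	return true;
}
```

With both `wb_running` and `umount_contended` set, `try_old()` reports true and `try_new()` reports false, and neither queues any work. In every other combination the two orderings agree in this model.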
[PATCH 46/48] buffer, writeback: make __block_write_full_page() honor cgroup writeback
[__]block_write_full_page() is used to implement ->writepage in
various filesystems.  All writeback logic is now updated to handle
cgroup writeback and the block cgroup to issue IOs for is encoded in
writeback_control and can be retrieved from the inode; however,
[__]block_write_full_page() currently ignores the blkcg indicated by
inode and issues all bio's without explicit blkcg association.

This patch adds submit_bh_blkcg() which associates the bio with the
specified blkio cgroup before issuing and uses it in
__block_write_full_page() so that the issued bio's are associated with
inode_to_writeback_blkcg_css(inode).

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Andrew Morton
---
 fs/buffer.c                 | 26 --
 include/linux/backing-dev.h | 12 
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4aa1dc2..f2d594c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -44,6 +45,9 @@
 #include
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
+static int submit_bh_blkcg(int rw, struct buffer_head *bh,
+			   unsigned long bio_flags,
+			   struct cgroup_subsys_state *blkcg_css);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1704,8 +1708,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	struct buffer_head *bh, *head;
 	unsigned int blocksize, bbits;
 	int nr_underway = 0;
-	int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
-			WRITE_SYNC : WRITE);
+	int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+	struct cgroup_subsys_state *blkcg_css = inode_to_wb_blkcg_css(inode);
 
 	head = create_page_buffers(page, inode,
 					(1 << BH_Dirty)|(1 << BH_Uptodate));
@@ -1794,7 +1798,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh(write_op, bh);
+			submit_bh_blkcg(write_op, bh, 0, blkcg_css);
 			nr_underway++;
 		}
 		bh = next;
@@ -1848,7 +1852,7 @@ recover:
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh(write_op, bh);
+			submit_bh_blkcg(write_op, bh, 0, blkcg_css);
 			nr_underway++;
 		}
 		bh = next;
@@ -3017,7 +3021,9 @@ void guard_bio_eod(int rw, struct bio *bio)
 	}
 }
 
-int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+static int submit_bh_blkcg(int rw, struct buffer_head *bh,
+			   unsigned long bio_flags,
+			   struct cgroup_subsys_state *blkcg_css)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -3040,6 +3046,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 	 */
 	bio = bio_alloc(GFP_NOIO, 1);
 
+	if (blkcg_css)
+		bio_associate_blkcg(bio, blkcg_css);
+
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
 	bio->bi_io_vec[0].bv_page = bh->b_page;
@@ -3070,11 +3079,16 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 	bio_put(bio);
 	return ret;
 }
+
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+{
+	return submit_bh_blkcg(rw, bh, bio_flags, NULL);
+}
 EXPORT_SYMBOL_GPL(_submit_bh);
 
 int submit_bh(int rw, struct buffer_head *bh)
 {
-	return _submit_bh(rw, bh, 0);
+	return submit_bh_blkcg(rw, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index fee39cd..a9a843c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -394,6 +394,12 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 	return inode->i_wb;
 }
 
+static inline struct cgroup_subsys_state *
+inode_to_wb_blkcg_css(struct inode *inode)
+{
+	return inode_to_wb(inode)->blkcg_css;
+}
+
 struct wb_iter {
 	int start_blkcg_id;
 	struct radix_tree_iter tree_iter;
@@ -511,6 +517,12 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 {
 }
 
+static inline struct cgroup_subsys_state *
+inode_to_wb_blkcg_css(struct inode *inode)
+{
+	return blkcg_root_css;
+}
+
 struct wb_iter {
 	int next_id;
 };
-- 
2.1.0
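
The shape of this change (one internal worker that takes an optional association, with the existing entry points reduced to thin wrappers that pass NULL) can be sketched outside the kernel as follows. The types and function names below (`css_model`, `bio_model`, `submit_with_css()`, `submit_plain()`) are hypothetical and only model the pattern; they are not the block layer API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a cgroup state and a bio. */
struct css_model { int id; };
struct bio_model {
	const struct css_model *css;	/* NULL means "no explicit association" */
	int submitted;
};

/* Internal worker: associates only when the caller supplies a css. */
static int submit_with_css(struct bio_model *bio, const struct css_model *css)
{
	assert(bio != NULL);
	if (css)
		bio->css = css;		/* models the association step */
	bio->submitted = 1;
	return 0;
}

/* Existing entry point keeps its signature and old behavior. */
static int submit_plain(struct bio_model *bio)
{
	return submit_with_css(bio, NULL);
}
```

Callers that know nothing about cgroups keep calling the plain wrapper and see no change, while a writeback-aware caller passes the css it looked up from the inode.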
[PATCH 42/48] writeback: implement wb_wait_for_single_work()
For cgroup writeback, multiple wb_writeback_work items may need to be
issued to accomplish a single task.  The previous patch updated the
waiting mechanism such that wb_wait_for_completion() can wait for
multiple work items.

Issuing multiple work items involves memory allocation which may fail.
As most writeback operations can't fail or block on memory
allocation, in such cases, we'll fall back to sequential issuing of an
on-stack work item, which would need to be waited upon sequentially.

This patch implements wb_wait_for_single_work() which waits for a
single work item independently from wb_completion waiting so that such
fallback mechanism can be used without getting tangled with the usual
issuing / completion operation.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 fs/fs-writeback.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 944e53d..f565635 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -52,6 +52,8 @@ struct wb_writeback_work {
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	unsigned int auto_free:1;	/* free on completion */
+	unsigned int single_wait:1;
+	unsigned int single_done:1;
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -165,8 +167,11 @@ static void wb_queue_work(struct bdi_writeback *wb,
 	trace_writeback_queue(wb->bdi, work);
 
 	spin_lock_bh(&wb->work_lock);
-	if (!test_bit(WB_registered, &wb->state))
+	if (!test_bit(WB_registered, &wb->state)) {
+		if (work->single_wait)
+			work->single_done = 1;
 		goto out_unlock;
+	}
 	if (work->done)
 		atomic_inc(&work->done->cnt);
 	list_add_tail(&work->list, &wb->work_list);
@@ -231,6 +236,32 @@ int mapping_congested(struct address_space *mapping,
 EXPORT_SYMBOL_GPL(mapping_congested);
 
 /**
+ * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work
+ * @bdi: bdi the work item was issued to
+ * @work: work item to wait for
+ *
+ * Wait for the completion of @work which was issued to one of @bdi's
+ * bdi_writeback's.  The caller must have set @work->single_wait before
+ * issuing it.  This wait operates independently fo
+ * wb_wait_for_completion() and also disables automatic freeing of @work.
+ */
+static void wb_wait_for_single_work(struct backing_dev_info *bdi,
+				    struct wb_writeback_work *work)
+{
+	if (WARN_ON_ONCE(!work->single_wait))
+		return;
+
+	wait_event(bdi->wb_waitq, work->single_done);
+
+	/*
+	 * Paired with smp_wmb() in wb_do_writeback() and ensures that all
+	 * modifications to @work prior to assertion of ->single_done is
+	 * visible to the caller once this function returns.
+	 */
+	smp_rmb();
+}
+
+/**
  * wb_split_bdi_pages - split nr_pages to write according to bandwidth
  * @wb: target bdi_writeback to split @nr_pages to
  * @nr_pages: number of pages to write for the whole bdi
@@ -1169,14 +1200,26 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 	set_bit(WB_writeback_running, &wb->state);
 	while ((work = get_next_work_item(wb)) != NULL) {
 		struct wb_completion *done = work->done;
+		bool need_wake_up = false;
 
 		trace_writeback_exec(wb->bdi, work);
 
 		wrote += wb_writeback(wb, work);
 
-		if (work->auto_free)
+		if (work->single_wait) {
+			WARN_ON_ONCE(work->auto_free);
+			/* paired w/ rmb in wb_wait_for_single_work() */
+			smp_wmb();
+			work->single_done = 1;
+			need_wake_up = true;
+		} else if (work->auto_free) {
 			kfree(work);
+		}
+
 		if (done && atomic_dec_and_test(&done->cnt))
+			need_wake_up = true;
+
+		if (need_wake_up)
 			wake_up_all(&wb->bdi->wb_waitq);
 	}
 
-- 
2.1.0
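
The "write the payload, then raise the flag" ordering that the smp_wmb()/smp_rmb() pair enforces has a rough userspace analogue in C11 release/acquire atomics. The sketch below swaps in that mechanism for illustration. It is not the kernel's barrier API, it polls instead of sleeping on a wait queue, and `struct work_model`, `work_complete()` and `work_poll()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical analogue of a work item with a single-waiter flag. */
struct work_model {
	long pages_written;		/* payload filled in by the worker */
	atomic_bool single_done;	/* raised last, after the payload */
};

/* Worker side: publish the payload, then set the flag with release order. */
static void work_complete(struct work_model *w, long wrote)
{
	w->pages_written = wrote;
	atomic_store_explicit(&w->single_done, true, memory_order_release);
}

/*
 * Waiter side: once the flag is observed with acquire order, every write
 * made before the release store is visible, so the payload is safe to read.
 */
static bool work_poll(struct work_model *w, long *wrote)
{
	assert(wrote != NULL);
	if (!atomic_load_explicit(&w->single_done, memory_order_acquire))
		return false;
	*wrote = w->pages_written;
	return true;
}
```

A waiter that sees `work_poll()` return true may read the payload without further synchronization, which mirrors the guarantee described in the comment above smp_rmb() in the patch.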
[PATCH 23/48] writeback: attribute stats to the matching per-cgroup bdi_writeback
Until now, all WB_* stats were accounted against the root wb
(bdi_writeback).  Now that multiple wb (bdi_writeback) support is in
place, let's attribute the stats to the respective per-cgroup wb's.

As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
visible behavior differences.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 mm/filemap.c        |  2 +-
 mm/page-writeback.c | 22 ++
 mm/truncate.c       |  6 --
 3 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a2b098b..64698fa 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -215,7 +215,7 @@ void __delete_from_page_cache(struct page *page, void *shadow,
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_RECLAIMABLE);
+		dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE);
 	}
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 10624e3..d5635cf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2175,10 +2175,13 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 void account_page_redirty(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+
 	if (mapping && mapping_cap_account_dirty(mapping)) {
+		struct bdi_writeback *wb = inode_to_wb(mapping->host);
+
 		current->nr_dirtied--;
 		dec_zone_page_state(page, NR_DIRTIED);
-		dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_DIRTIED);
+		dec_wb_stat(wb, WB_DIRTIED);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2324,8 +2327,7 @@ int clear_page_dirty_for_io(struct page *page)
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-			dec_wb_stat(&inode_to_bdi(mapping->host)->wb,
-				    WB_RECLAIMABLE);
+			dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE);
 			ret = 1;
 		}
 		mem_cgroup_end_page_stat(memcg);
@@ -2343,7 +2345,8 @@ int test_clear_page_writeback(struct page *page)
 
 	memcg = mem_cgroup_begin_page_stat(page);
 	if (mapping) {
-		struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+		struct inode *inode = mapping->host;
+		struct backing_dev_info *bdi = inode_to_bdi(inode);
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -2353,8 +2356,10 @@ int test_clear_page_writeback(struct page *page)
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi)) {
-				__dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-				__wb_writeout_inc(&bdi->wb);
+				struct bdi_writeback *wb = inode_to_wb(inode);
+
+				__dec_wb_stat(wb, WB_WRITEBACK);
+				__wb_writeout_inc(wb);
 			}
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2378,7 +2383,8 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 
 	memcg = mem_cgroup_begin_page_stat(page);
 	if (mapping) {
-		struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+		struct inode *inode = mapping->host;
+		struct backing_dev_info *bdi = inode_to_bdi(inode);
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -2388,7 +2394,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
-				__inc_wb_stat(&bdi->wb, WB_WRITEBACK);
+				__inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index df16f8c..fe2d769 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -113,11 +113,13 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	memcg = mem_cgroup_begin_page_stat(page);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
+
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			struct bdi_writeback *wb
[PATCH 18/48] writeback: add @gfp to wb_init()
wb_init() currently always uses GFP_KERNEL but the planned cgroup
writeback support needs to use other allocation masks.  Add @gfp to
wb_init().

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
---
 mm/backing-dev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b0707d1..805b287 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -291,7 +291,8 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
  */
 #define INIT_BW		(100 << (20 - PAGE_SHIFT))
 
-static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
+		   gfp_t gfp)
 {
 	int i, err;
 
@@ -315,12 +316,12 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->work_list);
 	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
 
-	err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
+	err = fprop_local_init_percpu(&wb->completions, gfp);
 	if (err)
 		return err;
 
 	for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
-		err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
+		err = percpu_counter_init(&wb->stat[i], 0, gfp);
 		if (err) {
 			while (--i)
 				percpu_counter_destroy(&wb->stat[i]);
@@ -378,7 +379,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
 
-	err = wb_init(&bdi->wb, bdi);
+	err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
 	if (err)
 		return err;
 
-- 
2.1.0
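
The idea of letting the caller choose the allocation context, and having the init routine merely pass it down, can be sketched in userspace as below. Everything here is hypothetical: `alloc_flags_t` and the `ALLOC_*` values stand in for gfp_t, `counter_alloc()` stands in for the per-counter allocator, and the `fail_at` parameter exists only so the unwind path can be exercised. The unwind loop uses `while (i--)` so that slot 0 is released too; that is a choice of this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_STAT_ITEMS 4

/* Hypothetical allocation-context flag, standing in for gfp_t. */
typedef unsigned int alloc_flags_t;
#define ALLOC_MAY_SLEEP	0x1u
#define ALLOC_ATOMIC	0x2u

struct wb_model {
	long *stat[NR_STAT_ITEMS];
};

/* Stand-in allocator: fails when slot == fail_at (test hook only). */
static long *counter_alloc(alloc_flags_t flags, int slot, int fail_at)
{
	(void)flags;	/* a real allocator would pick its strategy from this */
	if (slot == fail_at)
		return NULL;
	return calloc(1, sizeof(long));
}

/* The caller decides the allocation context; init only threads it through. */
static int wb_model_init(struct wb_model *wb, alloc_flags_t flags, int fail_at)
{
	int i;

	assert(wb != NULL);
	for (i = 0; i < NR_STAT_ITEMS; i++) {
		wb->stat[i] = counter_alloc(flags, i, fail_at);
		if (!wb->stat[i]) {
			/* unwind slots i-1 down to 0 */
			while (i--) {
				free(wb->stat[i]);
				wb->stat[i] = NULL;
			}
			return -1;
		}
	}
	return 0;
}

/* Release everything a successful init allocated. */
static void wb_model_exit(struct wb_model *wb)
{
	int i;

	for (i = 0; i < NR_STAT_ITEMS; i++) {
		free(wb->stat[i]);
		wb->stat[i] = NULL;
	}
}
```

A top-level caller would pass the sleeping flag, as bdi_init() passes GFP_KERNEL in the patch, while a caller in a restricted context could pass something stricter without the init routine needing to know why.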
[PATCH 11/48] writeback: move backing_dev_info->bdi_stat[] into bdi_writeback
Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->bdi_stat[] into wb.

* enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
  enums is changed from BDI_ to WB_.

* BDI_STAT_BATCH() -> WB_STAT_BATCH()

* [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)

* bdi_stat[_error]() -> wb_stat[_error]()

* bdi_writeout_inc() -> wb_writeout_inc()

* stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
  frees stat.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
  introducing no behavior changes.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Miklos Szeredi
Cc: Trond Myklebust
---
 fs/fs-writeback.c           |  2 +-
 fs/fuse/file.c              | 12 
 fs/nfs/internal.h           |  2 +-
 fs/nfs/write.c              |  3 +-
 include/linux/backing-dev.h | 68 +
 mm/backing-dev.c            | 60 ---
 mm/filemap.c                |  2 +-
 mm/page-writeback.c         | 53 +--
 mm/truncate.c               |  4 +--
 9 files changed, 108 insertions(+), 98 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fea13fe..992a065 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -822,7 +822,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 	    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
 		return true;
 
-	if (bdi_stat(bdi, BDI_RECLAIMABLE) >
+	if (wb_stat(&bdi->wb, WB_RECLAIMABLE) >
 				bdi_dirty_limit(bdi, background_thresh))
 		return true;
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c01ec3b..997c88a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1469,9 +1469,9 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	for (i = 0; i < req->num_pages; i++) {
-		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-		bdi_writeout_inc(bdi);
+		wb_writeout_inc(&bdi->wb);
 	}
 	wake_up(&fi->page_waitq);
 }
@@ -1658,7 +1658,7 @@ static int fuse_writepage_locked(struct page *page)
 	req->end = fuse_writepage_end;
 	req->inode = inode;
 
-	inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
+	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	spin_lock(&fc->lock);
@@ -1773,9 +1773,9 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 		copy_highpage(old_req->pages[0], page);
 		spin_unlock(&fc->lock);
 
-		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK_TEMP);
-		bdi_writeout_inc(bdi);
+		wb_writeout_inc(&bdi->wb);
 		fuse_writepage_free(fc, new_req);
 		fuse_request_free(new_req);
 		goto out;
@@ -1872,7 +1872,7 @@ static int fuse_writepages_fill(struct page *page,
 	req->page_descs[req->num_pages].offset = 0;
 	req->page_descs[req->num_pages].length = PAGE_SIZE;
 
-	inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
+	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	err = 0;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9e6475b..7e3c460 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -607,7 +607,7 @@ void nfs_mark_page_unstable(struct page *page)
 	struct inode *inode = page_file_mapping(page)->host;
 
 	inc_zone_page_state(page, NR_UNSTABLE_NFS);
-	inc_bdi_stat(inode_to_bdi(inode), BDI_RECLAIMABLE);
+	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 849ed78..8bbeafe 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -853,7 +853,8 @@ static void
 nfs_clear_page_commit(struct page *page)
 {
 	dec_zone_page_state(page, NR_UNSTABLE_NFS);
-	dec_bdi_stat(inode_to_bdi(page_file_mapping(page)->host), BDI_RECLAIMABLE);
+	dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
+		    WB_RECLAIMABLE);
 }
 
 /* Called holding inode (/cinfo) lock */
diff --git a/include/linux/backing-dev.h
[PATCHSET 1/3 v2 block/for-4.1/core] writeback: cgroup writeback support
ne.patch
 0018-writeback-add-gfp-to-wb_init.patch
 0019-bdi-separate-out-congested-state-into-a-separate-str.patch
 0020-writeback-add-CONFIG-BDI_CAP-FS-_CGROUP_WRITEBACK.patch
 0021-writeback-make-backing_dev_info-host-cgroup-specific.patch
 0022-writeback-blkcg-associate-each-blkcg_gq-with-the-cor.patch
 0023-writeback-attribute-stats-to-the-matching-per-cgroup.patch
 0024-writeback-let-balance_dirty_pages-work-on-the-matchi.patch
 0025-writeback-make-congestion-functions-per-bdi_writebac.patch
 0026-writeback-blkcg-restructure-blk_-set-clear-_queue_co.patch
 0027-writeback-blkcg-propagate-non-root-blkcg-congestion-.patch
 0028-writeback-implement-and-use-mapping_congested.patch
 0029-writeback-implement-WB_has_dirty_io-wb_state-flag.patch
 0030-writeback-implement-backing_dev_info-tot_write_bandw.patch
 0031-writeback-make-bdi_has_dirty_io-take-multiple-bdi_wr.patch
 0032-writeback-don-t-issue-wb_writeback_work-if-clean.patch
 0033-writeback-make-bdi-min-max_ratio-handling-cgroup-wri.patch
 0034-writeback-implement-bdi_for_each_wb.patch
 0035-writeback-remove-bdi_start_writeback.patch
 0036-writeback-make-laptop_mode_timer_fn-handle-multiple-.patch
 0037-writeback-make-writeback_in_progress-take-bdi_writeb.patch
 0038-writeback-make-bdi_start_background_writeback-take-b.patch
 0039-writeback-make-wakeup_flusher_threads-handle-multipl.patch
 0040-writeback-add-wb_writeback_work-auto_free.patch
 0041-writeback-implement-bdi_wait_for_completion.patch
 0042-writeback-implement-wb_wait_for_single_work.patch
 0043-writeback-restructure-try_writeback_inodes_sb-_nr.patch
 0044-writeback-make-writeback-initiation-functions-handle.patch
 0045-writeback-dirty-inodes-against-their-matching-cgroup.patch
 0046-buffer-writeback-make-__block_write_full_page-honor-.patch
 0047-mpage-make-__mpage_writepage-honor-cgroup-writeback.patch
 0048-ext2-enable-cgroup-writeback-support.patch

0001-0019 are preps.

0020-0045 gradually convert writeback code so that wb (bdi_writeback)
operates as an independent writeback domain instead of bdi
(backing_dev_info), a single bdi can have multiple per-cgroup wb's
working for it, and per-bdi operations are translated and distributed
to all its member wb's.

0046-0048 make lower layers properly propagate the cgroup
association from the writeback layer and enable cgroup writeback on
ext2.

This patchset is on top of

  block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
+ [1] [PATCH] writeback: fix possible underflow in write bandwidth calculation

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-20150322

diffstat follows.  Thanks.

 Documentation/cgroups/memory.txt |    1
 block/bio.c                      |   35 +-
 block/blk-cgroup.c               |   28 +
 block/blk-cgroup.h               |  603 ---
 block/blk-core.c                 |   70 ++--
 block/blk-integrity.c            |    1
 block/blk-sysfs.c                |    3
 block/blk-throttle.c             |    2
 block/bounce.c                   |    1
 block/cfq-iosched.c              |    2
 block/elevator.c                 |    2
 block/genhd.c                    |    1
 drivers/block/drbd/drbd_int.h    |    1
 drivers/block/drbd/drbd_main.c   |   10
 drivers/block/pktcdvd.c          |    1
 drivers/char/raw.c               |    1
 drivers/md/bcache/request.c      |    1
 drivers/md/dm.c                  |    2
 drivers/md/dm.h                  |    1
 drivers/md/md.h                  |    1
 drivers/md/raid1.c               |    4
 drivers/md/raid10.c              |    2
 drivers/mtd/devices/block2mtd.c  |    1
 fs/block_dev.c                   |    9
 fs/buffer.c                      |   60 ++-
 fs/ext2/super.c                  |    2
 fs/ext4/extents.c                |    1
 fs/ext4/mballoc.c                |    1
 fs/ext4/super.c                  |    1
 fs/f2fs/node.c                   |    4
 fs/f2fs/segment.h                |    3
 fs/fs-writeback.c                |  609 +--
 fs/fuse/file.c                   |   12
 fs/gfs2/super.c                  |    2
 fs/hfs/super.c                   |    1
 fs/hfsplus/super.c               |    1
 fs/inode.c                       |    1
 fs/mpage.c                       |    2
 fs/nfs/filelayout/filelayout.c   |    1
 fs/nfs/internal.h                |    2
 fs/nfs/write.c                   |    3
 fs/ocfs2/file.c                  |    1
 fs/reiserfs/super.c              |    1
 fs/ufs/super.c                   |    1
 fs/xfs/xfs_aops.c                |   12
 fs/xfs/xfs_file.c                |    1
 include/linux/backing-dev-defs.h |  188 ++
 include/linux/backing-dev.h      |  572 -
 include/linux/bio.h              |    3
 include/linux/blk-cgroup.h       |  631 +++
[PATCH 04/48] memcg: add mem_cgroup_root_css
Add global mem_cgroup_root_css which points to the root memcg css.
This will be used by cgroup writeback support.  If memcg is disabled,
it's defined as ERR_PTR(-EINVAL).

Signed-off-by: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
---
 include/linux/memcontrol.h | 4 
 mm/memcontrol.c            | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5fe6411..294498f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -68,6 +68,8 @@ enum mem_cgroup_events_index {
 };
 
 #ifdef CONFIG_MEMCG
+extern struct cgroup_subsys_state *mem_cgroup_root_css;
+
 void mem_cgroup_events(struct mem_cgroup *memcg,
 		       enum mem_cgroup_events_index idx,
 		       unsigned int nr);
@@ -196,6 +198,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
+#define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
+
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
 				     enum mem_cgroup_events_index idx,
 				     unsigned int nr)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf25d1a..fda7025 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -71,6 +71,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
 static struct mem_cgroup *root_mem_cgroup __read_mostly;
+struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
@@ -4551,6 +4552,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
+		mem_cgroup_root_css = &memcg->css;
 		page_counter_init(&memcg->memory, NULL);
 		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
-- 
2.1.0
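
Defining the disabled-config symbol as ERR_PTR(-EINVAL) relies on the convention of encoding a small negative errno in the top of the pointer range, so that any consumer which checks the pointer sees an error instead of dereferencing a bogus address. A minimal userspace sketch of that convention follows. The `model_*` helpers and `struct css_model` are hypothetical names that imitate the idea; they are not the kernel's err.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True when the pointer lies in the top MODEL_MAX_ERRNO bytes of the range. */
static inline bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct css_model { int id; };

/* Models the disabled-config definition: a placeholder never to be dereferenced. */
#define model_root_css ((struct css_model *)model_err_ptr(-EINVAL))

/* A consumer that checks for the encoded error before touching the object. */
static int css_id_or_error(const struct css_model *css)
{
	assert(css != NULL);
	if (model_is_err(css))
		return (int)model_ptr_err(css);
	return css->id;
}
```

The point of the placeholder is that code built without the feature still compiles against the same name, while any path that actually tries to use the value can detect that it is not a real object.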
Re: [PATCH v2 0/4] pci: fix unhandled interrupt on shutdown
On Thu, 03/19 19:57, Michael S. Tsirkin wrote:
> Fam Zheng noticed that pci shutdown disables msi and msix of a device while
> device is still active. This was intended to fix kexec with fusion devices but
> had the unintended effect of breaking even regular shutdown when using virtio.

Series:

Reviewed-by: Fam Zheng

>
> The same problem would affect any driver which doesn't register
> a level interrupt handler when using msix.
>
> I think the fix is to avoid touching device on shutdown:
> we clear bus master anyway, so we won't get any more
> msi interrupts, and bus reset will clear the msi/msix
> state eventually anyway.
>
> The patches seems to all work well for me. Given they affect all pci devices,
> and the bug has been there since 2.6 times, I think there's no rush: we can
> merge them for 4.1.
>
> At the same time, once merged, they will likely make a good
> stable candidate.
>
> Michael S. Tsirkin (4):
>   pci: disable msi/msix at probe time
>   pci: don't disable msi/msix at shutdown
>   pci: make msi/msix shutdown functions static
>   virtio_pci: drop msi_off on probe
>
>  include/linux/pci.h                | 4 
>  drivers/pci/msi.c                  | 4 ++--
>  drivers/pci/pci-driver.c           | 8 ++--
>  drivers/virtio/virtio_pci_common.c | 3 ---
>  4 files changed, 8 insertions(+), 11 deletions(-)
>
> --
> MST
>
Re: [PATCH 1/3] thermal: hisilicon: add new hisilicon thermal sensor driver
On Fri, Mar 20, 2015 at 03:55:27PM +, Mark Rutland wrote:
> > > That may be the case in the code as it stands today, but per the binding
> > > the trip points are the temperatures at which an action is to be taken.
> > >
> > > The thermal-zone has polling-delay and polling-delay-passive, but
> > > there's no reason you couldn't also use the interrupt to handle the
> > > "hot" trip-point, and the reset at the "critical" trip-point. All that's
> > > missing is the plumbing in order to do so.
> > >
> > > So please co-ordinate with the thermal framework to do that.
> >
> > Let's dig further more for this point, so that we can get more specific
> > guidance and have a good preparation for next version's patch set.
> >
> > After i reviewed the thermal framework code, currently have one smooth
> > way to co-ordinate the trip points w/t thermal framework: use the function
> > *thermal_zone_device_register()* to register sensor, and can use the
> > callback function .get_trip_temp to tell thermal framework for the
> > trip points' temperature.
> >
> > For hisi thermal case, now the driver is using the function
> > *thermal_zone_of_sensor_register* to register sensor, but use this way
> > i have not found there have standard APIs which can be used by sensor
> > driver to get the trip points info from thermal framework.
> >
> > I may miss something for thermal framework, so if have existed APIs to
> > get the trip points, could pls point out?
>
> I am only familiar with the binding, not the Linux implementation -- The
> latter can change to accommodate your hardware without requiring binding
> changes. Please co-ordinate with the thermal maintainers.

I found that the functions of_thermal_get_trip_points(tz) and
of_thermal_get_ntrips(tz) will help with this.

> > > > > > +		if (of_property_read_bool(np, "hisilicon,tsensor-bind-irq")) {
> > > > > > +
> > > > > > +			if (data->irq_bind_sensor != -1)
> > > > > > +				dev_warn(&pdev->dev, "irq has bound to index %d\n",
> > > > > > +					 data->irq_bind_sensor);
> > > > > > +
> > > > > > +			/* bind irq to this sensor */
> > > > > > +			data->irq_bind_sensor = index;
> > > > > > +		}
> > > > >
> > > > > I don't see why this should be specified in the DT. Why do you believe
> > > > > it should?
> > > >
> > > > The thermal sensor module has four sensors, but have only one
> > > > interrupt signal; This interrupt can only be used by one sensor;
> > > > So want to use dts to bind the interrupt with one selected sensor.
> > >
> > > That's not all that great, though I'm not exactly sure how the kernel
> > > would select the best sensor to measure with. It would be good if you
> > > could talk to the thermal maintainers w.r.t. this.
> >
> > This will be decided by the silicon, right? Every soc has different
> > combination with cpu/gpu/vpu, so which part is hottest, this maybe
> > highly dependent on individual SoC.
> >
> > S/W just need provide the flexibility so that later can choose
> > the interrupt to bind with the sensor within the hottest part.
>
> Then the property you care about is which sensor is closest to what is
> likely to be the hottest component. Given that, the kernel can decide
> how to use the interrupt.

I will modify the driver to dynamically bind the interrupt to the
hottest sensor. I appreciate the good suggestion.

Thanks,
Leo Yan
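
One way to picture "dynamically bind the interrupt to the hottest sensor" is a small selection routine that is re-run whenever fresh readings are available. The sketch below is purely illustrative: `struct sensor_model`, `struct thermal_model`, `hottest_sensor()` and `rebind_irq()` are hypothetical names, temperatures are assumed to be in millidegrees Celsius, and the real driver would have to reprogram hardware where the comment says so.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SENSORS 4	/* the thread mentions four sensors sharing one interrupt */

struct sensor_model {
	int temp_mc;	/* last reading, millidegrees Celsius (assumed unit) */
};

struct thermal_model {
	struct sensor_model sensors[NR_SENSORS];
	int irq_bind_sensor;	/* index of the bound sensor, -1 while unbound */
};

/* Returns the index of the hottest sensor; ties go to the lowest index. */
static int hottest_sensor(const struct sensor_model *s, size_t n)
{
	size_t i, best = 0;

	assert(s != NULL && n > 0);
	for (i = 1; i < n; i++)
		if (s[i].temp_mc > s[best].temp_mc)
			best = i;
	return (int)best;
}

/* Re-evaluate the binding; returns 1 if the interrupt moved, 0 otherwise. */
static int rebind_irq(struct thermal_model *t)
{
	int hot = hottest_sensor(t->sensors, NR_SENSORS);

	if (hot == t->irq_bind_sensor)
		return 0;
	t->irq_bind_sensor = hot;	/* a real driver would reprogram the hardware here */
	return 1;
}
```

Keeping the choice in the driver rather than in a device-tree property matches the direction agreed in the thread: the binding describes the hardware, and the kernel decides how to use the single interrupt.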
Re: [PATCH RT 2/4] Revert "timers: do not raise softirq unconditionally"
On Sat, 2015-03-21 at 19:02 +0100, Mike Galbraith wrote:
> On Thu, 2015-03-19 at 10:42 -0600, Thavatchai Makphaibulchoke wrote:
> > On 03/19/2015 10:26 AM, Steven Rostedt wrote:
> > > On Thu, 19 Mar 2015 09:17:09 +0100
> > > Mike Galbraith wrote:
> > >
> > >
> > > >> (aw crap, let's go shopping)... so why is the one in timer.c ok?
> > >
> > > It's not. Sebastian, you said there were no other cases of rt_mutexes
> > > being taken in hard irq context. Looks like timer.c has one.
> > >
> > > So perhaps the real fix is to get that special case of ownership in
> > > hard interrupt context?
> > >
> > > -- Steve
> > >
> >
> > Steve, I'm still working on the fix we discussed using dummy irq_task.
> > I should be able to submit some time next week, if still interested.
> >
> > Either that, or I think we should remove the function
> > spin_do_trylock_in_interrupt() to prevent any possibility of running
> > into similar problems in the future.
>
> Why can't we just Let swapper be the owner when in irq with no dummy?
>
> I have "don't raise timer unconditionally" re-applied, the check for a
> running callback bits of my nohz_full fixlet, and the below on top of
> that, and all _seems_ well.

But not so well on 64 core box.  That has nothing to do with hacklet
though, re-applying timers-do-not-raise-softirq-unconditionally.patch
without that hangs the 64 core box during boot with no help from me
other than to patchlet to let nohz work at all, seems there's another
issue lurking there.  Hohum.  Without 'don't raise..", big box is fine.

	-Mike
Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
On Thu, 19 Mar 2015, Konstantin Khlebnikov wrote: > On 21.02.2015 07:09, Hugh Dickins wrote: > > > > The "team_usage" field added to struct page (in union with "private") > > is somewhat vaguely named: because while the huge page is sparsely > > occupied, it counts the occupancy; but once the huge page is fully > > occupied, it will come to be used differently in a later patch, as > > the huge mapcount (offset by the HPAGE_PMD_NR occupancy) - it is > > never possible to map a sparsely occupied huge page, because that > > would expose stale data to the user. > > That might be a problem if this approach is supposed to be used for > normal filesystems. Yes, most filesystems have their own use for page->private. My concern at this stage has just been to have a good implementation for tmpfs, but Kirill and others are certainly interested in looking beyond that. > Instead of adding dedicated counter shmem could > detect partially occupied page by scanning though all tail pages and > checking PageUptodate() and bump mapcount for all tail pages prevent > races between mmap and truncate. Overhead shouldn't be that big, also > we can add fastpath - mark completely uptodate page with one of unused > page flag (PG_private or something). I do already use PageChecked (PG_owner_priv_1) for just that purpose: noting all subpages Uptodate (and marked Dirty) when first mapping by pmd (in 12/24). But don't bump mapcount on the subpages, just the head: I don't mind doing a pass down the subpages when it's first hugely mapped, but prefer to avoid such a pass on every huge map and unmap - seems unnecessary. The team_usage (== private) field ends up with three or four separate counts (and an mlocked flag) packed into it: I expect we could trade some of those counts for scans down the 512 subpages when necessary, but I doubt it's a good tradeoff; and keeping atomicity would be difficult (I've never wanted to have to take page_lock or somesuch on every page in zap_pte_range). 
Without atomicity the stats go wrong (I think Kirill has a problem of that kind in his page_remove_rmap scan). It will be interesting to see what Kirill does to maintain the stats for huge pagecache: but he will have no difficulty in finding fields to store counts, because he's got lots of spare fields in those 511 tail pages - that's a useful benefit of the compound page, but does prevent the tails from being used in ordinary ways. (I did try using team_head[1].team_usage for more, but atomicity needs prevented it.) > > Another (strange) idea is adding separate array of struct huge_page > into each zone. They will work as headers for huge pages and hold > that kind of fields. Pageblock flags also could be stored here. It's not such a strange idea, it is a definite possibility. Though I've tended to think of them more as a separate array of struct pages, one for each of the hugepages. It's a complication I'll keep away from as long as I can, but something like that will probably have to come. Consider the ambiguity of the head page, whose flags and counts may represent the 4k page mapped by pte and the 2M page mapped by pmd: there's an absurdity to that, one that I can live with for now, but expect some nasty case to demand a change (the way I have it at present, just mlocking the 4k head is enough to hold the 2M hugepage in memory: that's not good, but should be quite easily fixed without needing the hugepage array itself). And I think ideas may emerge from the persistent memory struct page discussions, which feed in here. One reason to hold back for now. Hugh
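For readers following the atomicity argument above, here is a small userspace sketch of why packing several counts into one word helps: the "is it fully occupied?" check and the "bump the huge mapcount" update can then happen in a single compare-and-swap, with no lock and no scan of the subpages. The field layout, the struct name and the helper names are all assumptions made for illustration; this is not the encoding the huge tmpfs patches use for team_usage, and it uses C11 atomics rather than the kernel's own primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout, for illustration only: low 10 bits hold the
 * occupancy (0..512 subpages), the bits above hold the huge mapcount. */
#define USAGE_BITS   10u
#define USAGE_MASK   ((1u << USAGE_BITS) - 1u)
#define MAPCOUNT_ONE (1u << USAGE_BITS)

struct team {
    atomic_uint packed;   /* [ huge mapcount | occupancy ] */
};

static inline unsigned team_usage(struct team *t)
{
    return atomic_load(&t->packed) & USAGE_MASK;
}

static inline unsigned team_mapcount(struct team *t)
{
    return atomic_load(&t->packed) >> USAGE_BITS;
}

/* One more subpage of the huge page has been instantiated. */
static inline void team_add_subpage(struct team *t)
{
    atomic_fetch_add(&t->packed, 1u);
}

/*
 * Take a huge mapping only if every subpage is occupied. The occupancy
 * check and the mapcount increment are one atomic step, so a sparsely
 * occupied page can never be observed as hugely mapped.
 */
static inline bool team_try_map_huge(struct team *t, unsigned nr_subpages)
{
    unsigned old = atomic_load(&t->packed);

    do {
        if ((old & USAGE_MASK) != nr_subpages)
            return false;
    } while (!atomic_compare_exchange_weak(&t->packed, &old,
                                           old + MAPCOUNT_ONE));
    return true;
}
```

The alternative discussed in the thread, deriving occupancy by walking the 512 subpages, has no single word to compare-and-swap, which is where the difficulty of keeping the stats consistent comes from.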