Re: [PATCH] MAINTAINERS: Remove me as maintainer of bcache
On 04/19/2018 11:10 AM, Jens Axboe wrote:
> On 4/19/18 11:59 AM, Michael Lyle wrote:
>> Too much to do with other projects. I've enjoyed working with everyone
>> here, and hope to occasionally contribute on bcache.

Jens---

> It's been a pleasure having someone maintain it upstream, so this is
> pretty sad.

I've really enjoyed working with everyone here. The queue keeps stacking
up and I feel pretty bad when people need to wait a couple weeks for
reviews and assistance.

> Do you have anyone in mind for taking over the upstream part? The
> community seems fairly active, and there are regular contributors.
>
> Coly and Tang Junhui come to mind.

I think Coly would be a great maintainer.

Mike
Re: extra part in bcache patch commit 539d39eb2708
Hi everyone--

On Wed, Apr 18, 2018 at 10:17 PM, Coly Li wrote:
> Hi Michael and Jens,
>
> When I do a back port of bcache patches, I find commit 539d39eb2708
> ("bcache: fix wrong return value in bch_debug_init()") has an extra part
> from the original patch which Junhui Tang posted.

Sorry, I must have messed up a rebase. The patch it was combined with is
here:

https://www.spinics.net/lists/linux-bcache/msg05464.html

and was properly signed off and reviewed.

Mike
[PATCH] MAINTAINERS: Remove me as maintainer of bcache
Too much to do with other projects. I've enjoyed working with everyone here, and hope to occasionally contribute on bcache. Signed-off-by: Michael Lyle <ml...@lyle.org> --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index f9bf4651db41..9fd861906832 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2617,7 +2617,6 @@ S:Maintained F: drivers/net/hamradio/baycom* BCACHE (BLOCK LAYER CACHE) -M: Michael Lyle <ml...@lyle.org> M: Kent Overstreet <kent.overstr...@gmail.com> L: linux-bca...@vger.kernel.org W: http://bcache.evilpiepirate.org -- 2.17.0
Re: bcache and hibernation
On Thu, Apr 5, 2018 at 12:51 PM, Nikolaus Rath wrote:
> Hi Michael,
>
> Could you explain why this isn't a problem with writethrough? It seems
> to me that the trouble happens when the hibernation image is *read*, so
> why does it matter what kind of write caching is used?

With writethrough you can set up your loader to read it directly from the
backing device-- i.e. you don't need the cache, and there are at least
some valid configurations; with writeback some of the extents may be on
the cache dev so...

That said, it's not really great to put swap/hibernate on a cache
device... the workloads don't usually benefit much from tiering (since
they tend to be write-once-read-never or write-once-read-once).

>> I am unaware of a mechanism to prohibit this in the kernel-- to say that
>> a given type of block provider can't be involved in a resume operation.
>> Most documentation for hibernation explicitly cautions about the btrfs
>> situation, but use of bcache is less common and as a result generally
>> isn't covered.
>
> Could you maybe add a warning to Documentation/bcache.txt? I think this
> would have saved me.

Yah, I can look at that.

> Best,
> -Nikolaus

Mike
Re: bcache and hibernation
Hi Nikolaus (and everyone else),

Sorry I've been slow in responding. I probably need to step down as bcache
maintainer because so many other things have competed for my time lately
and I've fallen behind on both patches and the mailing list.

On 04/05/2018 01:51 AM, Nikolaus Rath wrote:
> Is there a way to prevent this from happening? Could e.g. the kernel detect
> that the swap device is (indirectly) on bcache and refuse to hibernate? Or
> is there a way to do a "true" read-only mount of a bcache volume so that one
> can safely resume from it?

I think you're correct. If you're using bcache in writeback mode, it is
not safe to hibernate there, because some of the blocks involved in the
resume can end up in cache (and there are dependency issues, like you
mention). There are similar cautions/problems with btrfs.

I am unaware of a mechanism to prohibit this in the kernel-- to say that
a given type of block provider can't be involved in a resume operation.
Most documentation for hibernation explicitly cautions about the btrfs
situation, but use of bcache is less common and as a result generally
isn't covered.

> Best,
> -Nikolaus

Mike
Re: [for-4.17 20/20] bcache: Fix a compiler warning in bcache_device_init()
On 03/18/2018 06:32 PM, Coly Li wrote:
> On 19/03/2018 8:36 AM, Michael Lyle wrote:
>> From: Bart Van Assche <bart.vanass...@wdc.com>
>>
>> Avoid that building with W=1 triggers the following compiler warning:
>>
>> drivers/md/bcache/super.c:776:20: warning: comparison is always false due to
>> limited range of data type [-Wtype-limits]
>>   d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
>>
>> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
>> Reviewed-by: Michael Lyle <ml...@lyle.org>
>
> There is a missing Reviewed-by: Coly Li <col...@suse.de> from me :-)

Hi Coly--- sorry that I lost these.

Mike
[for-4.17 16/20] bcache: Fix kernel-doc warnings
From: Bart Van Assche <bart.vanass...@wdc.com> Avoid that building with W=1 triggers warnings about the kernel-doc headers. Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/btree.c | 2 +- drivers/md/bcache/closure.c | 8 drivers/md/bcache/request.c | 1 + drivers/md/bcache/util.c| 18 -- 4 files changed, 14 insertions(+), 15 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 143ed5a758e7..17936b2dc7d6 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -962,7 +962,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op, return b; } -/** +/* * bch_btree_node_get - find a btree node in the cache and lock it, reading it * in from disk if necessary. * diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c index c0949c9f843b..0e14969182c6 100644 --- a/drivers/md/bcache/closure.c +++ b/drivers/md/bcache/closure.c @@ -46,7 +46,7 @@ void closure_sub(struct closure *cl, int v) } EXPORT_SYMBOL(closure_sub); -/** +/* * closure_put - decrement a closure's refcount */ void closure_put(struct closure *cl) @@ -55,7 +55,7 @@ void closure_put(struct closure *cl) } EXPORT_SYMBOL(closure_put); -/** +/* * closure_wake_up - wake up all closures on a wait list, without memory barrier */ void __closure_wake_up(struct closure_waitlist *wait_list) @@ -79,9 +79,9 @@ EXPORT_SYMBOL(__closure_wake_up); /** * closure_wait - add a closure to a waitlist - * - * @waitlist will own a ref on @cl, which will be released when + * @waitlist: will own a ref on @cl, which will be released when * closure_wake_up() is called on @waitlist. + * @cl: closure pointer. 
* */ bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 5a82237c7025..a65e3365eeb9 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -296,6 +296,7 @@ static void bch_data_insert_start(struct closure *cl) /** * bch_data_insert - stick some data in the cache + * @cl: closure pointer. * * This is the starting point for any data to end up in a cache device; it could * be from a normal write, or a writeback write, or a write to a flash only diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c index 6198041f0ee2..74febd5230df 100644 --- a/drivers/md/bcache/util.c +++ b/drivers/md/bcache/util.c @@ -82,10 +82,9 @@ STRTO_H(strtoll, long long) STRTO_H(strtoull, unsigned long long) /** - * bch_hprint() - formats @v to human readable string for sysfs. - * - * @v - signed 64 bit integer - * @buf - the (at least 8 byte) buffer to format the result into. + * bch_hprint - formats @v to human readable string for sysfs. + * @buf: the (at least 8 byte) buffer to format the result into. + * @v: signed 64 bit integer * * Returns the number of bytes used by format. */ @@ -225,13 +224,12 @@ void bch_time_stats_update(struct time_stats *stats, uint64_t start_time) } /** - * bch_next_delay() - increment @d by the amount of work done, and return how - * long to delay until the next time to do some work. + * bch_next_delay() - update ratelimiting statistics and calculate next delay + * @d: the struct bch_ratelimit to update + * @done: the amount of work done, in arbitrary units * - * @d - the struct bch_ratelimit to update - * @done - the amount of work done, in arbitrary units - * - * Returns the amount of time to delay by, in jiffies + * Increment @d by the amount of work done, and return how long to delay in + * jiffies until the next time to do some work. */ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done) { -- 2.14.1
[for-4.17 03/20] bcache: stop dc->writeback_rate_update properly
From: Coly Li <col...@suse.de>

struct delayed_work writeback_rate_update in struct cached_dev is a delayed worker which calls update_writeback_rate() periodically (the interval is defined by dc->writeback_rate_update_seconds). When a metadata I/O error happens on the cache device, the bcache error handling routine bch_cache_set_error() calls bch_cache_set_unregister() to retire the whole cache set. On the unregister code path, this delayed work is stopped by calling cancel_delayed_work_sync(&dc->writeback_rate_update).

dc->writeback_rate_update is a delayed work different from others in bcache: in its routine update_writeback_rate(), this delayed work re-arms itself. That means that when cancel_delayed_work_sync() returns, this delayed work can still be executed after the several seconds defined by dc->writeback_rate_update_seconds. The problem is, after cancel_delayed_work_sync() returns, the cache set unregister code path continues and releases the memory of the struct cache_set. When the delayed work is then scheduled to run, __update_writeback_rate() references the already released cache_set memory and triggers a NULL pointer dereference fault.

This patch introduces two more bcache device flags,
- BCACHE_DEV_WB_RUNNING
  bit set: bcache device is in writeback mode and running, it is OK for dc->writeback_rate_update to re-arm itself.
  bit clear: bcache device is trying to stop dc->writeback_rate_update, this delayed work should not re-arm itself and should quit.
- BCACHE_DEV_RATE_DW_RUNNING
  bit set: routine update_writeback_rate() is executing.
  bit clear: routine update_writeback_rate() has quit.

This patch also adds a function cancel_writeback_rate_update_dwork() to wait for dc->writeback_rate_update to quit before canceling it by calling cancel_delayed_work_sync(). In order to avoid a deadlock if dc->writeback_rate_update quits unexpectedly, after time_out seconds this function gives up and continues to call cancel_delayed_work_sync().
And here is how this patch stops the self re-armed delayed work properly with the above flags.

update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling cancel_writeback_rate_update_dwork(), clear flag BCACHE_DEV_WB_RUNNING. Before calling cancel_delayed_work_sync(), wait until flag BCACHE_DEV_RATE_DW_RUNNING is clear. So when cancel_delayed_work_sync() is called, dc->writeback_rate_update must either be already re-armed, or have quit after seeing BCACHE_DEV_WB_RUNNING cleared. In both cases the delayed work routine update_writeback_rate() won't be executed after cancel_delayed_work_sync() returns.

Inside update_writeback_rate(), flag BCACHE_DEV_WB_RUNNING is checked before calling schedule_delayed_work(). If this flag is cleared, it means someone is about to stop the delayed work. Because flag BCACHE_DEV_RATE_DW_RUNNING is already set and cancel_delayed_work_sync() has to wait for this flag to be cleared, we don't need to worry about a race condition here.

If update_writeback_rate() is scheduled to run after BCACHE_DEV_RATE_DW_RUNNING is checked and before cancel_delayed_work_sync() is called in cancel_writeback_rate_update_dwork(), it is also safe, because at this moment BCACHE_DEV_WB_RUNNING is cleared with a memory barrier. As mentioned previously, update_writeback_rate() will see that BCACHE_DEV_WB_RUNNING is clear and quit immediately.

Because there are more dependencies inside update_writeback_rate() on struct cache_set memory, dc->writeback_rate_update is not a simple self re-arming delayed work. After trying many different methods (e.g. holding dc->count, or using locks), this is the only way I can find which works to properly stop the dc->writeback_rate_update delayed work.

Changelog:
v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING to bit indexes, for test_bit().
v2: Try to fix the race issue which is pointed out by Junhui.
v1: The initial version for review Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Junhui Tang <tang.jun...@zte.com.cn> Reviewed-by: Michael Lyle <ml...@lyle.org> Cc: Michael Lyle <ml...@lyle.org> Cc: Hannes Reinecke <h...@suse.com> --- drivers/md/bcache/bcache.h| 9 + drivers/md/bcache/super.c | 38 ++ drivers/md/bcache/sysfs.c | 3 ++- drivers/md/bcache/writeback.c | 29 - 4 files changed, 69 insertions(+), 10 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 12e5197f186c..b5ddb848cd31 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -258,10 +258,11 @@ struct bcache_device { struct gendisk *disk; unsigned long flags; -#define BCACHE_DEV_CLOSING 0 -#define BCACHE_DEV_DETACHING 1 -#define BCACHE_DEV_UNLINK_DONE 2 - +#define BCACHE_DEV_CLOSING 0 +#define BCACHE_DEV_
[for-4.17 20/20] bcache: Fix a compiler warning in bcache_device_init()
From: Bart Van Assche <bart.vanass...@wdc.com> Avoid that building with W=1 triggers the following compiler warning: drivers/md/bcache/super.c:776:20: warning: comparison is always false due to limited range of data type [-Wtype-limits] d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) { ^ Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/super.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 640ff8072ed8..d90d9e59ca00 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -778,6 +778,8 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size, sector_t sectors) { struct request_queue *q; + const size_t max_stripes = min_t(size_t, INT_MAX, +SIZE_MAX / sizeof(atomic_t)); size_t n; int idx; @@ -786,9 +788,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size, d->nr_stripes = DIV_ROUND_UP_ULL(sectors, d->stripe_size); - if (!d->nr_stripes || - d->nr_stripes > INT_MAX || - d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) { + if (!d->nr_stripes || d->nr_stripes > max_stripes) { pr_err("nr_stripes too large or invalid: %u (start sector beyond end of disk?)", (unsigned)d->nr_stripes); return -ENOMEM; -- 2.14.1
[for-4.17 19/20] bcache: Reduce the number of sparse complaints about lock imbalances
From: Bart Van Assche <bart.vanass...@wdc.com> Add more annotations for sparse to inform it about which functions do not have the same number of spin_lock() and spin_unlock() calls. Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/journal.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index acd0e5c074dd..18f1b5239620 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -594,6 +594,7 @@ static void journal_write_done(struct closure *cl) } static void journal_write_unlock(struct closure *cl) + __releases(>journal.lock) { struct cache_set *c = container_of(cl, struct cache_set, journal.io); @@ -705,6 +706,7 @@ static void journal_try_write(struct cache_set *c) static struct journal_write *journal_wait_for_write(struct cache_set *c, unsigned nkeys) + __acquires(>journal.lock) { size_t sectors; struct closure cl; -- 2.14.1
[for-4.17 15/20] bcache: Annotate switch fall-through
From: Bart Van Assche <bart.vanass...@wdc.com> This patch avoids that building with W=1 triggers complaints about switch fall-throughs. Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/util.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c index a23cd6a14b74..6198041f0ee2 100644 --- a/drivers/md/bcache/util.c +++ b/drivers/md/bcache/util.c @@ -32,20 +32,27 @@ int bch_ ## name ## _h(const char *cp, type *res) \ case 'y': \ case 'z': \ u++;\ + /* fall through */ \ case 'e': \ u++;\ + /* fall through */ \ case 'p': \ u++;\ + /* fall through */ \ case 't': \ u++;\ + /* fall through */ \ case 'g': \ u++;\ + /* fall through */ \ case 'm': \ u++;\ + /* fall through */ \ case 'k': \ u++;\ if (e++ == cp) \ return -EINVAL; \ + /* fall through */ \ case '\n': \ case '\0': \ if (*e == '\n') \ -- 2.14.1
[for-4.17 02/20] bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
From: Coly Li <col...@suse.de>

In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()", cached_dev_get() is called when creating dc->writeback_thread, and cached_dev_put() is called when exiting dc->writeback_thread. This modification works well unless people detach the bcache device manually by 'echo 1 > /sys/block/bcache/bcache/detach'. This sysfs interface only calls bch_cached_dev_detach(), which wakes up dc->writeback_thread but does not stop it. The reason is, before patch "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside bch_writeback_thread(), if the cache is not dirty after writeback, cached_dev_put() was called there; and in cached_dev_make_request(), when a new write request takes the cache from clean to dirty, cached_dev_get() was called there. Since we no longer operate dc->count in these locations, the refcount dc->count cannot be dropped after the cache becomes clean, and cached_dev_detach_finish() won't be called to detach the bcache device.

This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is set inside bch_writeback_thread(). If this bit is set and the cache is clean (no existing writeback_keys), break the while-loop, call cached_dev_put() and quit the writeback thread. Please note that if the cache is still dirty, the writeback thread should continue to perform writeback even when BCACHE_DEV_DETACHING is set; this is the original design of manual detach.

It is safe to do the following check without locking; let me explain why.

+	if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
+	    (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {

If the kernel thread does not sleep and continues to run because the conditions are not updated in time on the running CPU core, it just consumes more CPU cycles and does no harm. This should-sleep-but-run case is safe here. We just focus on the should-run-but-sleep condition, which means the writeback thread goes to sleep by mistake while it should continue to run.
1, First of all, no matter whether the writeback thread is hung or not, kthread_stop() from cached_dev_detach_finish() will wake it up and terminate it by making kthread_should_stop() return true. And in normal run time, the bit at index BCACHE_DEV_DETACHING is always cleared, so the condition "!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)" is always true and can be ignored as a constant value.

2, If one of the following conditions is true, the writeback thread should go to sleep: "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running". Each of them independently controls whether the writeback thread should sleep, so let's analyse them one by one.

2.1 condition "!atomic_read(&dc->has_dirty)"
If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will call bch_writeback_queue() immediately, or call bch_writeback_add() which indirectly calls bch_writeback_queue() too. In bch_writeback_queue(), wake_up_process(dc->writeback_thread) is called. It sets the writeback thread's task state to TASK_RUNNING following an implicit memory barrier, and then tries to wake up the writeback thread. In the writeback thread, its task state is set to TASK_INTERRUPTIBLE before doing the condition check. If the other CPU core sets the TASK_RUNNING state after the writeback thread sets TASK_INTERRUPTIBLE, the writeback thread will be scheduled to run very soon because its state is TASK_RUNNING. If the other CPU core sets the TASK_RUNNING state before the writeback thread sets TASK_INTERRUPTIBLE, the implicit memory barrier of wake_up_process() will make sure the modification of dc->has_dirty on the other CPU core is updated and observed on the CPU core of the writeback thread. Therefore the condition check will correctly be false, and writeback continues without sleeping.

2.2 condition "!dc->writeback_running"
dc->writeback_running can be changed via a sysfs file, and every time it is modified, a bch_writeback_queue() call always follows. So the change is always observed on the CPU core of the writeback thread.
If dc->writeback_running is changed from 0 to 1 on another CPU core, this condition check will observe the modification and allow the writeback thread to continue to run without sleeping.

Now we can see that, even without locking protection, the multiple-condition check is safe here; no deadlock or process hang-up will happen.

I composed a separate patch because patch "bcache: fix cached_dev->count usage for bch_cache_set_error()" already has a "Reviewed-by:" from Hannes Reinecke. Also this fix is not trivial and is good as a separate patch.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Hannes Reinecke <h...@suse.com>
Cc: Huijun Tang <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/writeback.c
[for-4.17 04/20] bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
From: Coly Li <col...@suse.de>

When too many I/Os have failed on the cache device, bch_cache_set_error() is called in the error handling code path to retire the whole problematic cache set. If new I/O requests continue to come in and take the refcount dc->count, the cache set won't be retired immediately; this is a problem. Furthermore, several kernel threads and self-armed kernel workers may still be running after bch_cache_set_error() is called. It takes quite a while for them to stop, or they won't stop at all. They also prevent the cache set from being retired.

The solution in this patch is to add a per-cache-set flag to disable I/O requests on this cache and all attached backing devices. Then new incoming I/O requests can be rejected in *_make_request() before taking the refcount, and kernel threads and self-armed kernel workers can stop very fast when the flag bit CACHE_SET_IO_DISABLE is set.

Because bcache also does internal I/O for writeback, garbage collection, bucket allocation and journaling, this kind of I/O should be disabled after bch_cache_set_error() is called as well. So closure_bio_submit() is modified to check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set, closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and return; generic_make_request() won't be called.

A sysfs interface is also added to set or clear the CACHE_SET_IO_DISABLE bit in cache_set->flags, to disable or enable cache set I/O for debugging. It is helpful for triggering more corner-case issues on a failed cache device.

Changelog
v4, add wait_for_kthread_stop(), and call it before the writeback and gc kernel threads exit.
v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is a bit index.
    remove "bcache: " prefix when printing out kernel messages.
v2, more changes from the previous review,
- Use CACHE_SET_IO_DISABLE in cache_set->flags, suggested by Junhui.
- Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop; this is reported by and inspired from an original patch of Pavel Vazharov.
v1, initial version. Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Hannes Reinecke <h...@suse.com> Reviewed-by: Michael Lyle <ml...@lyle.org> Cc: Junhui Tang <tang.jun...@zte.com.cn> Cc: Michael Lyle <ml...@lyle.org> Cc: Pavel Vazharov <frea...@gmail.com> --- drivers/md/bcache/alloc.c | 3 ++- drivers/md/bcache/bcache.h| 33 + drivers/md/bcache/btree.c | 11 --- drivers/md/bcache/io.c| 2 +- drivers/md/bcache/journal.c | 4 ++-- drivers/md/bcache/request.c | 26 +++--- drivers/md/bcache/super.c | 6 +- drivers/md/bcache/sysfs.c | 18 ++ drivers/md/bcache/util.h | 6 -- drivers/md/bcache/writeback.c | 37 - 10 files changed, 116 insertions(+), 30 deletions(-) diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c index 458e1d38577d..004cc3cc6123 100644 --- a/drivers/md/bcache/alloc.c +++ b/drivers/md/bcache/alloc.c @@ -287,7 +287,8 @@ do { \ break; \ \ mutex_unlock(&(ca)->set->bucket_lock); \ - if (kthread_should_stop()) {\ + if (kthread_should_stop() ||\ + test_bit(CACHE_SET_IO_DISABLE, >set->flags)) { \ set_current_state(TASK_RUNNING);\ return 0; \ } \ diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index b5ddb848cd31..8a0327581d62 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -188,6 +188,7 @@ #include #include #include +#include #include "bset.h" #include "util.h" @@ -475,10 +476,15 @@ struct gc_stat { * * CACHE_SET_RUNNING means all cache devices have been registered and journal * replay is complete. + * + * CACHE_SET_IO_DISABLE is set when bcache is stopping the whold cache set, all + * external and internal I/O should be denied when this flag is set. 
+ * */ #define CACHE_SET_UNREGISTERING0 #defineCACHE_SET_STOPPING 1 #defineCACHE_SET_RUNNING 2 +#define CACHE_SET_IO_DISABLE 3 struct cache_set { struct closure cl; @@ -868,6 +874,33 @@ static inline void wake_up_allocators(struct cache_set *c) wake_up_process(ca->alloc_thread); } +static inline void closure_bio_submit(struct cache_set *c, + struct bio *bio, +
[for-4.17 05/20] bcache: add stop_when_cache_set_failed option to backing device
From: Coly Li <col...@suse.de>

When there are too many I/O errors on the cache device, the current bcache code will retire the whole cache set and detach all bcache devices. But the detached bcache devices are not stopped, which is problematic when bcache is in writeback mode. If the retired cache set has dirty data belonging to the backing devices, continuing to write to the bcache device writes to the backing device directly. If the LBA of a write request has a dirty version cached on the cache device, then the next time the cache device is re-registered and the backing device re-attached to it, the stale dirty data on the cache device will be written to the backing device, overwriting the newer directly written data. This situation causes quite serious data corruption.

But we cannot simply stop all attached bcache devices when the cache set is broken or disconnected. Consider, for example, using bcache to accelerate the performance of an email service. In such a workload, if the cache device is broken but no dirty data is lost, keeping the bcache device alive and permitting the email service to continue accessing user data might be a better solution for the cache device failure.

Nix <n...@esperi.org.uk> points out the issue and provides the above example to explain why it might be necessary to not stop the bcache device for a broken cache device. Pavel Goran <via-bca...@pvgoran.name> provides a brilliant suggestion to provide "always" and "auto" options for the per-cached-device sysfs file stop_when_cache_set_failed. If the cache set is retiring and the backing device has no dirty data in the cache, it should be safe to keep the bcache device alive. In this case, if stop_when_cache_set_failed is set to "auto", the device failure handling code will not stop this bcache device, permitting applications to access the backing device via an unattached bcache device.

Changelog:
[mlyle: edited to not break string constants across lines]
v3: fix typos pointed out by Nix.
v2: change option values of stop_when_cache_set_failed from 1/0 to "auto"/"always".
v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1 (always stop). Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Michael Lyle <ml...@lyle.org> Signed-off-by: Michael Lyle <ml...@lyle.org> Cc: Nix <n...@esperi.org.uk> Cc: Pavel Goran <via-bca...@pvgoran.name> Cc: Junhui Tang <tang.jun...@zte.com.cn> Cc: Hannes Reinecke <h...@suse.com> --- drivers/md/bcache/bcache.h | 9 ++ drivers/md/bcache/super.c | 78 -- drivers/md/bcache/sysfs.c | 17 ++ 3 files changed, 94 insertions(+), 10 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 8a0327581d62..5e9f3610c6fd 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -288,6 +288,12 @@ struct io { sector_tlast; }; +enum stop_on_failure { + BCH_CACHED_DEV_STOP_AUTO = 0, + BCH_CACHED_DEV_STOP_ALWAYS, + BCH_CACHED_DEV_STOP_MODE_MAX, +}; + struct cached_dev { struct list_headlist; struct bcache_devicedisk; @@ -380,6 +386,8 @@ struct cached_dev { unsignedwriteback_rate_i_term_inverse; unsignedwriteback_rate_p_term_inverse; unsignedwriteback_rate_minimum; + + enum stop_on_failurestop_when_cache_set_failed; }; enum alloc_reserve { @@ -939,6 +947,7 @@ void bch_write_bdev_super(struct cached_dev *, struct closure *); extern struct workqueue_struct *bcache_wq; extern const char * const bch_cache_modes[]; +extern const char * const bch_stop_on_failure_modes[]; extern struct mutex bch_register_lock; extern struct list_head bch_cache_sets; diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index fea283e6389e..dab9f59a1a2f 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -47,6 +47,14 @@ const char * const bch_cache_modes[] = { NULL }; +/* Default is -1; we skip past it for stop_when_cache_set_failed */ +const char * const bch_stop_on_failure_modes[] = { + "default", + "auto", + "always", + NULL +}; + static struct kobject *bcache_kobj; struct mutex bch_register_lock; LIST_HEAD(bch_cache_sets); @@ -1199,6 +1207,9 @@ 
static int cached_dev_init(struct cached_dev *dc, unsigned block_size) max(dc->disk.disk->queue->backing_dev_info->ra_pages, q->backing_dev_info->ra_pages); + /* default to auto */ + dc->stop_when_cache_set_failed = BCH_CACHED_DEV_STOP_AUTO; + bch_cached_dev_request_init(dc); bch_cached_dev_writeback_init(dc); return 0; @@ -1475,25 +1486,72 @@ static void cache_set_flush(struct closure *cl) closure_return(cl); } +/* + * This function is only called when CACHE_SE
[for-4.17 11/20] bcache: add backing_request_endio() for bi_end_io
From: Coly Li <col...@suse.de>

In order to catch I/O errors of the backing device, a separate bi_end_io callback is required. Then a per-backing-device counter can record the number of I/O errors and retire the backing device if the counter reaches a per-backing-device I/O error limit.

This patch adds backing_request_endio() to the bcache backing device I/O code path; this is a preparation for further, more complicated backing device failure handling. So far there is no real code logic change; I make this change a separate patch to make sure it is stable and reliable for further work.

Changelog:
v2: Fix code comment typo, remove a redundant bch_writeback_add() line added in the v4 patch set.
v1: indeed this is newly added in this patch set.

[mlyle: truncated commit subject]

Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Hannes Reinecke <h...@suse.com> Reviewed-by: Michael Lyle <ml...@lyle.org> Cc: Junhui Tang <tang.jun...@zte.com.cn> Cc: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/request.c | 91 --- drivers/md/bcache/super.c | 1 + drivers/md/bcache/writeback.c | 1 + 3 files changed, 78 insertions(+), 15 deletions(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 5c8ae69c8502..b4a5768afbe9 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl) } op->insert_data_done = true; + /* get in bch_data_insert() */ bio_put(bio); out: continue_at(cl, bch_data_insert_keys, op->wq); @@ -630,6 +631,38 @@ static void request_endio(struct bio *bio) closure_put(cl); } +static void backing_request_endio(struct bio *bio) +{ + struct closure *cl = bio->bi_private; + + if (bio->bi_status) { + struct search *s = container_of(cl, struct search, cl); + /* +* If a bio has REQ_PREFLUSH for writeback mode, it is +* specially assembled in cached_dev_write() for a non-zero +* write request which has REQ_PREFLUSH.
we don't set +* s->iop.status by this failure, the status will be decided +* by result of bch_data_insert() operation. +*/ + if (unlikely(s->iop.writeback && +bio->bi_opf & REQ_PREFLUSH)) { + char buf[BDEVNAME_SIZE]; + + bio_devname(bio, buf); + pr_err("Can't flush %s: returned bi_status %i", + buf, bio->bi_status); + } else { + /* set to orig_bio->bi_status in bio_complete() */ + s->iop.status = bio->bi_status; + } + s->recoverable = false; + /* should count I/O error for backing device here */ + } + + bio_put(bio); + closure_put(cl); +} + static void bio_complete(struct search *s) { if (s->orig_bio) { @@ -644,13 +677,21 @@ static void bio_complete(struct search *s) } } -static void do_bio_hook(struct search *s, struct bio *orig_bio) +static void do_bio_hook(struct search *s, + struct bio *orig_bio, + bio_end_io_t *end_io_fn) { struct bio *bio = >bio.bio; bio_init(bio, NULL, 0); __bio_clone_fast(bio, orig_bio); - bio->bi_end_io = request_endio; + /* +* bi_end_io can be set separately somewhere else, e.g. the +* variants in, +* - cache_bio->bi_end_io from cached_dev_cache_miss() +* - n->bi_end_io from cache_lookup_fn() +*/ + bio->bi_end_io = end_io_fn; bio->bi_private = >cl; bio_cnt_set(bio, 3); @@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio, s = mempool_alloc(d->c->search, GFP_NOIO); closure_init(>cl, NULL); - do_bio_hook(s, bio); + do_bio_hook(s, bio, request_endio); s->orig_bio = bio; s->cache_miss = NULL; @@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl) trace_bcache_read_retry(s->orig_bio); s->iop.status = 0; - do_bio_hook(s, s->orig_bio); + do_bio_hook(s, s->orig_bio, backing_request_endio); /* XXX: invalidate cache */ + /* I/O request sent to backing device */ closure_bio_submit(s->iop.c, bio, cl); } @@ -859,7 +901,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s, bio_copy_dev(cache_bio, miss); cache_bio->bi_iter.bi_size = s->insert_bio_sectors << 9; - cache_bio->
[for-4.17 06/20] bcache: fix inaccurate io state for detached bcache devices
From: Tang Junhui <tang.jun...@zte.com.cn>

When we run I/O on a detached device and then run iostat to show the I/O status, it normally shows something like the following (some fields omitted):

Device: ... avgrq-sz avgqu-sz await r_await w_await svctm %util
sdd     ... 15.89    0.53     1.82  0.20    2.23    1.81  52.30
bcache0 ... 15.89    115.42   0.00  0.00    0.00    2.40  69.60

but after the I/O has stopped, there are still very large avgqu-sz and %util values, as below:

Device: ... avgrq-sz avgqu-sz await r_await w_await svctm %util
bcache0 ... 0        5326.32  0.00  0.00    0.00    0.00  100.10

The reason for this issue is that only generic_start_io_acct() is called, and generic_end_io_acct() is never called, for a detached device in cached_dev_make_request(). See the code:

//start generic_start_io_acct()
generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
if (cached_dev_get(dc)) {
	//will call back generic_end_io_acct()
} else {
	//will not call generic_end_io_acct()
}

This patch calls generic_end_io_acct() at the end of I/O for detached devices, so we can show the I/O state correctly.

(Modified to use GFP_NOIO in kzalloc() by Coly Li)

Changelog:
v2: fix typo.
v1: the initial version.
Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> Reviewed-by: Coly Li <col...@suse.de> Reviewed-by: Hannes Reinecke <h...@suse.com> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/request.c | 58 +++-- 1 file changed, 51 insertions(+), 7 deletions(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 7aca308bee5b..5c8ae69c8502 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -986,6 +986,55 @@ static void cached_dev_nodata(struct closure *cl) continue_at(cl, cached_dev_bio_complete, NULL); } +struct detached_dev_io_private { + struct bcache_device*d; + unsigned long start_time; + bio_end_io_t*bi_end_io; + void*bi_private; +}; + +static void detached_dev_end_io(struct bio *bio) +{ + struct detached_dev_io_private *ddip; + + ddip = bio->bi_private; + bio->bi_end_io = ddip->bi_end_io; + bio->bi_private = ddip->bi_private; + + generic_end_io_acct(ddip->d->disk->queue, + bio_data_dir(bio), + >d->disk->part0, ddip->start_time); + + kfree(ddip); + + bio->bi_end_io(bio); +} + +static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) +{ + struct detached_dev_io_private *ddip; + struct cached_dev *dc = container_of(d, struct cached_dev, disk); + + /* +* no need to call closure_get(>disk.cl), +* because upper layer had already opened bcache device, +* which would call closure_get(>disk.cl) +*/ + ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO); + ddip->d = d; + ddip->start_time = jiffies; + ddip->bi_end_io = bio->bi_end_io; + ddip->bi_private = bio->bi_private; + bio->bi_end_io = detached_dev_end_io; + bio->bi_private = ddip; + + if ((bio_op(bio) == REQ_OP_DISCARD) && + !blk_queue_discard(bdev_get_queue(dc->bdev))) + bio->bi_end_io(bio); + else + generic_make_request(bio); +} + /* Cached devices - read & write stuff */ static blk_qc_t cached_dev_make_request(struct request_queue *q, @@ -1028,13 +1077,8 @@ static blk_qc_t cached_dev_make_request(struct request_queue 
*q, else cached_dev_read(dc, s); } - } else { - if ((bio_op(bio) == REQ_OP_DISCARD) && - !blk_queue_discard(bdev_get_queue(dc->bdev))) - bio_endio(bio); - else - generic_make_request(bio); - } + } else + detached_dev_do_request(d, bio); return BLK_QC_T_NONE; } -- 2.14.1
[for-4.17 07/20] bcache: fix incorrect sysfs output value of strip size
From: Tang Junhui <tang.jun...@zte.com.cn>

Stripe size is shown as zero when there is no stripe in the backing device:

[root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size
0.0k

Actually it should be 1 TB (1 << 31 sectors), but in the sysfs interface stripe_size was converted from sectors to bytes by shifting left 9 bits, so the 32-bit variable overflowed. This patch changes the variable to a 64-bit type before shifting.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 414129f7c49f..8c3fd05db87a 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -181,7 +181,7 @@ SHOW(__bch_cached_dev)
 	sysfs_hprint(dirty_data,
 		     bcache_dev_sectors_dirty(&dc->disk) << 9);
 
-	sysfs_hprint(stripe_size,	dc->disk.stripe_size << 9);
+	sysfs_hprint(stripe_size,	((uint64_t)dc->disk.stripe_size) << 9);
 	var_printf(partial_stripes_expensive,	"%u");
 
 	var_hprint(sequential_cutoff);
-- 
2.14.1
[for-4.17 00/20] First 20 commits for bcache
Hi Jens---

Here's a series of patches for 4.17. All have been reviewed and tested. There's more in my review queue, but they're all things that are "trickier" and will take a bit more work to convince myself are safe.

All have received simple creation/attach/detach/reboot scenario testing and will continue to be tested. Please apply at your leisure.

Thanks,

Mike

--

Bart Van Assche (8):
  bcache: Fix indentation
  bcache: Add __printf annotation to __bch_check_keys()
  bcache: Annotate switch fall-through
  bcache: Fix kernel-doc warnings
  bcache: Remove an unused variable
  bcache: Suppress more warnings about set-but-not-used variables
  bcache: Reduce the number of sparse complaints about lock imbalances
  bcache: Fix a compiler warning in bcache_device_init()

Chengguang Xu (1):
  bcache: move closure debug file into debug directory

Coly Li (7):
  bcache: fix cached_dev->count usage for bch_cache_set_error()
  bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
  bcache: stop dc->writeback_rate_update properly
  bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
  bcache: add stop_when_cache_set_failed option to backing device
  bcache: add backing_request_endio() for bi_end_io
  bcache: add io_disable to struct cached_dev

Tang Junhui (4):
  bcache: fix inaccurate io state for detached bcache devices
  bcache: fix incorrect sysfs output value of strip size
  bcache: fix error return value in memory shrink
  bcache: fix using of loop variable in memory shrink

 drivers/md/bcache/alloc.c     |   3 +-
 drivers/md/bcache/bcache.h    |  57 -
 drivers/md/bcache/bset.c      |   4 +-
 drivers/md/bcache/bset.h      |   5 +-
 drivers/md/bcache/btree.c     |  26 +++---
 drivers/md/bcache/closure.c   |  17 ++--
 drivers/md/bcache/closure.h   |   5 +-
 drivers/md/bcache/debug.c     |  14 ++--
 drivers/md/bcache/extents.c   |   2 -
 drivers/md/bcache/io.c        |  16 +++-
 drivers/md/bcache/journal.c   |   8 +-
 drivers/md/bcache/request.c   | 184 +++---
 drivers/md/bcache/super.c     | 154 ++-
 drivers/md/bcache/sysfs.c     |  55 -
 drivers/md/bcache/util.c      |  25 +++---
 drivers/md/bcache/util.h      |   6 --
 drivers/md/bcache/writeback.c |  92 ++---
 drivers/md/bcache/writeback.h |   4 +-
 18 files changed, 552 insertions(+), 125 deletions(-)
-- 
2.14.1
[for-4.17 17/20] bcache: Remove an unused variable
From: Bart Van Assche <bart.vanass...@wdc.com> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/extents.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/md/bcache/extents.c b/drivers/md/bcache/extents.c index f9d391711595..c334e461 100644 --- a/drivers/md/bcache/extents.c +++ b/drivers/md/bcache/extents.c @@ -534,7 +534,6 @@ static bool bch_extent_bad_expensive(struct btree *b, const struct bkey *k, static bool bch_extent_bad(struct btree_keys *bk, const struct bkey *k) { struct btree *b = container_of(bk, struct btree, keys); - struct bucket *g; unsigned i, stale; if (!KEY_PTRS(k) || @@ -549,7 +548,6 @@ static bool bch_extent_bad(struct btree_keys *bk, const struct bkey *k) return false; for (i = 0; i < KEY_PTRS(k); i++) { - g = PTR_BUCKET(b->c, k, i); stale = ptr_stale(b->c, k, i); btree_bug_on(stale > 96, b, -- 2.14.1
[for-4.17 14/20] bcache: Add __printf annotation to __bch_check_keys()
From: Bart Van Assche <bart.vanass...@wdc.com> Make it possible for the compiler to verify the consistency of the format string passed to __bch_check_keys() and the arguments that should be formatted according to that format string. Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/bset.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/bset.h b/drivers/md/bcache/bset.h index fa506c1aa524..0c24280f3b98 100644 --- a/drivers/md/bcache/bset.h +++ b/drivers/md/bcache/bset.h @@ -531,14 +531,15 @@ int __bch_keylist_realloc(struct keylist *, unsigned); #ifdef CONFIG_BCACHE_DEBUG int __bch_count_data(struct btree_keys *); -void __bch_check_keys(struct btree_keys *, const char *, ...); +void __printf(2, 3) __bch_check_keys(struct btree_keys *, const char *, ...); void bch_dump_bset(struct btree_keys *, struct bset *, unsigned); void bch_dump_bucket(struct btree_keys *); #else static inline int __bch_count_data(struct btree_keys *b) { return -1; } -static inline void __bch_check_keys(struct btree_keys *b, const char *fmt, ...) {} +static inline void __printf(2, 3) + __bch_check_keys(struct btree_keys *b, const char *fmt, ...) {} static inline void bch_dump_bucket(struct btree_keys *b) {} void bch_dump_bset(struct btree_keys *, struct bset *, unsigned); -- 2.14.1
[for-4.17 18/20] bcache: Suppress more warnings about set-but-not-used variables
From: Bart Van Assche <bart.vanass...@wdc.com> This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/bset.c| 4 ++-- drivers/md/bcache/journal.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c index e56d3ecdbfcb..579c696a5fe0 100644 --- a/drivers/md/bcache/bset.c +++ b/drivers/md/bcache/bset.c @@ -1072,7 +1072,7 @@ EXPORT_SYMBOL(bch_btree_iter_init); static inline struct bkey *__bch_btree_iter_next(struct btree_iter *iter, btree_iter_cmp_fn *cmp) { - struct btree_iter_set unused; + struct btree_iter_set b __maybe_unused; struct bkey *ret = NULL; if (!btree_iter_end(iter)) { @@ -1087,7 +1087,7 @@ static inline struct bkey *__bch_btree_iter_next(struct btree_iter *iter, } if (iter->data->k == iter->data->end) - heap_pop(iter, unused, cmp); + heap_pop(iter, b, cmp); else heap_sift(iter, 0, cmp); } diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index c94085f400a4..acd0e5c074dd 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -493,7 +493,7 @@ static void journal_reclaim(struct cache_set *c) struct cache *ca; uint64_t last_seq; unsigned iter, n = 0; - atomic_t p; + atomic_t p __maybe_unused; atomic_long_inc(>reclaim); -- 2.14.1
[for-4.17 08/20] bcache: fix error return value in memory shrink
From: Tang Junhui <tang.jun...@zte.com.cn> In bch_mca_scan(), the return value should not be the number of freed btree nodes, but the number of pages of freed btree nodes. Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/btree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 39cc8a549091..b2d4899f48d5 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -719,7 +719,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink, } out: mutex_unlock(>bucket_lock); - return freed; + return freed * c->btree_pages; } static unsigned long bch_mca_count(struct shrinker *shrink, -- 2.14.1
[for-4.17 09/20] bcache: fix using of loop variable in memory shrink
From: Tang Junhui <tang.jun...@zte.com.cn>

In bch_mca_scan() there is some confusion and a logical error in the use of the loop variables. In this patch, we clarify them as:
1) nr: the number of btree nodes that still need to be scanned; it decreases after we scan a btree node and should not go below 0;
2) i: the number of btree nodes already scanned, across both btree_cache_freeable and btree_cache; it should not exceed btree_cache_used;
3) freed: the number of btree nodes that have been freed.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/btree.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index b2d4899f48d5..d64aff0b8abc 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -665,6 +665,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
 	struct btree *b, *t;
 	unsigned long i, nr = sc->nr_to_scan;
 	unsigned long freed = 0;
+	unsigned int btree_cache_used;
 
 	if (c->shrinker_disabled)
 		return SHRINK_STOP;
@@ -689,9 +690,10 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
 	nr = min_t(unsigned long, nr, mca_can_free(c));
 
 	i = 0;
+	btree_cache_used = c->btree_cache_used;
 	list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
-		if (freed >= nr)
-			break;
+		if (nr <= 0)
+			goto out;
 
 		if (++i > 3 &&
 		    !mca_reap(b, 0, false)) {
@@ -699,9 +701,10 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
 			rw_unlock(true, b);
 			freed++;
 		}
+		nr--;
 	}
 
-	for (i = 0; (nr--) && i < c->btree_cache_used; i++) {
+	for (; (nr--) && i < btree_cache_used; i++) {
 		if (list_empty(&c->btree_cache))
 			goto out;
-- 
2.14.1
[for-4.17 10/20] bcache: move closure debug file into debug directory
From: Chengguang Xu <cgxu...@gmx.com>

In the current code the closure debug file lives outside of the debug directory, and it is not removed when the module is unloaded, so re-creating it fails when the module is reloaded. This patch moves the closure debug file into the "bcache" debug directory so that the file gets deleted properly.

Signed-off-by: Chengguang Xu <cgxu...@gmx.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/closure.c |  9 +
 drivers/md/bcache/closure.h |  5 +++--
 drivers/md/bcache/debug.c   | 14 +++---
 drivers/md/bcache/super.c   |  3 +--
 4 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 7f12920c14f7..c0949c9f843b 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -157,7 +157,7 @@ void closure_debug_destroy(struct closure *cl)
 }
 EXPORT_SYMBOL(closure_debug_destroy);
 
-static struct dentry *debug;
+static struct dentry *closure_debug;
 
 static int debug_seq_show(struct seq_file *f, void *data)
 {
@@ -199,11 +199,12 @@ static const struct file_operations debug_ops = {
 	.release	= single_release
 };
 
-void __init closure_debug_init(void)
+int __init closure_debug_init(void)
 {
-	debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
+	closure_debug = debugfs_create_file("closures",
+				0400, bcache_debug, NULL, &debug_ops);
+	return IS_ERR_OR_NULL(closure_debug);
 }
-
 #endif
 
 MODULE_AUTHOR("Kent Overstreet <koverstr...@google.com>");
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 3b9dfc9962ad..71427eb5fdae 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -105,6 +105,7 @@ struct closure;
 struct closure_syncer;
 typedef void (closure_fn) (struct closure *);
+extern struct dentry *bcache_debug;
 
 struct closure_waitlist {
 	struct llist_head	list;
@@ -185,13 +186,13 @@ static inline void closure_sync(struct closure *cl)
#ifdef CONFIG_BCACHE_CLOSURES_DEBUG -void closure_debug_init(void); +int closure_debug_init(void); void closure_debug_create(struct closure *cl); void closure_debug_destroy(struct closure *cl); #else -static inline void closure_debug_init(void) {} +static inline int closure_debug_init(void) { return 0; } static inline void closure_debug_create(struct closure *cl) {} static inline void closure_debug_destroy(struct closure *cl) {} diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index af89408befe8..028f7b386e01 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -17,7 +17,7 @@ #include #include -static struct dentry *debug; +struct dentry *bcache_debug; #ifdef CONFIG_BCACHE_DEBUG @@ -232,11 +232,11 @@ static const struct file_operations cache_set_debug_ops = { void bch_debug_init_cache_set(struct cache_set *c) { - if (!IS_ERR_OR_NULL(debug)) { + if (!IS_ERR_OR_NULL(bcache_debug)) { char name[50]; snprintf(name, 50, "bcache-%pU", c->sb.set_uuid); - c->debug = debugfs_create_file(name, 0400, debug, c, + c->debug = debugfs_create_file(name, 0400, bcache_debug, c, _set_debug_ops); } } @@ -245,13 +245,13 @@ void bch_debug_init_cache_set(struct cache_set *c) void bch_debug_exit(void) { - if (!IS_ERR_OR_NULL(debug)) - debugfs_remove_recursive(debug); + if (!IS_ERR_OR_NULL(bcache_debug)) + debugfs_remove_recursive(bcache_debug); } int __init bch_debug_init(struct kobject *kobj) { - debug = debugfs_create_dir("bcache", NULL); + bcache_debug = debugfs_create_dir("bcache", NULL); - return IS_ERR_OR_NULL(debug); + return IS_ERR_OR_NULL(bcache_debug); } diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index dab9f59a1a2f..26b7a7d4df15 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2239,7 +2239,6 @@ static int __init bcache_init(void) mutex_init(_register_lock); init_waitqueue_head(_wait); register_reboot_notifier(); - closure_debug_init(); bcache_major = register_blkdev(0, "bcache"); if 
(bcache_major < 0) { @@ -2251,7 +2250,7 @@ static int __init bcache_init(void) if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) || !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) || bch_request_init() || - bch_debug_init(bcache_kobj) || + bch_debug_init(bcache_kobj) || closure_debug_init() || sysfs_create_files(bcache_kobj, files)) goto err; -- 2.14.1
[for-4.17 01/20] bcache: fix cached_dev->count usage for bch_cache_set_error()
From: Coly Li <col...@suse.de>

When bcache metadata I/O fails, bcache will call bch_cache_set_error() to retire the whole cache set. The expected behavior when retiring a cache set is to unregister the cache set and all backing devices attached to it, then remove the sysfs entries of the cache set and all attached backing devices, and finally release the memory of structs cache_set, cache, cached_dev and bcache_device.

In my testing, when a journal I/O failure is triggered by a disconnected cache device, sometimes the cache set cannot be retired: its sysfs entry /sys/fs/bcache/ still exists and the backing device still references it. This is not the expected behavior.

When metadata I/O fails, the call sequence to retire the whole cache set is:

bch_cache_set_error()
  bch_cache_set_unregister()
    bch_cache_set_stop()
      __cache_set_unregister()     <- called as a callback via closure_queue(&c->caching)
        cache_set_flush()          <- called as a callback when the refcount of cache_set->caching is 0
          cache_set_free()         <- called as a callback when the refcount of cache_set->cl is 0
            bch_cache_set_release() <- called as a callback when the refcount of cache_set->kobj is 0

I find that if the kernel thread bch_writeback_thread() quits its while-loop when kthread_should_stop() is true and searched_full_index is false, the closure callback cache_set_flush() set by continue_at() will never be called. The result is that bcache fails to retire the whole cache set.

cache_set_flush() is called when the refcount of closure c->caching is 0, and in bcache_device_detach() the refcount of closure c->caching is dropped to 0 by closure_put(). In the metadata error code path, bcache_device_detach() is called by cached_dev_detach_finish(), a callback routine invoked when cached_dev->count is 0. This refcount is decreased by cached_dev_put(). The above dependency means cache_set_flush() is called when the refcount of cache_set->cl is 0, and the refcount of cache_set->cl reaches 0 when the refcount of cached_dev->count is 0.

The reason why cached_dev->count is sometimes not 0 (when metadata I/O fails and bch_cache_set_error() is called) is that in bch_writeback_thread() the refcount of cached_dev is not decreased properly. In bch_writeback_thread(), cached_dev_put() is called only when searched_full_index is true and cached_dev->writeback_keys is empty, i.e. when there is no dirty data on the cache. Most of the time this is correct, but when bch_writeback_thread() quits the while-loop while the cache is still dirty, the current code forgets to call cached_dev_put() before the kernel thread exits. This is why cache_set_flush() is sometimes not executed and the cache set fails to be retired.

The reason cached_dev_get() is called when the cache device changes from clean to dirty is to make sure that during writeback operations neither the backing device nor the cache device can be released.

Adding the following code in bch_writeback_thread() does not work:

static int bch_writeback_thread(void *arg)
 	}

+	if (atomic_read(&dc->has_dirty))
+		cached_dev_put()
+
 	return 0;
 }

because the writeback kernel thread can be woken up and started via the sysfs entry:

echo 1 > /sys/block/bcache/bcache/writeback_running

It is difficult to check whether the backing device is dirty without races and extra locks, so the above modification would introduce a potential refcount underflow in some conditions.

The correct fix is to take the cached dev refcount when creating the kernel thread, and put it before the kernel thread exits. Then bcache does not need to take a cached dev refcount when the cache turns from clean to dirty, or put one when the cache turns from dirty to clean. The writeback kernel thread is then always safe to reference data structures from the cache set, cache and cached device (because a refcount of the cached device is already taken for it), and no matter whether the kernel thread is stopped by I/O errors or a system reboot, cached_dev->count can always be used correctly.

The patch is simple, but understanding how it works is quite complicated.

Changelog:
v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes.
v1: initial version for review.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Michael Lyle <ml...@lyle.org>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/super.c     |  1 -
 drivers/md/bcache/writeback.c | 11 ---
 drivers/md/bcache/writeback.h |  2 --
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f
[for-4.17 12/20] bcache: add io_disable to struct cached_dev
From: Coly Li <col...@suse.de>

If a bcache device is configured in writeback mode, the current code does not handle write I/O errors on the backing device properly. In writeback mode, a write request is written to the cache device and later flushed to the backing device. If I/O fails when writing from the cache device to the backing device, bcache just ignores the error, and the upper layer is NOT notified that the backing device is broken.

This patch tries to handle backing device failure the same way cache device failure is handled:
- Add an error counter 'io_errors' and an error limit 'error_limit' in struct cached_dev. Add another io_disable to struct cached_dev to disable I/Os on the problematic backing device.
- When an I/O error happens on the backing device, increase the io_errors counter. If io_errors reaches error_limit, set cached_dev->io_disable to true and stop the bcache device.

The result is that if the backing device is broken or disconnected and I/O errors reach its error limit, the backing device will be disabled and the associated bcache device removed from the system.

Changelog:
v2: remove "bcache: " prefix in pr_error(), and use the correct name string to print out the bcache device gendisk name.
v1: indeed this is newly added in the v2 patch set.
Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Hannes Reinecke <h...@suse.com> Reviewed-by: Michael Lyle <ml...@lyle.org> Cc: Michael Lyle <ml...@lyle.org> Cc: Junhui Tang <tang.jun...@zte.com.cn> --- drivers/md/bcache/bcache.h | 6 ++ drivers/md/bcache/io.c | 14 ++ drivers/md/bcache/request.c | 14 -- drivers/md/bcache/super.c | 21 + drivers/md/bcache/sysfs.c | 15 ++- 5 files changed, 67 insertions(+), 3 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 5e9f3610c6fd..d338b7086013 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -367,6 +367,7 @@ struct cached_dev { unsignedsequential_cutoff; unsignedreadahead; + unsignedio_disable:1; unsignedverify:1; unsignedbypass_torture_test:1; @@ -388,6 +389,9 @@ struct cached_dev { unsignedwriteback_rate_minimum; enum stop_on_failurestop_when_cache_set_failed; +#define DEFAULT_CACHED_DEV_ERROR_LIMIT 64 + atomic_tio_errors; + unsignederror_limit; }; enum alloc_reserve { @@ -911,6 +915,7 @@ static inline void wait_for_kthread_stop(void) /* Forward declarations */ +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio); void bch_count_io_errors(struct cache *, blk_status_t, int, const char *); void bch_bbio_count_io_errors(struct cache_set *, struct bio *, blk_status_t, const char *); @@ -938,6 +943,7 @@ int bch_bucket_alloc_set(struct cache_set *, unsigned, struct bkey *, int, bool); bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned, unsigned, unsigned, bool); +bool bch_cached_dev_error(struct cached_dev *dc); __printf(2, 3) bool bch_cache_set_error(struct cache_set *, const char *, ...); diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c index 8013ecbcdbda..7fac97ae036e 100644 --- a/drivers/md/bcache/io.c +++ b/drivers/md/bcache/io.c @@ -50,6 +50,20 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c, } /* IO errors */ +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio) +{ + char 
buf[BDEVNAME_SIZE]; + unsigned errors; + + WARN_ONCE(!dc, "NULL pointer of struct cached_dev"); + + errors = atomic_add_return(1, >io_errors); + if (errors < dc->error_limit) + pr_err("%s: IO error on backing device, unrecoverable", + bio_devname(bio, buf)); + else + bch_cached_dev_error(dc); +} void bch_count_io_errors(struct cache *ca, blk_status_t error, diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index b4a5768afbe9..5a82237c7025 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -637,6 +637,8 @@ static void backing_request_endio(struct bio *bio) if (bio->bi_status) { struct search *s = container_of(cl, struct search, cl); + struct cached_dev *dc = container_of(s->d, +struct cached_dev, disk); /* * If a bio has REQ_PREFLUSH for writeback mode, it is * speically assembled in cached_dev_write() for a non-zero @@ -657,6 +659,7 @@ static void backing_request_endio(struct bio *bio) } s->
[for-4.17 13/20] bcache: Fix indentation
From: Bart Van Assche <bart.vanass...@wdc.com> This patch avoids that smatch complains about inconsistent indentation. Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> Reviewed-by: Michael Lyle <ml...@lyle.org> Reviewed-by: Coly Li <col...@suse.de> --- drivers/md/bcache/btree.c | 2 +- drivers/md/bcache/writeback.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index d64aff0b8abc..143ed5a758e7 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -2178,7 +2178,7 @@ int bch_btree_insert_check_key(struct btree *b, struct btree_op *op, if (b->key.ptr[0] != btree_ptr || b->seq != seq + 1) { - op->lock = b->level; + op->lock = b->level; goto out; } } diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h index 0bba8f1c6cdf..610fb01de629 100644 --- a/drivers/md/bcache/writeback.h +++ b/drivers/md/bcache/writeback.h @@ -39,7 +39,7 @@ static inline uint64_t bcache_flash_devs_sectors_dirty(struct cache_set *c) if (!d || !UUID_FLASH_ONLY(>uuids[i])) continue; - ret += bcache_dev_sectors_dirty(d); + ret += bcache_dev_sectors_dirty(d); } mutex_unlock(_register_lock); -- 2.14.1
Re: [PATCH 07/16] bcache: Reduce the number of sparse complaints about lock imbalances
On 03/15/2018 08:08 AM, Bart Van Assche wrote:
> Add more annotations for sparse to inform it about which functions do
> not have the same number of spin_lock() and spin_unlock() calls.
>
> Signed-off-by: Bart Van Assche

LGTM, applying.

-mpl
Re: [PATCH 05/16] bcache: Remove an unused variable
On 03/15/2018 08:08 AM, Bart Van Assche wrote:
> Signed-off-by: Bart Van Assche

LGTM, applying. (I don't see a problem with an empty body.)
Re: [PATCH 03/16] bcache: Annotate switch fall-through
On 03/15/2018 08:08 AM, Bart Van Assche wrote:
> This patch avoids that building with W=1 triggers complaints about
> switch fall-throughs.
>
> Signed-off-by: Bart Van Assche

LGTM, applying.
Re: [PATCH 04/16] bcache: Fix kernel-doc warnings
On 03/15/2018 08:08 AM, Bart Van Assche wrote:
> Avoid that building with W=1 triggers warnings about the kernel-doc
> headers.
>
> Signed-off-by: Bart Van Assche

LGTM, applying.
Re: [PATCH 02/16] bcache: Add __printf annotation to __bch_check_keys()
On Thu, Mar 15, 2018 at 8:08 AM, Bart Van Assche wrote:
> Make it possible for the compiler to verify the consistency of the
> format string passed to __bch_check_keys() and the arguments that
> should be formatted according to that format string.
>
> Signed-off-by: Bart Van Assche

LGTM, applying.
Re: [PATCH 01/16] bcache: Fix indentation
On Thu, Mar 15, 2018 at 8:07 AM, Bart Van Assche wrote:
> This patch avoids that smatch complains about inconsistent indentation.
>
> Signed-off-by: Bart Van Assche

LGTM, applying.
Re: [PATCH 08/16] bcache: Fix a compiler warning in bcache_device_init()
On Thu, Mar 15, 2018 at 8:08 AM, Bart Van Assche wrote:
> Avoid that building with W=1 triggers the following compiler warning:
>
> drivers/md/bcache/super.c:776:20: warning: comparison is always false due to limited range of data type [-Wtype-limits]
>   d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
>                 ^
>
> Signed-off-by: Bart Van Assche

LGTM, applying.
Re: [PATCH v7 8/9] bcache: add io_disable to struct cached_dev
On 02/27/2018 08:55 AM, Coly Li wrote: > If a bcache device is configured to writeback mode, current code does not > handle write I/O errors on backing devices properly. > > In writeback mode, write request is written to cache device, and > latter being flushed to backing device. If I/O failed when writing from > cache device to the backing device, bcache code just ignores the error and > upper layer code is NOT noticed that the backing device is broken. lgtm, applied
Re: [PATCH v7 8/9] bcache: add io_disable to struct cached_dev
LGTM, applying. On 02/27/2018 08:55 AM, Coly Li wrote: > If a bcache device is configured to writeback mode, current code does not > handle write I/O errors on backing devices properly. > > In writeback mode, write request is written to cache device, and > latter being flushed to backing device. If I/O failed when writing from > cache device to the backing device, bcache code just ignores the error and > upper layer code is NOT noticed that the backing device is broken. > > This patch tries to handle backing device failure like how the cache device > failure is handled, > - Add a error counter 'io_errors' and error limit 'error_limit' in struct > cached_dev. Add another io_disable to struct cached_dev to disable I/Os > on the problematic backing device. > - When I/O error happens on backing device, increase io_errors counter. And > if io_errors reaches error_limit, set cache_dev->io_disable to true, and > stop the bcache device. > > The result is, if backing device is broken of disconnected, and I/O errors > reach its error limit, backing device will be disabled and the associated > bcache device will be removed from system. > > Changelog: > v2: remove "bcache: " prefix in pr_error(), and use correct name string to > print out bcache device gendisk name. > v1: indeed this is new added in v2 patch set. 
> > Signed-off-by: Coly Li <col...@suse.de> > Reviewed-by: Hannes Reinecke <h...@suse.com> > Cc: Michael Lyle <ml...@lyle.org> > Cc: Junhui Tang <tang.jun...@zte.com.cn> > --- > drivers/md/bcache/bcache.h | 6 ++ > drivers/md/bcache/io.c | 14 ++ > drivers/md/bcache/request.c | 14 -- > drivers/md/bcache/super.c | 21 + > drivers/md/bcache/sysfs.c | 15 ++- > 5 files changed, 67 insertions(+), 3 deletions(-) > > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > index 5e9f3610c6fd..d338b7086013 100644 > --- a/drivers/md/bcache/bcache.h > +++ b/drivers/md/bcache/bcache.h > @@ -367,6 +367,7 @@ struct cached_dev { > unsignedsequential_cutoff; > unsignedreadahead; > > + unsignedio_disable:1; > unsignedverify:1; > unsignedbypass_torture_test:1; > > @@ -388,6 +389,9 @@ struct cached_dev { > unsignedwriteback_rate_minimum; > > enum stop_on_failurestop_when_cache_set_failed; > +#define DEFAULT_CACHED_DEV_ERROR_LIMIT 64 > + atomic_tio_errors; > + unsignederror_limit; > }; > > enum alloc_reserve { > @@ -911,6 +915,7 @@ static inline void wait_for_kthread_stop(void) > > /* Forward declarations */ > > +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio); > void bch_count_io_errors(struct cache *, blk_status_t, int, const char *); > void bch_bbio_count_io_errors(struct cache_set *, struct bio *, > blk_status_t, const char *); > @@ -938,6 +943,7 @@ int bch_bucket_alloc_set(struct cache_set *, unsigned, >struct bkey *, int, bool); > bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned, > unsigned, unsigned, bool); > +bool bch_cached_dev_error(struct cached_dev *dc); > > __printf(2, 3) > bool bch_cache_set_error(struct cache_set *, const char *, ...); > diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c > index 8013ecbcdbda..7fac97ae036e 100644 > --- a/drivers/md/bcache/io.c > +++ b/drivers/md/bcache/io.c > @@ -50,6 +50,20 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c, > } > > /* IO errors */ > +void 
bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio) > +{ > + char buf[BDEVNAME_SIZE]; > + unsigned errors; > + > + WARN_ONCE(!dc, "NULL pointer of struct cached_dev"); > + > + errors = atomic_add_return(1, >io_errors); > + if (errors < dc->error_limit) > + pr_err("%s: IO error on backing device, unrecoverable", > + bio_devname(bio, buf)); > + else > + bch_cached_dev_error(dc); > +} > > void bch_count_io_errors(struct cache *ca, >blk_status_t error, > diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c > index 0c517dd806a5..d7a463e0250e 100644 > --- a/drivers/md/bcache/request.c > +++ b/drivers/md/bcache/request.c > @@ -637,6 +637,8 @@ static void backing_request_endio(struct bio *bio) > > if (bio->bi_status) { > struct search *s = container_o
Re: [PATCH v7 7/9] bcache: add backing_request_endio() for bi_end_io of attached backing device I/O
LGTM, applying On 02/27/2018 08:55 AM, Coly Li wrote: > In order to catch I/O error of backing device, a separate bi_end_io > call back is required. Then a per backing device counter can record I/O > errors number and retire the backing device if the counter reaches a > per backing device I/O error limit. > > This patch adds backing_request_endio() to bcache backing device I/O code > path, this is a preparation for further complicated backing device failure > handling. So far there is no real code logic change, I make this change a > separate patch to make sure it is stable and reliable for further work. > > Changelog: > v2: Fix code comments typo, remove a redundant bch_writeback_add() line > added in v4 patch set. > v1: indeed this is new added in this patch set. > > Signed-off-by: Coly Li <col...@suse.de> > Reviewed-by: Hannes Reinecke <h...@suse.com> > Cc: Junhui Tang <tang.jun...@zte.com.cn> > Cc: Michael Lyle <ml...@lyle.org> > --- > drivers/md/bcache/request.c | 93 > +++ > drivers/md/bcache/super.c | 1 + > drivers/md/bcache/writeback.c | 1 + > 3 files changed, 79 insertions(+), 16 deletions(-) > > diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c > index 279c9266bf50..0c517dd806a5 100644 > --- a/drivers/md/bcache/request.c > +++ b/drivers/md/bcache/request.c > @@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl) > } > > op->insert_data_done = true; > + /* get in bch_data_insert() */ > bio_put(bio); > out: > continue_at(cl, bch_data_insert_keys, op->wq); > @@ -630,6 +631,38 @@ static void request_endio(struct bio *bio) > closure_put(cl); > } > > +static void backing_request_endio(struct bio *bio) > +{ > + struct closure *cl = bio->bi_private; > + > + if (bio->bi_status) { > + struct search *s = container_of(cl, struct search, cl); > + /* > + * If a bio has REQ_PREFLUSH for writeback mode, it is > + * speically assembled in cached_dev_write() for a non-zero > + * write request which has REQ_PREFLUSH. 
we don't set > + * s->iop.status by this failure, the status will be decided > + * by result of bch_data_insert() operation. > + */ > + if (unlikely(s->iop.writeback && > + bio->bi_opf & REQ_PREFLUSH)) { > + char buf[BDEVNAME_SIZE]; > + > + bio_devname(bio, buf); > + pr_err("Can't flush %s: returned bi_status %i", > + buf, bio->bi_status); > + } else { > + /* set to orig_bio->bi_status in bio_complete() */ > + s->iop.status = bio->bi_status; > + } > + s->recoverable = false; > + /* should count I/O error for backing device here */ > + } > + > + bio_put(bio); > + closure_put(cl); > +} > + > static void bio_complete(struct search *s) > { > if (s->orig_bio) { > @@ -644,13 +677,21 @@ static void bio_complete(struct search *s) > } > } > > -static void do_bio_hook(struct search *s, struct bio *orig_bio) > +static void do_bio_hook(struct search *s, > + struct bio *orig_bio, > + bio_end_io_t *end_io_fn) > { > struct bio *bio = >bio.bio; > > bio_init(bio, NULL, 0); > __bio_clone_fast(bio, orig_bio); > - bio->bi_end_io = request_endio; > + /* > + * bi_end_io can be set separately somewhere else, e.g. the > + * variants in, > + * - cache_bio->bi_end_io from cached_dev_cache_miss() > + * - n->bi_end_io from cache_lookup_fn() > + */ > + bio->bi_end_io = end_io_fn; > bio->bi_private = >cl; > > bio_cnt_set(bio, 3); > @@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio, > s = mempool_alloc(d->c->search, GFP_NOIO); > > closure_init(>cl, NULL); > - do_bio_hook(s, bio); > + do_bio_hook(s, bio, request_endio); > > s->orig_bio = bio; > s->cache_miss = NULL; > @@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl) > trace_bcache_read_retry(s->orig_bio); > > s->iop.status = 0; > - do_bio_hook(s, s->orig_bio); > + do_bio_hook(s, s->orig_bio, backing_request_endio); >
Re: [PATCH v2] bcache: move closure debug file into debug directory
On 03/06/2018 07:23 PM, Chengguang Xu wrote: > In current code closure debug file is outside of debug directory > and when unloading module there is lack of removing operation > for closure debug file, so it will cause creating error when trying > to reload module. > > This patch moves closure debug file into the "bcache" debug directory > so that the file can get deleted properly. > > Signed-off-by: Chengguang Xu

Thanks! This looks much better and is applied.

Mike
Re: [PATCH v2] bcache: move closure debug file into debug directory
Sorry- I had to pull/unapply this actually. On 03/04/2018 11:40 PM, Chengguang Xu wrote: > -static struct dentry *debug > +struct dentry *debug; This conflicts with other symbols called "debug" and doesn't compile. Please be sure that your patch set compiles before submitting. Mike > > #ifdef CONFIG_BCACHE_DEBUG > > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c > index 1a9fdab..b784292 100644 > --- a/drivers/md/bcache/super.c > +++ b/drivers/md/bcache/super.c > @@ -2133,7 +2133,6 @@ static int __init bcache_init(void) > mutex_init(_register_lock); > init_waitqueue_head(_wait); > register_reboot_notifier(); > - closure_debug_init(); > > bcache_major = register_blkdev(0, "bcache"); > if (bcache_major < 0) { > @@ -2145,7 +2144,7 @@ static int __init bcache_init(void) > if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) || > !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) || > bch_request_init() || > - bch_debug_init(bcache_kobj) || > + bch_debug_init(bcache_kobj) || closure_debug_init() || > sysfs_create_files(bcache_kobj, files)) > goto err; > >
Re: Some bcache patches need reveiw
Hey Tang Junhui-- You're right. I'm sorry, these had slipped through the cracks / I had lost track of them. I also have the rest of Coly's patchset to work through. I was able to apply the first two, and will continue to work on my for-next branch. Mike On 03/05/2018 02:39 AM, tang.jun...@zte.com.cn wrote: > Hello Mike > > I send some patches some times before, with no response, > > Bellow patches are very simple, can you or anybody else have a review? > [PATCH] bcache: fix incorrect sysfs output value of strip size > [PATCH] bcache: fix error return value in memory shrink > [PATCH] bcache: fix error count in memory shrink > > Bellow patches are simple too, and important for me and also important for > bcache, > can you or anybody else have a review? > [PATCH] bcache: finish incremental GC > [PATCH] bcache: calculate the number of incremental GC nodes according to the > total of btree nodes > > Thanks, > Tang Junhui > -- > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: [PATCH] bcache: fix error count in memory shrink
Hi Tang Junhui--- I'm not really sure about this one. It changes the semantics of the amount of work done-- nr_to_scan now means number of things to free instead of the number to check. If the system is under severe memory pressure, and most of the cache is essential/actively used, this could greatly increase the amount of work done. As to the i < btree_cache_used, it's arguably a bug, but it only reduces the amount of work done if the cache is very, very small. Mike On 01/30/2018 06:46 PM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui> > In bch_mca_scan(), nr btree nodes are expected to shrink, so the for(;;) > loop need to satisfy the condition freed < nr. > And since c->btree_cache_used always decrease after mca_data_free() calling > in for(;;) loop, so we need a auto variable to record c->btree_cache_used > before the for(;;) loop. > > Signed-off-by: Tang Junhui > --- > drivers/md/bcache/btree.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c > index 81e8dc3..7e9ef3d 100644 > --- a/drivers/md/bcache/btree.c > +++ b/drivers/md/bcache/btree.c > @@ -664,6 +664,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink, > struct btree *b, *t; > unsigned long i, nr = sc->nr_to_scan; > unsigned long freed = 0; > + unsigned int btree_cache_used; > > if (c->shrinker_disabled) > return SHRINK_STOP; > @@ -700,7 +701,8 @@ static unsigned long bch_mca_scan(struct shrinker *shrink, > } > } > > - for (i = 0; (nr--) && i < c->btree_cache_used; i++) { > + btree_cache_used = c->btree_cache_used; > + for (i = 0; freed < nr && i < btree_cache_used; i++) { > if (list_empty(>btree_cache)) > goto out; > >
Re: [PATCH] bcache: fix error return value in memory shrink
LGTM, applied On 01/30/2018 07:30 PM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui> > In bch_mca_scan(), the return value should not be the number of freed btree > nodes, but the number of pages of freed btree nodes. > > Signed-off-by: Tang Junhui > --- > drivers/md/bcache/btree.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c > index 81e8dc3..a5fbccc 100644 > --- a/drivers/md/bcache/btree.c > +++ b/drivers/md/bcache/btree.c > @@ -718,7 +718,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink, > } > out: > mutex_unlock(>bucket_lock); > - return freed; > + return freed * c->btree_pages; > } > > static unsigned long bch_mca_count(struct shrinker *shrink, >
Re: [PATCH] bcache: fix incorrect sysfs output value of strip size
LGTM, applied. On 02/10/2018 06:30 PM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui> > Stripe size is shown as zero when no strip in back end device: > [root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size > 0.0k > > Actually it should be 1T Bytes (1 << 31 sectors), but in sysfs > interface, stripe_size was changed from sectors to bytes, and move > 9 bits left, so the 32 bits variable overflows. > > This patch change the variable to a 64 bits type before moving bits. > > Signed-off-by: Tang Junhui > --- > drivers/md/bcache/sysfs.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c > index b418409..240cbec 100644 > --- a/drivers/md/bcache/sysfs.c > +++ b/drivers/md/bcache/sysfs.c > @@ -170,7 +170,7 @@ > sysfs_hprint(dirty_data, >bcache_dev_sectors_dirty(>disk) << 9); > > - sysfs_hprint(stripe_size, dc->disk.stripe_size << 9); > + sysfs_hprint(stripe_size,((uint64_t)dc->disk.stripe_size) << 9); > var_printf(partial_stripes_expensive, "%u"); > > var_hprint(sequential_cutoff); >
Re: [PATCH] bcache: don't attach backing with duplicate UUID
Hi Tang Junhui--- Thanks for your review. I just sent it upstream (with your change) to Jens. Mike On 03/04/2018 05:07 PM, tang.jun...@zte.com.cn wrote: > Hello Mike > > I send the email from my personal mailbox(110950...@qq.com), it may be fail, > so I resend this email from my office mailbox again. bellow is the mail > context I send previous. > > > > I am Tang Junhui(tang.jun...@zte.com.cn), This email comes from my > personal mailbox, since I am not in office today. > >>> From: Tang Junhui>>> >>> Hello, Mike >>> >>> This patch looks good, but has some conflicts with this patch: >>> bcache: fix for data collapse after re-attaching an attached device >>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=73ac105be390c1de42a2f21643c9778a5e002930 >>> Could you modify your fix base on the previous patch? > >> That doesn't make sense. This patch was generated from a current tree >> where it's applied on top of that: (It's based on next when it should >> really be based on Linus's tree, but it doesn't matter for patch >> application because there's no changes in next right now to bcache that >> aren't in Linus's tree). > > Originally, I did not mean merger conflicts, but the > code logical conflicts, since the previous patch add a new input parameter > set_uuid in bch_cached_dev_attach(), and if set_uuid is not NULL, > we use set_uuid as cache set uuid, otherwise, we use dc->sb.set_uuid as > the cache set uuid. > > But now, I read your patch again, and realize that you did not use > dc->sb.set_uuid, but use dc->sb.uuid to judge whether the device is a > duplicate backend device, so it's OK for me. > >> May I add your reviewed-by so I can send this (and your fix) upstream? 
> Reviewed-by: Tang Junhui > > Thanks > Tang Junhui > > Thanks > Tang Junhui > -- > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
[PATCH 1/2] bcache: fix crashes in duplicate cache device register
From: Tang Junhui <tang.jun...@zte.com.cn> Kernel crashed when register a duplicate cache device, the call trace is bellow: [ 417.643790] CPU: 1 PID: 16886 Comm: bcache-register Tainted: G W OE4.15.5-amd64-preempt-sysrq-20171018 #2 [ 417.643861] Hardware name: LENOVO 20ERCTO1WW/20ERCTO1WW, BIOS N1DET41W (1.15 ) 12/31/2015 [ 417.643870] RIP: 0010:bdevname+0x13/0x1e [ 417.643876] RSP: 0018:a3aa9138fd38 EFLAGS: 00010282 [ 417.643884] RAX: RBX: 8c8f2f2f8000 RCX: d6701f8 c7edf [ 417.643890] RDX: a3aa9138fd88 RSI: a3aa9138fd88 RDI: 000 0 [ 417.643895] RBP: a3aa9138fde0 R08: a3aa9138fae8 R09: 000 1850e [ 417.643901] R10: 8c8eed34b271 R11: 8c8eed34b250 R12: 000 0 [ 417.643906] R13: d6701f78f940 R14: 8c8f38f8 R15: 8c8ea7d 9 [ 417.643913] FS: 7fde7e66f500() GS:8c8f6144() knlGS: [ 417.643919] CS: 0010 DS: ES: CR0: 80050033 [ 417.643925] CR2: 0314 CR3: 0007e6fa0001 CR4: 003 606e0 [ 417.643931] DR0: DR1: DR2: 000 0 [ 417.643938] DR3: DR6: fffe0ff0 DR7: 000 00400 [ 417.643946] Call Trace: [ 417.643978] register_bcache+0x1117/0x1270 [bcache] [ 417.643994] ? slab_pre_alloc_hook+0x15/0x3c [ 417.644001] ? slab_post_alloc_hook.isra.44+0xa/0x1a [ 417.644013] ? kernfs_fop_write+0xf6/0x138 [ 417.644020] kernfs_fop_write+0xf6/0x138 [ 417.644031] __vfs_write+0x31/0xcc [ 417.644043] ? current_kernel_time64+0x10/0x36 [ 417.644115] ? 
__audit_syscall_entry+0xbf/0xe3 [ 417.644124] vfs_write+0xa5/0xe2 [ 417.644133] SyS_write+0x5c/0x9f [ 417.644144] do_syscall_64+0x72/0x81 [ 417.644161] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 417.644169] RIP: 0033:0x7fde7e1c1974 [ 417.644175] RSP: 002b:7fff13009a38 EFLAGS: 0246 ORIG_RAX: 000 1 [ 417.644183] RAX: ffda RBX: 01658280 RCX: 7fde7e1c 1974 [ 417.644188] RDX: 000a RSI: 01658280 RDI: 0001 [ 417.644193] RBP: 000a R08: 0003 R09: 0077 [ 417.644198] R10: 089e R11: 0246 R12: 0001 [ 417.644203] R13: 000a R14: 7fff R15: [ 417.644213] Code: c7 c2 83 6f ee 98 be 20 00 00 00 48 89 df e8 6c 27 3b 0 0 48 89 d8 5b c3 0f 1f 44 00 00 48 8b 47 70 48 89 f2 48 8b bf 80 00 00 00 <8 b> b0 14 03 00 00 e9 73 ff ff ff 0f 1f 44 00 00 48 8b 47 40 39 [ 417.644302] RIP: bdevname+0x13/0x1e RSP: a3aa9138fd38 [ 417.644306] CR2: 0314 When registering duplicate cache device in register_cache(), after failure on calling register_cache_set(), bch_cache_release() will be called, then bdev will be freed, so bdevname(bdev, name) caused kernel crash. Since bch_cache_release() will free bdev, so in this patch we make sure bdev being freed if register_cache() fail, and do not free bdev again in register_bcache() when register_cache() fail. 
Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> Reported-by: Marc MERLIN <m...@merlins.org> Tested-by: Michael Lyle <ml...@lyle.org> Reviewed-by: Michael Lyle <ml...@lyle.org> Cc: <sta...@vger.kernel.org> --- drivers/md/bcache/super.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 312895788036..9c141a8aaacc 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1204,7 +1204,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page, return; err: - pr_notice("error opening %s: %s", bdevname(bdev, name), err); + pr_notice("error %s: %s", bdevname(bdev, name), err); bcache_device_stop(>disk); } @@ -1883,6 +1883,8 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page, const char *err = NULL; /* must be set for any error case */ int ret = 0; + bdevname(bdev, name); + memcpy(>sb, sb, sizeof(struct cache_sb)); ca->bdev = bdev; ca->bdev->bd_holder = ca; @@ -1891,11 +1893,12 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page, bio_first_bvec_all(>sb_bio)->bv_page = sb_page; get_page(sb_page); - if (blk_queue_discard(bdev_get_queue(ca->bdev))) + if (blk_queue_discard(bdev_get_queue(bdev))) ca->discard = CACHE_DISCARD(>sb); ret = cache_alloc(ca); if (ret != 0) { + blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL); if (ret == -ENOMEM) err = "cache_alloc(): -ENOMEM"; else @@ -1918,14 +1921,14 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page, goto out;
[PATCH 2/2] bcache: don't attach backing with duplicate UUID
This can happen e.g. during disk cloning. This is an incomplete fix: it does not catch duplicate UUIDs earlier when things are still unattached. It does not unregister the device. Further changes to cope better with this are planned but conflict with Coly's ongoing improvements to handling device errors. In the meantime, one can manually stop the device after this has happened. Attempts to attach a duplicate device result in: [ 136.372404] loop: module loaded [ 136.424461] bcache: register_bdev() registered backing device loop0 [ 136.424464] bcache: bch_cached_dev_attach() Tried to attach loop0 but duplicate UUID already attached My test procedure is: dd if=/dev/sdb1 of=imgfile bs=1024 count=262144 losetup -f imgfile Signed-off-by: Michael Lyle <ml...@lyle.org> Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn> Cc: <sta...@vger.kernel.org> --- drivers/md/bcache/super.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 9c141a8aaacc..5cace6892958 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -963,6 +963,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, uint32_t rtime = cpu_to_le32(get_seconds()); struct uuid_entry *u; char buf[BDEVNAME_SIZE]; + struct cached_dev *exist_dc, *t; bdevname(dc->bdev, buf); @@ -987,6 +988,16 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, return -EINVAL; } + /* Check whether already attached */ + list_for_each_entry_safe(exist_dc, t, >cached_devs, list) { + if (!memcmp(dc->sb.uuid, exist_dc->sb.uuid, 16)) { + pr_err("Tried to attach %s but duplicate UUID already attached", + buf); + + return -EINVAL; + } + } + u = uuid_find(c, dc->sb.uuid); if (u && -- 2.14.1
[PATCH 0/2] Duplicate UUID fixes for 4.16
Jens-- Sorry for a couple of last-minute changes. Marc Merlin noticed an issue with disk imaging where if multiple devices were present with the same bcache UUID that a kernel panic could result. Tang Junhui fixed this. I found a related data corruption issue with duplicate backing devices. Both are minimal, reviewed, and tested. As always, thanks for your help. Mike
Re: [PATCH] bcache: don't attach backing with duplicate UUID
On 03/02/2018 06:55 AM, Michael Lyle wrote: > Hi Tang Junhui--- > > On 03/01/2018 09:45 PM, tang.jun...@zte.com.cn wrote: >> From: Tang Junhui <tang.jun...@zte.com.cn> >> >> Hello, Mike >> >> This patch looks good, but has some conflicts with this patch: >> bcache: fix for data collapse after re-attaching an attached device >> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=73ac105be390c1de42a2f21643c9778a5e002930 >> Could you modify your fix base on the previous patch? > > That doesn't make sense. This patch was generated from a current tree > where it's applied on top of that: (It's based on next when it should > really be based on Linus's tree, but it doesn't matter for patch > application because there's no changes in next right now to bcache that > aren't in Linus's tree). > > https://github.com/mlyle/linux/commit/502c38831b42100ac5d3319c792a66c7e01357ae > > May I add your reviewed-by so I can send this (and your fix) upstream? > > Mike

Just to confirm this all works---

mlyle@midnight:~/trees/linux$ git checkout linus/master
Note: checking out 'linus/master'. You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout. If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example: git checkout -b <new-branch-name>
HEAD is now at 5d60e057d127... Merge tag 'drm-fixes-for-v4.16-rc4' of git://people.freedesktop.org/~airlied/linux
mlyle@midnight:~/trees/linux$ git am 0001-bcache-fix-crashes-in-duplicate-cache-device-registe.patch
Applying: bcache: fix crashes in duplicate cache device register
mlyle@midnight:~/trees/linux$ git am 0001-bcache-don-t-attach-backing-with-duplicate-UUID.patch
Applying: bcache: don't attach backing with duplicate UUID

Mike
Re: [PATCH] bcache: don't attach backing with duplicate UUID
Hi Tang Junhui--- On 03/01/2018 09:45 PM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui> > Hello, Mike > > This patch looks good, but has some conflicts with this patch: > bcache: fix for data collapse after re-attaching an attached device > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=73ac105be390c1de42a2f21643c9778a5e002930 > Could you modify your fix base on the previous patch? That doesn't make sense. This patch was generated from a current tree where it's applied on top of that: (It's based on next when it should really be based on Linus's tree, but it doesn't matter for patch application because there's no changes in next right now to bcache that aren't in Linus's tree). https://github.com/mlyle/linux/commit/502c38831b42100ac5d3319c792a66c7e01357ae May I add your reviewed-by so I can send this (and your fix) upstream? Mike
[PATCH] bcache: don't attach backing with duplicate UUID
This can happen e.g. during disk cloning. This is an incomplete fix: it does not catch duplicate UUIDs earlier when things are still unattached. It does not unregister the device. Further changes to cope better with this are planned but conflict with Coly's ongoing improvements to handling device errors. In the meantime, one can manually stop the device after this has happened. Attempts to attach a duplicate device result in: [ 136.372404] loop: module loaded [ 136.424461] bcache: register_bdev() registered backing device loop0 [ 136.424464] bcache: bch_cached_dev_attach() Tried to attach loop0 but duplicate UUID already attached My test procedure is: dd if=/dev/sdb1 of=imgfile bs=1024 count=262144 losetup -f imgfile Signed-off-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/super.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 9c141a8aaacc..5cace6892958 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -963,6 +963,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, uint32_t rtime = cpu_to_le32(get_seconds()); struct uuid_entry *u; char buf[BDEVNAME_SIZE]; + struct cached_dev *exist_dc, *t; bdevname(dc->bdev, buf); @@ -987,6 +988,16 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, return -EINVAL; } + /* Check whether already attached */ + list_for_each_entry_safe(exist_dc, t, >cached_devs, list) { + if (!memcmp(dc->sb.uuid, exist_dc->sb.uuid, 16)) { + pr_err("Tried to attach %s but duplicate UUID already attached", + buf); + + return -EINVAL; + } + } + u = uuid_find(c, dc->sb.uuid); if (u && -- 2.14.1
Re: [PATCH 06/11] bcache: Use the blk_queue_flag_{set,clear}() functions
LGTM On Wed, Feb 28, 2018 at 11:28 AM, Bart Van Assche <bart.vanass...@wdc.com> wrote: > Use the blk_queue_flag_{set,clear}() functions instead of open-coding > these. > > Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com> > Cc: Michael Lyle <ml...@lyle.org> > Cc: Kent Overstreet <kent.overstr...@gmail.com> > Cc: Christoph Hellwig <h...@lst.de> > Cc: Hannes Reinecke <h...@suse.de> > Cc: Johannes Thumshirn <jthumsh...@suse.de> > Cc: Ming Lei <ming@redhat.com> > --- > drivers/md/bcache/super.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c > index 312895788036..047e30cb93dc 100644 > --- a/drivers/md/bcache/super.c > +++ b/drivers/md/bcache/super.c > @@ -833,9 +833,9 @@ static int bcache_device_init(struct bcache_device *d, > unsigned block_size, > q->limits.io_min= block_size; > q->limits.logical_block_size= block_size; > q->limits.physical_block_size = block_size; > - set_bit(QUEUE_FLAG_NONROT, >disk->queue->queue_flags); > - clear_bit(QUEUE_FLAG_ADD_RANDOM, >disk->queue->queue_flags); > - set_bit(QUEUE_FLAG_DISCARD, >disk->queue->queue_flags); > + blk_queue_flag_set(QUEUE_FLAG_NONROT, d->disk->queue); > + blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue); > + blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue); > > blk_queue_write_cache(q, true, true); > > -- > 2.16.2 >
Re: [PATCH v7 0/9] bcache: device failure handling improvement
On 02/27/2018 10:33 AM, Michael Lyle wrote: > On 02/27/2018 10:29 AM, Michael Lyle wrote: >> Hi Coly Li-- >> >> On 02/27/2018 08:55 AM, Coly Li wrote: >>> Hi maintainers and folks, >>> >>> This patch set tries to improve bcache device failure handling, includes >>> cache device and backing device failures. >> >> I have applied 1, 2, 4 & 6 from this series to my 4.17 bcache-for-next >> for testing. > > Correction: I meant 1, 2, 5 & 6. Sorry everyone for churn.

Present applied state is 1-6:

#6 6a2fb6357eae bcache: fix inaccurate io state for detached bcache devices
#5 6aee8a57d141 bcache: add stop_when_cache_set_failed option to backing device
#4 3a9c638b1950 bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
#3 1b355007b5ae bcache: stop dc->writeback_rate_update properly
#2 70120583e4b5 bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
#1 b8f557d64115 bcache: fix cached_dev->count usage for bch_cache_set_error()

There are some checkpatch warnings relating to non-preferred style (splitting strings to hit the 80-character limit), but I will fix them up myself later. This subset is through review and passes initial testing (I will be testing more and writing a test suite).

Mike
Re: [PATCH v7 3/9] bcache: stop dc->writeback_rate_update properly
OK, I have convinced myself this is safe. Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/27/2018 08:55 AM, Coly Li wrote: > struct delayed_work writeback_rate_update in struct cached_dev is a delayed > worker to call function update_writeback_rate() in period (the interval is > defined by dc->writeback_rate_update_seconds). > > When a metadata I/O error happens on cache device, bcache error handling > routine bch_cache_set_error() will call bch_cache_set_unregister() to > retire whole cache set. On the unregister code path, this delayed work is > stopped by calling cancel_delayed_work_sync(&dc->writeback_rate_update). > > dc->writeback_rate_update is a special delayed work from others in bcache. > In its routine update_writeback_rate(), this delayed work is re-armed > itself. That means when cancel_delayed_work_sync() returns, this delayed > work can still be executed after several seconds defined by > dc->writeback_rate_update_seconds. > > The problem is, after cancel_delayed_work_sync() returns, the cache set > unregister code path will continue and release memory of struct cache set. > Then the delayed work is scheduled to run, __update_writeback_rate() > will reference the already released cache_set memory, and trigger a NULL > pointer dereference fault. > > This patch introduces two more bcache device flags, > - BCACHE_DEV_WB_RUNNING > bit set: bcache device is in writeback mode and running, it is OK for > dc->writeback_rate_update to re-arm itself. > bit clear: bcache device is trying to stop dc->writeback_rate_update, > this delayed work should not re-arm itself and quit. > - BCACHE_DEV_RATE_DW_RUNNING > bit set: routine update_writeback_rate() is executing. > bit clear: routine update_writeback_rate() quits. > > This patch also adds a function cancel_writeback_rate_update_dwork() to > wait for dc->writeback_rate_update to quit before canceling it by calling > cancel_delayed_work_sync().
In order to avoid a deadlock when > dc->writeback_rate_update unexpectedly fails to quit, after time_out seconds this function will > give up and continue to call cancel_delayed_work_sync(). > > And here I explain how this patch stops the self re-armed delayed work properly > with the above flags. > > update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning > and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling > cancel_writeback_rate_update_dwork(), clear flag BCACHE_DEV_WB_RUNNING. > > Before calling cancel_delayed_work_sync(), wait until flag > BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling > cancel_delayed_work_sync(), dc->writeback_rate_update must be already re- > armed, or quit by seeing BCACHE_DEV_WB_RUNNING cleared. In both cases the > delayed work routine update_writeback_rate() won't be executed after > cancel_delayed_work_sync() returns. > > Inside update_writeback_rate(), flag > BCACHE_DEV_WB_RUNNING is checked before calling schedule_delayed_work(). If this flag is cleared, it means > someone is about to stop the delayed work. Because flag > BCACHE_DEV_RATE_DW_RUNNING is set already and cancel_delayed_work_sync() > has to wait for this flag to be cleared, we don't need to worry about a race > condition here. > > If update_writeback_rate() is scheduled to run after checking > BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync() > in cancel_writeback_rate_update_dwork(), it is also safe. Because at this > moment BCACHE_DEV_WB_RUNNING is cleared with a memory barrier. As I mentioned > previously, update_writeback_rate() will see BCACHE_DEV_WB_RUNNING is clear > and quit immediately. > > Because there are more dependencies inside update_writeback_rate() on struct > cache_set memory, dc->writeback_rate_update is not a simple self re-arming > delayed work. After trying many different methods (e.g.
hold dc->count, or > use locks), this is the only way I can find which works to properly stop > the dc->writeback_rate_update delayed work. > > Changelog: > v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING > to bit index, for test_bit(). > v2: Try to fix the race issue which is pointed out by Junhui. > v1: The initial version for review > > Signed-off-by: Coly Li <col...@suse.de> > Reviewed-by: Junhui Tang <tang.jun...@zte.com.cn> > Cc: Michael Lyle <ml...@lyle.org> > Cc: Hannes Reinecke <h...@suse.com> > --- > drivers/md/bcache/bcache.h | 9 + > drivers/md/bcache/super.c | 39 +++ > drivers/md/bcache/sysfs.c | 3 ++- > drivers/md/bcache/writeback.c | 29 - > 4 files changed, 70 insertions(+), 10 deletions(-)
Re: [PATCH v7 0/9] bcache: device failure handling improvement
On 02/27/2018 10:29 AM, Michael Lyle wrote: > Hi Coly Li-- > > On 02/27/2018 08:55 AM, Coly Li wrote: >> Hi maintainers and folks, >> >> This patch set tries to improve bcache device failure handling, including >> cache device and backing device failures. > > I have applied 1, 2, 4 & 6 from this series to my 4.17 bcache-for-next > for testing. Correction: I meant 1, 2, 5 & 6. > > Mike > -- > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: [PATCH v7 0/9] bcache: device failure handling improvement
Hi Coly Li-- On 02/27/2018 08:55 AM, Coly Li wrote: > Hi maintainers and folks, > > This patch set tries to improve bcache device failure handling, including > cache device and backing device failures. I have applied 1, 2, 4 & 6 from this series to my 4.17 bcache-for-next for testing. Mike
Re: [PATCH v7 9/9] bcache: stop bcache device when backing device is offline
Hi Coly Li-- Just a couple of questions. On 02/27/2018 08:55 AM, Coly Li wrote: > +#define BACKING_DEV_OFFLINE_TIMEOUT 5 I think you wanted this to be 30 (per commit message)-- was this turned down for testing or deliberate? > +static int cached_dev_status_update(void *arg) > +{ > + struct cached_dev *dc = arg; > + struct request_queue *q; > + char buf[BDEVNAME_SIZE]; > + > + /* > + * If this delayed worker is stopping outside, directly quit here. > + * dc->io_disable might be set via sysfs interface, so check it > + * here too. > + */ > + while (!kthread_should_stop() && !dc->io_disable) { > + q = bdev_get_queue(dc->bdev); > + if (blk_queue_dying(q)) I am not sure-- is this the correct test to know if the bdev is offline? It's very sparsely used outside of the core block system. (Is there any scenario where the queue comes back? Can the device be "offline" and still have a live queue? Another approach might be to set io_disable when there's an extended period without a successful IO, and it might be possible to do this with atomics without a thread). This approach can work if you're certain that blk_queue_dying is an appropriate test for all failure scenarios (I just don't know). Mike
Re: [PATCH v7 4/9] bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
Hi Coly Li-- On 02/27/2018 08:55 AM, Coly Li wrote: > When too many I/Os fail on the cache device, bch_cache_set_error() is called > in the error handling code path to retire the whole problematic cache set. If > new I/O requests continue to come and take refcount dc->count, the cache > set won't be retired immediately, and this is a problem. > > Furthermore, there are several kernel threads and self-armed kernel works that > may still be running after bch_cache_set_error() is called. It needs to wait > quite a while for them to stop, or they won't stop at all. They also > prevent the cache set from being retired. It's too bad this is necessary-- I wish the IO layer could latch errors for us in some kind of meaningful way instead of us having to do it ourselves (and filesystems, etc., each having to do similar things to prevent just continuously hitting IO timeouts). That said, the code looks good. Reviewed-by: Michael Lyle <ml...@lyle.org>
Re: [PATCH v7 1/9] bcache: fix cached_dev->count usage for bch_cache_set_error()
Hi Coly Li-- Thanks for this. I've been uncomfortable with the interaction between the dirty status and the refcount (even aside from this issue), and I believe you've resolved it. I'm sorry for the slow review-- it's taken me some time to convince myself that this is safe. I'm getting closer to being able to apply these for 4.17. Reviewed-by: Michael Lyle <ml...@lyle.org> Mike On 02/27/2018 08:55 AM, Coly Li wrote: > When bcache metadata I/O fails, bcache will call bch_cache_set_error() > to retire the whole cache set. The expected behavior to retire a cache > set is to unregister the cache set, and unregister all backing devices > attached to this cache set, then remove sysfs entries of the cache set > and all attached backing devices, finally release memory of structs > cache_set, cache, cached_dev and bcache_device. > > In my testing, when a journal I/O failure is triggered by a disconnected cache > device, sometimes the cache set cannot be retired, and its sysfs > entry /sys/fs/bcache/ still exists and the backing device also > references it. This is not expected behavior. > > When metadata I/O fails, the call sequence to retire the whole cache set is, > bch_cache_set_error() > bch_cache_set_unregister() > bch_cache_set_stop() > __cache_set_unregister() <- called as callback by calling > closure_queue(&c->caching) > cache_set_flush() <- called as a callback when refcount > of cache_set->caching is 0 > cache_set_free() <- called as a callback when refcount > of cache_set->cl is 0 > bch_cache_set_release() <- called as a callback when refcount > of cache_set->kobj is 0 > > I find if kernel thread bch_writeback_thread() quits the while-loop when > kthread_should_stop() is true and searched_full_index is false, the closure > callback cache_set_flush() set by continue_at() will never be called. The > result is, bcache fails to retire the whole cache set.
> > cache_set_flush() will be called when refcount of closure c->caching is 0, > and in function bcache_device_detach() refcount of closure c->caching is > released to 0 by closure_put(). In the metadata error code path, function > bcache_device_detach() is called by cached_dev_detach_finish(). This is a > callback routine being called when cached_dev->count is 0. This refcount > is decreased by cached_dev_put(). > > The above dependence indicates, cache_set_flush() will be called when > refcount of cache_set->cl is 0, and refcount of cache_set->cl will be 0 > when refcount of cached_dev->count is 0. > > The reason why sometimes cached_dev->count is not 0 (when metadata I/O fails > and bch_cache_set_error() is called) is, in bch_writeback_thread(), refcount > of cached_dev is not decreased properly. > > In bch_writeback_thread(), cached_dev_put() is called only when > searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a > there is no dirty data on cache. In most of the run time it is correct, but > when bch_writeback_thread() quits the while-loop while the cache is still > dirty, current code forgets to call cached_dev_put() before this kernel > thread exits. This is why sometimes cache_set_flush() is not executed and > the cache set fails to be retired. > > The reason to call cached_dev_put() in bch_writeback_rate() is, when the > cache device changes from clean to dirty, cached_dev_get() is called, to > make sure during writeback operations both backing and cache devices > won't be released. > > Adding the following code in bch_writeback_thread() does not work, > static int bch_writeback_thread(void *arg) > } > > + if (atomic_read(&dc->has_dirty)) > + cached_dev_put(dc); > + > return 0; > } > because the writeback kernel thread can be woken up and started via sysfs entry: > echo 1 > /sys/block/bcache/bcache/writeback_running > It is difficult to check whether the backing device is dirty without race and > extra lock.
So the above modification will introduce potential refcount > underflow in some conditions. > > The correct fix is, to take the cached dev refcount when creating the kernel > thread, and put it before the kernel thread exits. Then bcache does not > need to take a cached dev refcount when the cache turns from clean to dirty, > or to put a cached dev refcount when the cache turns from dirty to clean. The > writeback kernel thread is always safe to reference data structures from > the cache set, cache and cached device (because a refcount of the cache device is > taken for it already), and no matter the k
[PATCH 1/2] bcache: correct flash only vols (check all uuids)
From: Coly Li <col...@suse.de> Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by devices_max_used") adds c->devices_max_used to reduce iteration of c->uuids elements, this value is updated in bcache_device_attach(). But for a flash only volume, when calling flash_devs_run(), the function bcache_device_attach() is not called yet and c->devices_max_used is not updated. The unexpected result is, the flash only volume won't be run by flash_devs_run(). This patch fixes the issue by iterating all c->uuids elements in flash_devs_run(). c->devices_max_used will be updated properly when bcache_device_attach() gets called. [mlyle: commit subject edited for character limit] Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by devices_max_used") Reported-by: Tang Junhui <tang.jun...@zte.com.cn> Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 312895788036..4d1d8dfb2d2a 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1274,7 +1274,7 @@ static int flash_devs_run(struct cache_set *c) struct uuid_entry *u; for (u = c->uuids; - u < c->uuids + c->devices_max_used && !ret; + u < c->uuids + c->nr_uuids && !ret; u++) if (UUID_FLASH_ONLY(u)) ret = flash_dev_run(c, u); -- 2.14.1
[PATCH 0/2] Bcache fixes for 4.16
Hi Jens, Please pick up these two critical fixes to bcache by Tang Junhui. They're both one-liners and have been reviewed and tested. The first corrects a regression, introduced in 4.16-rc1, that appears when flash-only volumes are present. The second adjusts bio refcount and completion behavior to work with md RAID5 backing. Thanks, Mike
[PATCH 2/2] bcache: fix kcrashes with fio in RAID5 backend dev
From: Tang Junhui <tang.jun...@zte.com.cn> Kernel crashed when running fio in a RAID5 backend bcache device; the call trace is below: [ 440.012034] kernel BUG at block/blk-ioc.c:146! [ 440.012696] invalid opcode: [#1] SMP NOPTI [ 440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8 [ 440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16 /2015 [ 440.028615] RIP: 0010:put_io_context+0x8b/0x90 [ 440.029246] RSP: 0018:a8c882b43af8 EFLAGS: 00010246 [ 440.029990] RAX: RBX: a8c88294fca0 RCX: 00 0f4240 [ 440.031006] RDX: 0004 RSI: 0286 RDI: a8c882 94fca0 [ 440.032030] RBP: a8c882b43b10 R08: 0003 R09: 949cb8 0c1700 [ 440.033206] R10: 0104 R11: b71c R12: 000 01000 [ 440.034222] R13: R14: 949cad84db70 R15: 949cb11 bd1e0 [ 440.035239] FS: () GS:949cba28() knlGS: [ 440.060190] CS: 0010 DS: ES: CR0: 80050033 [ 440.084967] CR2: 7ff0493ef000 CR3: 0002f1e0a002 CR4: 001 606e0 [ 440.110498] Call Trace: [ 440.135443] bio_disassociate_task+0x1b/0x60 [ 440.160355] bio_free+0x1b/0x60 [ 440.184666] bio_put+0x23/0x30 [ 440.208272] search_free+0x23/0x40 [bcache] [ 440.231448] cached_dev_write_complete+0x31/0x70 [bcache] [ 440.254468] closure_put+0xb6/0xd0 [bcache] [ 440.277087] request_endio+0x30/0x40 [bcache] [ 440.298703] bio_endio+0xa1/0x120 [ 440.319644] handle_stripe+0x418/0x2270 [raid456] [ 440.340614] ? load_balance+0x17b/0x9c0 [ 440.360506] handle_active_stripes.isra.58+0x387/0x5a0 [raid456] [ 440.380675] ? __release_stripe+0x15/0x20 [raid456] [ 440.400132] raid5d+0x3ed/0x5d0 [raid456] [ 440.419193] ? schedule+0x36/0x80 [ 440.437932] ? schedule_timeout+0x1d2/0x2f0 [ 440.456136] md_thread+0x122/0x150 [ 440.473687] ? wait_woken+0x80/0x80 [ 440.491411] kthread+0x102/0x140 [ 440.508636] ? find_pers+0x70/0x70 [ 440.524927] ?
kthread_associate_blkcg+0xa0/0xa0 [ 440.541791] ret_from_fork+0x35/0x40 [ 440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41 [ 440.610020] RIP: put_io_context+0x8b/0x90 RSP: a8c882b43af8 [ 440.628575] ---[ end trace a1fd79d85643a73e ]--- This crash happens when a bypass IO is coming; in such a scenario s->iop.bio points to s->orig_bio. In search_free(), it finishes s->orig_bio by calling bio_complete(), and after that s->iop.bio becomes invalid, then the kernel would crash when calling bio_put(). Maybe it's the upper layer's fault, since the bio should not be freed before we call bio_put(), but we'd better call bio_put() first before calling bio_complete() to notify the upper layer of ending this bio. This patch moves bio_complete() after bio_put() to avoid the kernel crash. [mlyle: fixed commit subject for character limits] Reported-by: Matthias Ferdinand <bca...@mfedv.net> Tested-by: Matthias Ferdinand <bca...@mfedv.net> Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/request.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 1a46b41dac70..6422846b546e 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -659,11 +659,11 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio) static void search_free(struct closure *cl) { struct search *s = container_of(cl, struct search, cl); - bio_complete(s); if (s->iop.bio) bio_put(s->iop.bio); + bio_complete(s); closure_debug_destroy(cl); mempool_free(s, s->d->c->search); } -- 2.14.1
Re: [PATCH] bcache: check all uuid elements when start flash only volumes
Reviewed-by: Michael Lyle <ml...@lyle.org> I have this and the other fix in the queue for testing tomorrow and will pass along a changeset to Jens if all looks good. Thanks! Mike On 02/26/2018 08:15 AM, Coly Li wrote: > On 27/02/2018 12:10 AM, Coly Li wrote: >> Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by >> devices_max_used") adds c->devices_max_used to reduce iteration of >> c->uuids elements, this value is updated in bcache_device_attach(). >> >> But for flash only volume, when calling flash_devs_run(), the function >> bcache_device_attach() is not called yet and c->devices_max_used is not >> updated. The unexpected result is, the flash only volume won't be run >> by flash_devs_run(). >> >> This patch fixes the issue by iterating all c->uuids elements in >> flash_devs_run(). c->devices_max_used will be updated properly when >> bcache_device_attach() gets called. >> >> Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by >> devices_max_used") >> Reported-by: Tang Junhui <tang.jun...@zte.com.cn> >> Signed-off-by: Coly Li <col...@suse.de> >> Cc: Michael Lyle <ml...@lyle.org> >> --- >> drivers/md/bcache/super.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c >> index 312895788036..4d1d8dfb2d2a 100644 >> --- a/drivers/md/bcache/super.c >> +++ b/drivers/md/bcache/super.c >> @@ -1274,7 +1274,7 @@ static int flash_devs_run(struct cache_set *c) >> struct uuid_entry *u; >> >> for (u = c->uuids; >> - u < c->uuids + c->devices_max_used && !ret; >> + u < c->uuids + c->nr_uuids && !ret; >> u++) >> if (UUID_FLASH_ONLY(u)) >> ret = flash_dev_run(c, u); >> > > Hi Jens and Michael, > > Could you please have a look and pick this fix into 4.16? Commit > 2831231d4c3f ("bcache: reduce cache_set devices iteration by > devices_max_used") was just merged in 4.16-rc1, so I don't CC stable for > this fix.
> > And I still need a peer code reviewer on this patch, any Reviewed-by is > welcome. > > Thanks to Junhui Tang for reporting this issue today. > > Coly Li
[PATCH 7/8] bcache: return attach error when no cache set exist
From: Tang Junhui <tang.jun...@zte.com.cn> I attached a back-end device to a cache set while the cache set was not registered yet; the back-end device did not attach successfully, but no error was returned: [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach [root]# In sysfs_attach(), the return value "v" is initialized to "size" in the beginning, and if no cache set exists in bch_cache_sets, the "v" value would not change any more and is returned to sysfs; sysfs regards it as success since "size" is a positive number. This patch fixes this issue by assigning "v" with "-ENOENT" in the initialization. Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/sysfs.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 4a6a697e1680..a7552f509f50 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -198,7 +198,7 @@ STORE(__cached_dev) { struct cached_dev *dc = container_of(kobj, struct cached_dev, disk.kobj); - ssize_t v = size; + ssize_t v; struct cache_set *c; struct kobj_uevent_env *env; @@ -275,6 +275,7 @@ STORE(__cached_dev) if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16) return -EINVAL; + v = -ENOENT; list_for_each_entry(c, &bch_cache_sets, list) { v = bch_cached_dev_attach(dc, c); if (!v) @@ -282,7 +283,7 @@ STORE(__cached_dev) } pr_err("Can't attach %s: cache set not found", buf); - size = v; + return v; } if (attr == &sysfs_detach && dc->disk.c) -- 2.14.1
[PATCH 5/8] bcache: fix for allocator and register thread race
From: Tang Junhui <tang.jun...@zte.com.cn> After a long time of running random small IO writes, I rebooted the machine, and after the machine powered on, I found bcache got stuck; the stack is: [root@ceph153 ~]# cat /proc/2510/task/*/stack [] closure_sync+0x25/0x90 [bcache] [] bch_journal+0x118/0x2b0 [bcache] [] bch_journal_meta+0x47/0x70 [bcache] [] bch_prio_write+0x237/0x340 [bcache] [] bch_allocator_thread+0x3c8/0x3d0 [bcache] [] kthread+0xcf/0xe0 [] ret_from_fork+0x58/0x90 [] 0x [root@ceph153 ~]# cat /proc/2038/task/*/stack [] __bch_btree_map_nodes+0x12d/0x150 [bcache] [] bch_btree_insert+0xf1/0x170 [bcache] [] bch_journal_replay+0x13f/0x230 [bcache] [] run_cache_set+0x79a/0x7c2 [bcache] [] register_bcache+0xd48/0x1310 [bcache] [] kobj_attr_store+0xf/0x20 [] sysfs_write_file+0xc6/0x140 [] vfs_write+0xbd/0x1e0 [] SyS_write+0x7f/0xe0 [] system_call_fastpath+0x16/0x1 The stack shows the register thread and allocator thread were getting stuck when registering the cache device. I rebooted the machine several times; the issue always exists on this machine. I debugged the code, and found the call trace as below: register_bcache() ==>run_cache_set() ==>bch_journal_replay() ==>bch_btree_insert() ==>__bch_btree_map_nodes() ==>btree_insert_fn() ==>btree_split() //node need split ==>btree_check_reserve() In btree_check_reserve(), it will check if there are enough buckets of RESERVE_BTREE type; since the allocator thread did not work yet, no buckets of RESERVE_BTREE type are allocated, so the register thread waits on c->btree_cache_wait, and goes to sleep.
Then the allocator thread initialized; the call trace is below: bch_allocator_thread() ==>bch_prio_write() ==>bch_journal_meta() ==>bch_journal() ==>journal_wait_for_write() In journal_wait_for_write(), it will check if the journal is full by journal_full(), but the long time of random small IO writing causes the exhaustion of journal buckets (journal.blocks_free=0). In order to release the journal buckets, the allocator calls btree_flush_write() to flush keys to btree nodes, and waits on c->journal.wait until btree node writing is over or there is already some journal bucket space, then the allocator thread goes to sleep. But in btree_flush_write(), since bch_journal_replay() is not finished, no btree nodes have journal (the condition "if (btree_current_write(b)->journal)" is never satisfied), so we get no btree node to flush, no journal bucket is released, and the allocator sleeps all the time. Through the above analysis, we can see that: 1) The register thread waits for the allocator thread to allocate buckets of RESERVE_BTREE type; 2) The allocator thread waits for the register thread to replay the journal, so it can flush btree nodes and get a journal bucket. Then they both get stuck waiting for each other. Hua Rui provided a patch for me, allocating some buckets of RESERVE_BTREE type in advance, so the register thread can get a bucket when a btree node splits and does not need to wait for the allocator thread. I tested it, it had an effect, and the register thread ran a step forward, but finally still got stuck; the reason is only 8 buckets of RESERVE_BTREE type were allocated, and in bch_journal_replay(), after 2 btree node splits, only 4 buckets of RESERVE_BTREE type were left, then btree_check_reserve() is not satisfied anymore, so it goes to sleep again, and at the same time, the allocator thread did not flush enough btree nodes to release a journal bucket, so they both got stuck again. So we need to allocate more buckets of RESERVE_BTREE type in advance, but how much is enough?
By experience and test, I think it should be as much as the journal buckets. Then I modified the code as in this patch, tested it on the machine, and it works. This patch is modified based on Hua Rui’s patch, and allocates more buckets of RESERVE_BTREE type in advance to avoid the register thread and the allocator thread waiting for each other. [patch v2] ca->sb.njournal_buckets would be 0 the first time after cache creation, and no journal exists, so just 8 btree buckets is OK. Signed-off-by: Hua Rui <huarui@gmail.com> Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/btree.c | 9 ++--- drivers/md/bcache/super.c | 13 - 2 files changed, 18 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index bf3a48aa9a9a..fad9fe8817eb 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -1869,14 +1869,17 @@ void bch_initial_gc_finish(struct cache_set *c) */ for_each_cache(ca, c, i) { for_each_bucket(b, ca) { - if (fifo_full(&ca->free[RESERVE_PRIO])) +
[PATCH 8/8] bcache: fix for data collapse after re-attaching an attached device
From: Tang Junhui <tang.jun...@zte.com.cn> Back-end device sdm has already attached to a cache set with ID f67ebe1f-f8bc-4d73-bfe5-9dc88607f119; then we try to attach it to another cache set, and it returns an error: [root]# cd /sys/block/sdm/bcache [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach -bash: echo: write error: Invalid argument After that, execute a command to modify the label of the bcache device: [root]# echo data_disk1 > label Then we reboot the system; when the system powers on, the back-end device cannot attach to the cache set, and a message shows in the log: Feb 5 12:05:52 ceph152 kernel: [922385.508498] bcache: bch_cached_dev_attach() couldn't find uuid for sdm in set In sysfs_attach(), dc->sb.set_uuid was assigned the value input through sysfs, no matter whether bch_cached_dev_attach() succeeded or not. For example, if the back-end device has already attached to a cache set, bch_cached_dev_attach() would fail, but dc->sb.set_uuid was changed. Then modifying the label of the bcache device will call bch_write_bdev_super(), which writes dc->sb.set_uuid to the super block, so we record a wrong cache set ID in the super block; after the system reboots, the cache set couldn't find the uuid of the back-end device, so the bcache device couldn't be created and used any more. In this patch, we don't assign the cache set ID to dc->sb.set_uuid in sysfs_attach() directly, but pass it into bch_cached_dev_attach(), and assign dc->sb.set_uuid to the cache set ID after the back-end device has attached to the cache set successfully.
Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> Reviewed-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/bcache.h | 2 +- drivers/md/bcache/super.c | 10 ++ drivers/md/bcache/sysfs.c | 6 -- 3 files changed, 11 insertions(+), 7 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index b8c2e1bef1f1..12e5197f186c 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -924,7 +924,7 @@ void bcache_write_super(struct cache_set *); int bch_flash_dev_create(struct cache_set *c, uint64_t size); -int bch_cached_dev_attach(struct cached_dev *, struct cache_set *); +int bch_cached_dev_attach(struct cached_dev *, struct cache_set *, uint8_t *); void bch_cached_dev_detach(struct cached_dev *); void bch_cached_dev_run(struct cached_dev *); void bcache_device_stop(struct bcache_device *); diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index a2ad37a8afc0..312895788036 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -957,7 +957,8 @@ void bch_cached_dev_detach(struct cached_dev *dc) cached_dev_put(dc); } -int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c) +int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, + uint8_t *set_uuid) { uint32_t rtime = cpu_to_le32(get_seconds()); struct uuid_entry *u; @@ -965,7 +966,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c) bdevname(dc->bdev, buf); - if (memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16)) + if ((set_uuid && memcmp(set_uuid, c->sb.set_uuid, 16)) || + (!set_uuid && memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16))) return -ENOENT; if (dc->disk.c) { @@ -1194,7 +1196,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page, list_add(&dc->list, &uncached_devices); list_for_each_entry(c, &bch_cache_sets, list) - bch_cached_dev_attach(dc, c); + bch_cached_dev_attach(dc, c, NULL); if (BDEV_STATE(&dc->sb) == BDEV_STATE_NONE || BDEV_STATE(&dc->sb) == BDEV_STATE_STALE) @@ -1716,7 +1718,7 @@ static
void run_cache_set(struct cache_set *c) bcache_write_super(c); list_for_each_entry_safe(dc, t, &uncached_devices, list) - bch_cached_dev_attach(dc, c); + bch_cached_dev_attach(dc, c, NULL); flash_devs_run(c); diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index a7552f509f50..78cd7bd50fdd 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -272,12 +272,14 @@ STORE(__cached_dev) } if (attr == &sysfs_attach) { - if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16) + uint8_t set_uuid[16]; + + if (bch_parse_uuid(buf, set_uuid) < 16) return -EINVAL; v = -ENOENT; list_for_each_entry(c, &bch_cache_sets, list) { - v = bch_cached_dev_attach(dc, c); + v = bch_cached_dev_attach(dc, c, set_uuid); if (!v) return size; } -- 2.14.1
[PATCH 4/8] bcache: set error_limit correctly
From: Coly Li <col...@suse.de> Struct cache uses io_errors for two purposes, - Error decay: when cache set error_decay is set, io_errors is used to generate a small piece of delay when I/O error happens. - I/O errors counter: in order to generate big enough value for error decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a IO_ERROR_SHIFT). In function bch_count_io_errors(), if the I/O errors counter reaches the cache set error limit, bch_cache_set_error() will be called to retire the whole cache set. But current code is problematic when checking the error limit, see the following code piece from bch_count_io_errors(), 90 if (error) { 91 char buf[BDEVNAME_SIZE]; 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT, 93 &ca->io_errors); 94 errors >>= IO_ERROR_SHIFT; 95 96 if (errors < ca->set->error_limit) 97 pr_err("%s: IO error on %s, recovering", 98 bdevname(ca->bdev, buf), m); 99 else 100 bch_cache_set_error(ca->set, 101 "%s: too many IO errors %s", 102 bdevname(ca->bdev, buf), m); 103 } At line 94, errors is right shifted by IO_ERROR_SHIFT bits; now it is the real errors counter to compare at line 96. But ca->set->error_limit is initialized with an amplified value in bch_cache_set_alloc(), 1545 c->error_limit = 8 << IO_ERROR_SHIFT; It means by default, in bch_count_io_errors(), before 8<<20 errors have happened, bch_cache_set_error() won't be called to retire the problematic cache device. If the average request size is 64KB, it means bcache won't handle a failed device until 512GB of data is requested. This is too large to be an I/O threshold. So I believe the correct error limit should be much less. This patch sets the default cache set error limit to 8, then in bch_count_io_errors() when the errors counter reaches 8 (if it is the default value), function bch_cache_set_error() will be called to retire the whole cache set. This patch also removes the bit shifting when storing or showing the io_error_limit value via the sysfs interface.
Nowadays most SSDs handle internal flash failure automatically by LBA address re-indirect mapping. If an I/O error can be observed by upper layer code, it will be a notable error because that SSD cannot re-indirect map the problematic LBA address to an available flash block. This situation indicates the whole SSD will fail very soon. Therefore setting 8 as the default io error limit value makes sense; it is enough for most cache devices. Changelog: v2: add reviewed-by from Hannes. v1: initial version for review. Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Hannes Reinecke <h...@suse.com> Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn> Reviewed-by: Michael Lyle <ml...@lyle.org> Cc: Junhui Tang <tang.jun...@zte.com.cn> --- drivers/md/bcache/bcache.h | 1 + drivers/md/bcache/super.c | 2 +- drivers/md/bcache/sysfs.c | 4 ++-- 3 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index a857eb3c10de..b8c2e1bef1f1 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -666,6 +666,7 @@ struct cache_set { ON_ERROR_UNREGISTER, ON_ERROR_PANIC, } on_error; +#define DEFAULT_IO_ERROR_LIMIT 8 unsigned error_limit; unsigned error_decay; diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 133b81225ea9..e29150ec17e7 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1553,7 +1553,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) c->congested_read_threshold_us = 2000; c->congested_write_threshold_us = 20000; - c->error_limit = 8 << IO_ERROR_SHIFT; + c->error_limit = DEFAULT_IO_ERROR_LIMIT; return c; err: diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 46e7a6b3e7c7..c524305cc9a7 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -568,7 +568,7 @@ SHOW(__bch_cache_set) /* See count_io_errors for why 88 */ sysfs_print(io_error_halflife, c->error_decay * 88); -
sysfs_print(io_error_limit, c->error_limit >> IO_ERROR_SHIFT); + sysfs_print(io_error_limit, c->error_limit); sysfs_hprint(congested, ((uint64_t) bch_get_congested(c)) << 9); @@ -668,7 +668,7 @@ STORE(__bch_cache_set) } if (attr == _io_error_limit) - c->error_limit = strtoul_or_return(buf) << IO_ERROR_SHIFT; + c->error_limit = strtoul_or_return(buf); /* See count_io_errors() for why 88 */ if (attr == _io_error_halflife) -- 2.14.1
[PATCH 3/8] bcache: properly set task state in bch_writeback_thread()
From: Coly Li <col...@suse.de>

Kernel thread routine bch_writeback_thread() has the following code
block,

447         down_write(&dc->writeback_lock);
448~450     if (check conditions) {
451                 up_write(&dc->writeback_lock);
452                 set_current_state(TASK_INTERRUPTIBLE);
453
454                 if (kthread_should_stop())
455                         return 0;
456
457                 schedule();
458                 continue;
459         }

If the condition check is true, the task state is set to
TASK_INTERRUPTIBLE and schedule() is called to wait for others to wake
it up.

There are 2 issues in the current code,
1, Task state is set to TASK_INTERRUPTIBLE after the condition checks.
   If another process changes the condition and calls
   wake_up_process(dc->writeback_thread), then at line 452 the task
   state is set back to TASK_INTERRUPTIBLE, and the writeback kernel
   thread will lose a chance to be woken up.
2, At line 454, if kthread_should_stop() is true, the writeback kernel
   thread will return to kernel/kthread.c:kthread() with
   TASK_INTERRUPTIBLE and call do_exit(). It is not good to enter
   do_exit() with task state TASK_INTERRUPTIBLE; in the following code
   path might_sleep() is called and a warning message is reported by
   __might_sleep(): "WARNING: do not call blocking ops when
   !TASK_RUNNING; state=1 set at []".

For the first issue, the task state should be set before the condition
checks. Indeed, because dc->writeback_lock is required when modifying
all the conditions, calling set_current_state() inside the code block
where dc->writeback_lock is held is safe. But this is quite implicit, so
I still move set_current_state() before all the condition checks.

For the second issue, frankly speaking it does not hurt when a kernel
thread exits with TASK_INTERRUPTIBLE state, but this warning message
scares users, makes them feel there might be something risky with bcache
that could hurt their data. Setting task state to TASK_RUNNING before
returning fixes this problem.

In alloc.c:allocator_wait(), there is a similar issue, and it is also
fixed in this patch.
Changelog:
v3: merge two similar fixes into one patch
v2: fix the race issue in v1 patch.
v1: initial buggy fix.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.de>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Michael Lyle <ml...@lyle.org>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/alloc.c     | 4 +++-
 drivers/md/bcache/writeback.c | 7 +++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 6cc6c0f9c3a9..458e1d38577d 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -287,8 +287,10 @@ do {									\
 			break;						\
 									\
 		mutex_unlock(&(ca)->set->bucket_lock);			\
-		if (kthread_should_stop())				\
+		if (kthread_should_stop()) {				\
+			set_current_state(TASK_RUNNING);		\
 			return 0;					\
+		}							\
 									\
 		schedule();						\
 		mutex_lock(&(ca)->set->bucket_lock);			\
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 51306a19ab03..58218f7e77c3 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -564,18 +564,21 @@ static int bch_writeback_thread(void *arg)
 
 	while (!kthread_should_stop()) {
 		down_write(&dc->writeback_lock);
+		set_current_state(TASK_INTERRUPTIBLE);
 		if (!atomic_read(&dc->has_dirty) ||
 		    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
 		     !dc->writeback_running)) {
 			up_write(&dc->writeback_lock);
-			set_current_state(TASK_INTERRUPTIBLE);
 
-			if (kthread_should_stop())
+			if (kthread_should_stop()) {
+				set_current_state(TASK_RUNNING);
 				return 0;
+			}
 
 			schedule();
 			continue;
 		}
+		set_current_state(TASK_RUNNING);
 
 		searched_full_index = refill_dirty(dc);
-- 
2.14.1
[PATCH 1/8] bcache: add journal statistic
From: Tang Junhui <tang.jun...@zte.com.cn>

Sometimes the journal takes up a lot of CPU, and we need statistics to
know what the journal is doing. So this patch provides some journal
statistics:
1) reclaim: how many times the journal tries to reclaim resources,
   usually because the journal buckets and/or the pins are exhausted.
2) flush_write: how many times the journal tries to flush a btree node
   to the cache device, usually because the journal buckets are
   exhausted.
3) retry_flush_write: how many times the journal retries flushing the
   next btree node, usually because the previous btree node has been
   flushed by another thread.

These statistics are shown via the sysfs interface. Through them we can
fully observe the status of the journal module when CPU usage is too
high.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bcache.h  |  4 ++++
 drivers/md/bcache/journal.c |  5 +++++
 drivers/md/bcache/sysfs.c   | 15 +++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5e2d4e80198e..b98d7705acb6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -658,6 +658,10 @@ struct cache_set {
 	atomic_long_t		writeback_keys_done;
 	atomic_long_t		writeback_keys_failed;
 
+	atomic_long_t		reclaim;
+	atomic_long_t		flush_write;
+	atomic_long_t		retry_flush_write;
+
 	enum			{
 		ON_ERROR_UNREGISTER,
 		ON_ERROR_PANIC,
 	}			on_error;
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index a87165c1d8e5..f5296007a9d5 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -377,6 +377,8 @@ static void btree_flush_write(struct cache_set *c)
 	 */
 	struct btree *b, *best;
 	unsigned i;
+
+	atomic_long_inc(&c->flush_write);
 retry:
 	best = NULL;
 
@@ -397,6 +399,7 @@ static void btree_flush_write(struct cache_set *c)
 		if (!btree_current_write(b)->journal) {
 			mutex_unlock(&b->write_lock);
 			/* We raced */
+			atomic_long_inc(&c->retry_flush_write);
 			goto retry;
 		}
 
@@ -476,6 +479,8 @@ static void journal_reclaim(struct cache_set *c)
 	unsigned iter, n = 0;
 	atomic_t p;
 
+	atomic_long_inc(&c->reclaim);
+
 	while (!atomic_read(&fifo_front(&c->journal.pin)))
 		fifo_pop(&c->journal.pin, p);
 
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index b4184092c727..46e7a6b3e7c7 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -65,6 +65,9 @@ read_attribute(bset_tree_stats);
 
 read_attribute(state);
 read_attribute(cache_read_races);
+read_attribute(reclaim);
+read_attribute(flush_write);
+read_attribute(retry_flush_write);
 read_attribute(writeback_keys_done);
 read_attribute(writeback_keys_failed);
 read_attribute(io_errors);
@@ -545,6 +548,15 @@ SHOW(__bch_cache_set)
 	sysfs_print(cache_read_races,
 		    atomic_long_read(&c->cache_read_races));
 
+	sysfs_print(reclaim,
+		    atomic_long_read(&c->reclaim));
+
+	sysfs_print(flush_write,
+		    atomic_long_read(&c->flush_write));
+
+	sysfs_print(retry_flush_write,
+		    atomic_long_read(&c->retry_flush_write));
+
 	sysfs_print(writeback_keys_done,
 		    atomic_long_read(&c->writeback_keys_done));
 	sysfs_print(writeback_keys_failed,
@@ -731,6 +743,9 @@ static struct attribute *bch_cache_set_internal_files[] = {
 	&sysfs_bset_tree_stats,
 
 	&sysfs_cache_read_races,
+	&sysfs_reclaim,
+	&sysfs_flush_write,
+	&sysfs_retry_flush_write,
 	&sysfs_writeback_keys_done,
 	&sysfs_writeback_keys_failed,
-- 
2.14.1
[PATCH 6/8] bcache: set writeback_rate_update_seconds in range [1, 60] seconds
From: Coly Li <col...@suse.de>

dc->writeback_rate_update_seconds can be set via sysfs and its value can
be set to [1, ULONG_MAX]. It does not make sense to set such a large
value; 60 seconds is long enough, considering the default 5 seconds has
worked well for a long time.

Because dc->writeback_rate_update is a special delayed work, it re-arms
itself inside the delayed work routine update_writeback_rate(). When
stopping it by cancel_delayed_work_sync(), there should be a timeout to
wait and make sure the re-armed delayed work is stopped too. A small max
value of dc->writeback_rate_update_seconds is also helpful to decide a
reasonable small timeout.

This patch limits the sysfs interface to set
dc->writeback_rate_update_seconds in the range of [1, 60] seconds, and
replaces the hand-coded numbers with macros.

Changelog:
v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
v1: initial version.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/sysfs.c     | 4 +++-
 drivers/md/bcache/writeback.c | 2 +-
 drivers/md/bcache/writeback.h | 3 +++
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index c524305cc9a7..4a6a697e1680 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -218,7 +218,9 @@ STORE(__cached_dev)
 	sysfs_strtoul_clamp(writeback_rate,
 			    dc->writeback_rate.rate, 1, INT_MAX);
 
-	d_strtoul_nonzero(writeback_rate_update_seconds);
+	sysfs_strtoul_clamp(writeback_rate_update_seconds,
+			    dc->writeback_rate_update_seconds,
+			    1, WRITEBACK_RATE_UPDATE_SECS_MAX);
 	d_strtoul(writeback_rate_i_term_inverse);
 	d_strtoul_nonzero(writeback_rate_p_term_inverse);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 58218f7e77c3..f1d2fc15abcc 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -655,7 +655,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 	dc->writeback_rate.rate		= 1024;
 	dc->writeback_rate_minimum	= 8;
 
-	dc->writeback_rate_update_seconds = 5;
+	dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
 	dc->writeback_rate_p_term_inverse = 40;
 	dc->writeback_rate_i_term_inverse = 10000;
 
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 66f1c527fa24..587b25599856 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -8,6 +8,9 @@
 #define MAX_WRITEBACKS_IN_PASS  5
 #define MAX_WRITESIZE_IN_PASS   5000	/* *512b */
 
+#define WRITEBACK_RATE_UPDATE_SECS_MAX		60
+#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT	5
+
 /*
  * 14 (16384ths) is chosen here as something that each backing device
  * should be a reasonable fraction of the share, and not to blow up
-- 
2.14.1
[PATCH 2/8] bcache: fix high CPU occupancy during journal
From: Tang Junhui <tang.jun...@zte.com.cn>

After small-write I/O had been running for a long time, we found that
CPU occupancy was very high and I/O performance had been reduced by
about half:

[root@ceph151 internal]# top
top - 15:51:05 up 1 day, 2:43,  4 users,  load average: 16.89, 15.15, 16.53
Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us, 17.1 sy,  0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem

  PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2023 root  20   0       0      0      0 S  55.1  0.0   0:04.42 kworker/11:191
14126 root  20   0       0      0      0 S  42.9  0.0   0:08.72 kworker/10:3
 9292 root  20   0       0      0      0 S  30.4  0.0   1:10.99 kworker/6:1
 8553 ceph  20   0 4242492 1.805g  18804 S  30.0  2.9 410:07.04 ceph-osd
12287 root  20   0       0      0      0 S  26.7  0.0   0:28.13 kworker/7:85
31019 root  20   0       0      0      0 S  26.1  0.0   1:30.79 kworker/22:1
 1787 root  20   0       0      0      0 R  25.7  0.0   5:18.45 kworker/8:7
32169 root  20   0       0      0      0 S  14.5  0.0   1:01.92 kworker/23:1
21476 root  20   0       0      0      0 S  13.9  0.0   0:05.09 kworker/1:54
 2204 root  20   0       0      0      0 S  12.5  0.0   1:25.17 kworker/9:10
16994 root  20   0       0      0      0 S  12.2  0.0   0:06.27 kworker/5:106
15714 root  20   0       0      0      0 R  10.9  0.0   0:01.85 kworker/19:2
 9661 ceph  20   0 4246876 1.731g  18800 S  10.6  2.8 403:00.80 ceph-osd
11460 ceph  20   0 4164692 2.206g  18876 S  10.6  3.5 360:27.19 ceph-osd
 9960 root  20   0       0      0      0 S  10.2  0.0   0:02.75 kworker/2:139
11699 ceph  20   0 4169244 1.920g  18920 S  10.2  3.1 355:23.67 ceph-osd
 6843 ceph  20   0 4197632 1.810g  18900 S   9.6  2.9 380:08.30 ceph-osd

The kernel workers consumed a lot of CPU, and I found they were running
journal work: the journal was reclaiming resources and flushing btree
nodes with surprising frequency.
Through further analysis, we found that in btree_flush_write() we try to
get the btree node with the smallest fifo index to flush, by traversing
all the btree nodes in c->bucket_hash. After we get it, since no lock
protects it, this btree node may have already been written to the cache
device by other work; if this occurs, we retry the traversal of
c->bucket_hash to get another btree node. When the problem occurred, the
number of retries was very high, and we consumed a lot of CPU looking
for an appropriate btree node.

In this patch, we record up to 128 btree nodes with the smallest fifo
indices in a heap, and pop them one by one when we need to flush a btree
node. This greatly reduces the time spent in the loop to find an
appropriate btree node, and also reduces the CPU occupancy.

[note by mpl: this triggers a checkpatch error because of adjacent,
pre-existing style violations]

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bcache.h  |  2 ++
 drivers/md/bcache/journal.c | 47 ++++++++++++++++++++++++++---------------
 drivers/md/bcache/util.h    |  2 ++
 3 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index b98d7705acb6..a857eb3c10de 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -679,6 +679,8 @@ struct cache_set {
 
 #define BUCKET_HASH_BITS	12
 	struct hlist_head	bucket_hash[1 << BUCKET_HASH_BITS];
+
+	DECLARE_HEAP(struct btree *, flush_btree);
 };
 
 struct bbio {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index f5296007a9d5..1b736b860739 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -368,6 +368,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 }
 
 /* Journalling */
+#define journal_max_cmp(l, r) \
+	(fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) < \
+	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
+#define journal_min_cmp(l, r) \
+	(fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) > \
+	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
 
 static void btree_flush_write(struct cache_set *c)
 {
@@ -375,25 +381,35 @@ static void btree_flush_write(struct cache_set *c)
 	 * Try to find the btree node with that references the oldest journal
 	 * entry, best is our current candidate and is locked if non NULL:
 	 */
-	struct btree *b, *best;
-	unsigned i;
+	struct btree *b;
+	int i;
 
 	atomic_long_inc(&c->flush_write);
+
 retry:
-	best = NULL;
-
-	for_each_cached_btree(b, c, i)
-		if (btree_current_write(b)->journal) {
-			if (!best)
Bcache patches for inclusion in 4.16
Hi Jens---

Here are a few more changes for 4.16 from Coly Li and Tang Junhui.

Probably the most "functional" is 2/8, which makes a change to improve a
performance bug under sustained small writes. Still, it's a small, safe
change that seems good to take. The rest are small bugfixes. All have
seen some degree of testing (which is, of course, always ongoing).

Note that there's one checkpatch error in this, but it was already
present and is just triggered by altering an adjacent line.

 9 files changed, 111 insertions(+), 36 deletions(-)

Coly Li (3):
  bcache: properly set task state in bch_writeback_thread()
  bcache: set error_limit correctly
  bcache: set writeback_rate_update_seconds in range [1, 60] seconds

Tang Junhui (5):
  bcache: add journal statistic
  bcache: fix high CPU occupancy during journal
  bcache: fix for allocator and register thread race
  bcache: return attach error when no cache set exist
  bcache: fix for data collapse after re-attaching an attached device

Thanks, as always,

Mike
Re: [PATCH v5 07/10] bcache: fix inaccurate io state for detached bcache devices
detatched should be detached, otherwise lgtm.

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/05/2018 08:20 PM, Coly Li wrote:
> From: Tang Junhui <tang.jun...@zte.com.cn>
>
> When we run IO in a detached device, and run iostat to shows IO status,
> normally it will show like bellow (Omitted some fields):
> Device: ... avgrq-sz avgqu-sz await r_await w_await  svctm  %util
> sdd     ...    15.89     0.53  1.82    0.20    2.23   1.81  52.30
> bcache0 ...    15.89   115.42  0.00    0.00    0.00   2.40  69.60
> but after IO stopped, there are still very big avgqu-sz and %util
> values as bellow:
> Device: ... avgrq-sz avgqu-sz await r_await w_await  svctm  %util
> bcache0 ...     0    5326.32  0.00    0.00    0.00   0.00  100.10
>
> The reason for this issue is that, only generic_start_io_acct() called
> and no generic_end_io_acct() called for detached device in
> cached_dev_make_request(). See the code:
> //start generic_start_io_acct()
> generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
> if (cached_dev_get(dc)) {
> 	//will callback generic_end_io_acct()
> }
> else {
> 	//will not call generic_end_io_acct()
> }
>
> This patch calls generic_end_io_acct() in the end of IO for detached
> devices, so we can show IO state correctly.
>
> (Modified to use GFP_NOIO in kzalloc() by Coly Li)
>
> Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
> Reviewed-by: Coly Li <col...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> ---
>  drivers/md/bcache/request.c | 58 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 51 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 02296bda6384..e09c5ae745be 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -986,6 +986,55 @@ static void cached_dev_nodata(struct closure *cl)
>  	continue_at(cl, cached_dev_bio_complete, NULL);
>  }
>  
> +struct detached_dev_io_private {
> +	struct bcache_device	*d;
> +	unsigned long		start_time;
> +	bio_end_io_t		*bi_end_io;
> +	void			*bi_private;
> +};
> +
> +static void detatched_dev_end_io(struct bio *bio)
> +{
> +	struct detached_dev_io_private *ddip;
> +
> +	ddip = bio->bi_private;
> +	bio->bi_end_io = ddip->bi_end_io;
> +	bio->bi_private = ddip->bi_private;
> +
> +	generic_end_io_acct(ddip->d->disk->queue,
> +			    bio_data_dir(bio),
> +			    &ddip->d->disk->part0, ddip->start_time);
> +
> +	kfree(ddip);
> +
> +	bio->bi_end_io(bio);
> +}
> +
> +static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
> +{
> +	struct detached_dev_io_private *ddip;
> +	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +
> +	/*
> +	 * no need to call closure_get(&d->disk.cl),
> +	 * because upper layer had already opened bcache device,
> +	 * which would call closure_get(&d->disk.cl)
> +	 */
> +	ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
> +	ddip->d = d;
> +	ddip->start_time = jiffies;
> +	ddip->bi_end_io = bio->bi_end_io;
> +	ddip->bi_private = bio->bi_private;
> +	bio->bi_end_io = detatched_dev_end_io;
> +	bio->bi_private = ddip;
> +
> +	if ((bio_op(bio) == REQ_OP_DISCARD) &&
> +	    !blk_queue_discard(bdev_get_queue(dc->bdev)))
> +		bio->bi_end_io(bio);
> +	else
> +		generic_make_request(bio);
> +}
> +
>  /* Cached devices - read & write stuff */
>  
>  static blk_qc_t cached_dev_make_request(struct request_queue *q,
> @@ -1028,13 +1077,8 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
>  		else
>  			cached_dev_read(dc, s);
>  	}
> -	} else {
> -		if ((bio_op(bio) == REQ_OP_DISCARD) &&
> -		    !blk_queue_discard(bdev_get_queue(dc->bdev)))
> -			bio_endio(bio);
> -		else
> -			generic_make_request(bio);
> -	}
> +	} else
> +		detached_dev_do_request(d, bio);
>  
>  	return BLK_QC_T_NONE;
>  }
>
Re: [PATCH v5 06/10] bcache: add stop_when_cache_set_failed option to backing device
I agree about the typos-- after they're fixed,

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/05/2018 08:20 PM, Coly Li wrote:
> When there are too many I/O errors on cache device, current bcache code
> will retire the whole cache set, and detach all bcache devices. But the
> detached bcache devices are not stopped, which is problematic when bcache
> is in writeback mode.
>
> If the retired cache set has dirty data of backing devices, continue
> writing to bcache device will write to backing device directly. If the
> LBA of write request has a dirty version cached on cache device, next time
> when the cache device is re-registered and backing device re-attached to
> it again, the stale dirty data on cache device will be written to backing
> device, and overwrite latest directly written data. This situation causes
> a quite data corruption.
>
> But we cannot simply stop all attached bcache devices when the cache set is
> broken or disconnected. For example, use bcache to accelerate performance
> of an email service. In such workload, if cache device is broken but no
> dirty data lost, keep the bcache device alive and permit email service
> continue to access user data might be a better solution for the cache
> device failure.
>
> Nix <n...@esperi.org.uk> points out the issue and provides the above example
> to explain why it might be necessary to not stop bcache device for broken
> cache device. Pavel Goran <via-bca...@pvgoran.name> provides a brilliant
> suggestion to provide "always" and "auto" options to per-cached device
> sysfs file stop_when_cache_set_failed. If cache set is retiring and the
> backing device has no dirty data on cache, it should be safe to keep the
> bcache device alive. In this case, if stop_when_cache_set_failed is set to
> "auto", the device failure handling code will not stop this bcache device
> and permit application to access the backing device with a unattached
> bcache device.
>
> Changelog:
> v2: change option values of stop_when_cache_set_failed from 1/0 to
>     "auto"/"always".
> v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1
>     (always stop).
>
> Signed-off-by: Coly Li <col...@suse.de>
> Cc: Nix <n...@esperi.org.uk>
> Cc: Pavel Goran <via-bca...@pvgoran.name>
> Cc: Michael Lyle <ml...@lyle.org>
> Cc: Junhui Tang <tang.jun...@zte.com.cn>
> Cc: Hannes Reinecke <h...@suse.com>
> ---
>  drivers/md/bcache/bcache.h |  9 ++++
>  drivers/md/bcache/super.c  | 82 ++++++++++++++++++++++++++++++++++++++++++----
>  drivers/md/bcache/sysfs.c  | 17 ++++++++++
>  3 files changed, 98 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 7917b3820dd5..263164490833 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -287,6 +287,12 @@ struct io {
>  	sector_t		last;
>  };
>  
> +enum stop_on_faliure {
> +	BCH_CACHED_DEV_STOP_ATUO = 0,
> +	BCH_CACHED_DEV_STOP_ALWAYS,
> +	BCH_CACHED_DEV_STOP_MODE_MAX,
> +};
> +
>  struct cached_dev {
>  	struct list_head	list;
>  	struct bcache_device	disk;
> @@ -379,6 +385,8 @@ struct cached_dev {
>  	unsigned		writeback_rate_i_term_inverse;
>  	unsigned		writeback_rate_p_term_inverse;
>  	unsigned		writeback_rate_minimum;
> +
> +	enum stop_on_faliure	stop_when_cache_set_failed;
>  };
>  
>  enum alloc_reserve {
> @@ -924,6 +932,7 @@ void bch_write_bdev_super(struct cached_dev *, struct closure *);
>  
>  extern struct workqueue_struct *bcache_wq;
>  extern const char * const bch_cache_modes[];
> +extern const char * const bch_stop_on_failure_modes[];
>  extern struct mutex bch_register_lock;
>  extern struct list_head bch_cache_sets;
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index f8b0d1196c12..e335433bdfb7 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -47,6 +47,14 @@ const char * const bch_cache_modes[] = {
>  	NULL
>  };
>  
> +/* Default is -1; we skip past it for stop_when_cache_set_failed */
> +const char * const bch_stop_on_failure_modes[] = {
> +	"default",
> +	"auto",
> +	"always",
> +	NULL
> +};
> +
>  static struct kobject *bcache_kobj;
>  struct mutex bch_register_lock;
>  LIST_HEAD(bch_cache_sets);
> @@ -1187,6 +1195,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
>  		max(dc->disk.disk->queue->backing_dev_info->
Re: [PATCH v5 03/10] bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/05/2018 08:20 PM, Coly Li wrote:
> In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
> cached_dev_get() is called when creating dc->writeback_thread, and
> cached_dev_put() is called when exiting dc->writeback_thread. This
> modification works well unless people detach the bcache device manually by
> 'echo 1 > /sys/block/bcache/bcache/detach'
> Because this sysfs interface only calls bch_cached_dev_detach() which wakes
> up dc->writeback_thread but does not stop it. The reason is, before patch
> "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside
> bch_writeback_thread(), if cache is not dirty after writeback,
> cached_dev_put() will be called here. And in cached_dev_make_request() when
> a new write request makes cache from clean to dirty, cached_dev_get() will
> be called there. Since we don't operate dc->count in these locations,
> refcount d->count cannot be dropped after cache becomes clean, and
> cached_dev_detach_finish() won't be called to detach bcache device.
>
> This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
> set inside bch_writeback_thread(). If this bit is set and cache is clean
> (no existing writeback_keys), break the while-loop, call cached_dev_put()
> and quit the writeback thread.
>
> Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
> writeback thread should continue to perform writeback, this is the original
> design of manually detach.
>
> It is safe to do the following check without locking, let me explain why,
> +	if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
> +	    (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {
>
> If the kernel thread does not sleep and continues to run because conditions
> are not updated in time on the running CPU core, it just consumes more CPU
> cycles and does no harm. This should-sleep-but-run case is safe here. We just
> focus on the should-run-but-sleep condition, which means the writeback
> thread goes to sleep by mistake while it should continue to run.
> 1, First of all, no matter whether the writeback thread is hung or not,
>    kthread_stop() from cached_dev_detach_finish() will wake it up and
>    terminate it by making kthread_should_stop() return true. And in normal
>    run time, the bit at index BCACHE_DEV_DETACHING is always cleared, so
>    the condition
> 	!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)
>    is always true and can be ignored as a constant value.
> 2, If one of the following conditions is true, the writeback thread should
>    go to sleep,
>    "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running)"
>    each of them independently controls whether the writeback thread should
>    sleep or not; let's analyse them one by one.
> 2.1 condition "!atomic_read(&dc->has_dirty)"
>    If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will
>    call bch_writeback_queue() immediately or call bch_writeback_add() which
>    indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
>    wake_up_process(dc->writeback_thread) is called. It sets the writeback
>    thread's task state to TASK_RUNNING and, following an implicit memory
>    barrier, tries to wake up the writeback thread.
>    In the writeback thread, its task state is set to TASK_INTERRUPTIBLE
>    before doing the condition check. If the other CPU core sets the
>    TASK_RUNNING state after the writeback thread sets TASK_INTERRUPTIBLE,
>    the writeback thread will be scheduled to run very soon because its
>    state is not TASK_INTERRUPTIBLE. If the other CPU core sets the
>    TASK_RUNNING state before the writeback thread sets TASK_INTERRUPTIBLE,
>    the implicit memory barrier of wake_up_process() will make sure the
>    modification of dc->has_dirty on the other CPU core is updated and
>    observed on the CPU core of the writeback thread. Therefore the
>    condition check will correctly be false, and the writeback code
>    continues without sleeping.
> 2.2 condition "!dc->writeback_running)"
>    dc->writeback_running can be changed via a sysfs file; every time it is
>    modified, a bch_writeback_queue() call always follows. So the change is
>    always observed on the CPU core of the writeback thread. If
>    dc->writeback_running is changed from 0 to 1 on another CPU core, this
>    condition check will observe the modification and allow the writeback
>    thread to continue to run without sleeping.
> Now we can see, even without locking protection, the multiple condition
> checks are safe here; no deadlock or process hang up will happen.
>
> I compose a separate patch because that patch "bcache
Re: [PATCH v5 01/10] bcache: set writeback_rate_update_seconds in range [1, 60] seconds
LGTM-- in my staging branch.

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/05/2018 08:20 PM, Coly Li wrote:
> dc->writeback_rate_update_seconds can be set via sysfs and its value can
> be set to [1, ULONG_MAX]. It does not make sense to set such a large
> value, 60 seconds is long enough value considering the default 5 seconds
> works well for long time.
>
> Because dc->writeback_rate_update is a special delayed work, it re-arms
> itself inside the delayed work routine update_writeback_rate(). When
> stopping it by cancel_delayed_work_sync(), there should be a timeout to
> wait and make sure the re-armed delayed work is stopped too. A small max
> value of dc->writeback_rate_update_seconds is also helpful to decide a
> reasonable small timeout.
>
> This patch limits sysfs interface to set dc->writeback_rate_update_seconds
> in range of [1, 60] seconds, and replaces the hand-coded number by macros.
>
> Changelog:
> v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
> v1: initial version.
>
> Signed-off-by: Coly Li <col...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> Cc: Michael Lyle <ml...@lyle.org>
> ---
>  drivers/md/bcache/sysfs.c     | 4 +++-
>  drivers/md/bcache/writeback.c | 2 +-
>  drivers/md/bcache/writeback.h | 3 +++
>  3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index c524305cc9a7..4a6a697e1680 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -218,7 +218,9 @@ STORE(__cached_dev)
>  	sysfs_strtoul_clamp(writeback_rate,
>  			    dc->writeback_rate.rate, 1, INT_MAX);
>  
> -	d_strtoul_nonzero(writeback_rate_update_seconds);
> +	sysfs_strtoul_clamp(writeback_rate_update_seconds,
> +			    dc->writeback_rate_update_seconds,
> +			    1, WRITEBACK_RATE_UPDATE_SECS_MAX);
>  	d_strtoul(writeback_rate_i_term_inverse);
>  	d_strtoul_nonzero(writeback_rate_p_term_inverse);
>  
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 58218f7e77c3..f1d2fc15abcc 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -655,7 +655,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>  	dc->writeback_rate.rate		= 1024;
>  	dc->writeback_rate_minimum	= 8;
>  
> -	dc->writeback_rate_update_seconds = 5;
> +	dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
>  	dc->writeback_rate_p_term_inverse = 40;
>  	dc->writeback_rate_i_term_inverse = 10000;
>  
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index 66f1c527fa24..587b25599856 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -8,6 +8,9 @@
>  #define MAX_WRITEBACKS_IN_PASS  5
>  #define MAX_WRITESIZE_IN_PASS   5000	/* *512b */
>  
> +#define WRITEBACK_RATE_UPDATE_SECS_MAX		60
> +#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT	5
> +
>  /*
>  * 14 (16384ths) is chosen here as something that each backing device
>  * should be a reasonable fraction of the share, and not to blow up
>
Re: [PATCH] [PATCH v2] bcache: fix for allocator and register thread race
Hi Tang Junhui--

On 01/25/2018 04:30 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui <tang.jun...@zte.com.cn>
[snip]
> The stack shows the register thread and allocator thread
> were getting stuck when registering cache device.

This looks good-- applying to my 4.16-related test branch.

Reviewed-by: Michael Lyle <ml...@lyle.org>
Re: [PATCH v4 00/13] bcache: device failure handling improvement
On 01/27/2018 05:56 PM, Coly Li wrote:
> Hi maintainers and folks,

Hi Coly---

> This patch set tries to improve bcache device failure handling, includes
> cache device and backing device failures.

I've skimmed this whole patchset and overall it looks good. I have taken
some pieces (2,6) for possible-4.16. I understand you'll be sending out
a v5 soon-- I'll perform a more detailed review then.

Thanks,

Mike
Re: [PATCH v4 06/13] bcache: set error_limit correctly
On 01/27/2018 06:23 AM, Coly Li wrote: > Struct cache uses io_errors for two purposes, > - Error decay: when cache set error_decay is set, io_errors is used to > generate a small delay when an I/O error happens. > - I/O error counter: in order to generate a large enough value for error > decay, the I/O error counter value is stored left-shifted by 20 bits (a.k.a. > IO_ERROR_SHIFT). This LGTM. Reviewed-by: Michael Lyle <ml...@lyle.org>
Re: [PATCH v4 02/13] bcache: properly set task state in bch_writeback_thread()
On 01/27/2018 06:23 AM, Coly Li wrote: > Kernel thread routine bch_writeback_thread() has the following code block, > > 447 down_write(&dc->writeback_lock); > 448~450 if (check conditions) { > 451 up_write(&dc->writeback_lock); > 452 set_current_state(TASK_INTERRUPTIBLE); > 453 > 454 if (kthread_should_stop()) > 455 return 0; > 456 > 457 schedule(); > 458 continue; > 459 } > > If the condition check is true, the task state is set to TASK_INTERRUPTIBLE > and schedule() is called to wait for others to wake it up. > > There are 2 issues in the current code, > 1, Task state is set to TASK_INTERRUPTIBLE after the condition checks; if >another process changes the condition and calls wake_up_process(dc-> >writeback_thread), then at line 452 the task state is set back to >TASK_INTERRUPTIBLE and the writeback kernel thread will lose a chance to be >woken up. > 2, At line 454, if kthread_should_stop() is true, the writeback kernel thread >will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and >call do_exit(). It is not good to enter do_exit() with task state >TASK_INTERRUPTIBLE; in the following code path might_sleep() is called and a >warning message is reported by __might_sleep(): "WARNING: do not call >blocking ops when !TASK_RUNNING; state=1 set at []". > > For the first issue, the task state should be set before the condition checks. > Indeed, because dc->writeback_lock is required when modifying all the > conditions, calling set_current_state() inside a code block where dc-> > writeback_lock is held is safe. But this is quite implicit, so I still move > set_current_state() before all the condition checks. > > For the second issue, frankly speaking it does not hurt when a kernel thread > exits with TASK_INTERRUPTIBLE state, but this warning message scares users, > makes them feel there might be something risky with bcache that could hurt their > data. Setting the task state to TASK_RUNNING before returning fixes this > problem. 
> > In alloc.c:allocator_wait(), there is also a similar issue, and is also > fixed in this patch. > > Changelog: > v3: merge two similar fixes into one patch > v2: fix the race issue in v1 patch. > v1: initial buggy fix. > > Signed-off-by: Coly Li <col...@suse.de> > Reviewed-by: Hannes Reinecke <h...@suse.de> > Cc: Michael Lyle <ml...@lyle.org> > Cc: Junhui Tang <tang.jun...@zte.com.cn> This one LGTM-- applying to my staging branch Reviewed-by: Michael Lyle <ml...@lyle.org>
Re: [PATCH v4 01/13] bcache: set writeback_rate_update_seconds in range [1, 60] seconds
On 01/27/2018 06:23 AM, Coly Li wrote: > dc->writeback_rate_update_seconds can be set via sysfs and its value can > be set to [1, ULONG_MAX]. It does not make sense to set such a large > value; 60 seconds is long enough, considering that the default 5 seconds > has worked well for a long time. > > Because dc->writeback_rate_update is a special delayed work, it re-arms > itself inside the delayed work routine update_writeback_rate(). When > stopping it by cancel_delayed_work_sync(), there should be a timeout to > wait and make sure the re-armed delayed work is stopped too. A small max > value of dc->writeback_rate_update_seconds is also helpful to decide a > reasonably small timeout. > > This patch limits the sysfs interface to set dc->writeback_rate_update_seconds > in the range of [1, 60] seconds, and replaces the hand-coded numbers with macros. > > Signed-off-by: Coly Li > Reviewed-by: Hannes Reinecke [snip] > --- > drivers/md/bcache/sysfs.c | 3 +++ > drivers/md/bcache/writeback.c | 2 +- > drivers/md/bcache/writeback.h | 3 +++ > 3 files changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c > index b4184092c727..a74a752c9e0f 100644 > --- a/drivers/md/bcache/sysfs.c > +++ b/drivers/md/bcache/sysfs.c > @@ -215,6 +215,9 @@ STORE(__cached_dev) > sysfs_strtoul_clamp(writeback_rate, > dc->writeback_rate.rate, 1, INT_MAX); > > + sysfs_strtoul_clamp(writeback_rate_update_seconds, > + dc->writeback_rate_update_seconds, > + 1, WRITEBACK_RATE_UPDATE_SECS_MAX); > d_strtoul_nonzero(writeback_rate_update_seconds); Doesn't the above line need to be removed for this to work? Mike
Re: [PATCH 2/2] bcache: fix high CPU occupancy during journal
Unfortunately, this doesn't build because of nonexistent call heap_empty (I assume some changes to util.h got left out). I really need clean patches that build and are formatted properly. Mike On 01/26/2018 12:24 AM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui> > After long time small writing I/O running, we found the occupancy of CPU > is very high and I/O performance has been reduced by about half: > > [root@ceph151 internal]# top > top - 15:51:05 up 1 day,2:43, 4 users, load average: 16.89, 15.15, 16.53 > Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie > %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st > KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache > KiB Swap: 65667068 total, 65667068 free,0 used. 25136812 avail Mem > > PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND > 2023 root 20 0 0 0 0 S 55.1 0.0 0:04.42 kworker/11:191 > 14126 root 20 0 0 0 0 S 42.9 0.0 0:08.72 kworker/10:3 > 9292 root 20 0 0 0 0 S 30.4 0.0 1:10.99 kworker/6:1 > 8553 ceph 20 0 4242492 1.805g 18804 S 30.0 2.9 410:07.04 ceph-osd > 12287 root 20 0 0 0 0 S 26.7 0.0 0:28.13 kworker/7:85 > 31019 root 20 0 0 0 0 S 26.1 0.0 1:30.79 kworker/22:1 > 1787 root 20 0 0 0 0 R 25.7 0.0 5:18.45 kworker/8:7 > 32169 root 20 0 0 0 0 S 14.5 0.0 1:01.92 kworker/23:1 > 21476 root 20 0 0 0 0 S 13.9 0.0 0:05.09 kworker/1:54 > 2204 root 20 0 0 0 0 S 12.5 0.0 1:25.17 kworker/9:10 > 16994 root 20 0 0 0 0 S 12.2 0.0 0:06.27 kworker/5:106 > 15714 root 20 0 0 0 0 R 10.9 0.0 0:01.85 kworker/19:2 > 9661 ceph 20 0 4246876 1.731g 18800 S 10.6 2.8 403:00.80 ceph-osd > 11460 ceph 20 0 4164692 2.206g 18876 S 10.6 3.5 360:27.19 ceph-osd > 9960 root 20 0 0 0 0 S 10.2 0.0 0:02.75 kworker/2:139 > 11699 ceph 20 0 4169244 1.920g 18920 S 10.2 3.1 355:23.67 ceph-osd > 6843 ceph 20 0 4197632 1.810g 18900 S 9.6 2.9 380:08.30 ceph-osd > > The kernel work consumed a lot of CPU, and I found they are running journal > work, The journal is reclaiming source and 
flush btree nodes with surprising > frequency. > > Through further analysis, we found that in btree_flush_write(), we try to > get the btree node with the smallest fifo index to flush by traversing all the > btree nodes in c->bucket_hash; after we get it, since no lock protects > it, this btree node may have been written to the cache device by other work, > and if this occurred, we retry the traversal of c->bucket_hash and get > another btree node. When the problem occurred, the retry count was very high, > and we consumed a lot of CPU looking for an appropriate btree node. > > In this patch, we record the 128 btree nodes with the smallest fifo index > in a heap, and pop them one by one when we need to flush a btree node. It greatly > reduces the time for the loop to find the appropriate btree node, and also > reduces CPU occupancy. > > Signed-off-by: Tang Junhui > --- > drivers/md/bcache/bcache.h | 2 ++ > drivers/md/bcache/journal.c | 47 > ++--- > 2 files changed, 34 insertions(+), 15 deletions(-) > > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > index 0432e28..b343ba4 100644 > --- a/drivers/md/bcache/bcache.h > +++ b/drivers/md/bcache/bcache.h > @@ -669,6 +669,8 @@ struct cache_set { > > #define BUCKET_HASH_BITS 12 > struct hlist_head bucket_hash[1 << BUCKET_HASH_BITS]; > + > + DECLARE_HEAP(struct btree *, flush_btree); > }; > > struct bbio { > diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c > index 47fd0b8..f42d3ea 100644 > --- a/drivers/md/bcache/journal.c > +++ b/drivers/md/bcache/journal.c > @@ -363,6 +363,12 @@ int bch_journal_replay(struct cache_set *s, struct > list_head *list) > } > > /* Journalling */ > +#define journal_max_cmp(l, r) \ > + (fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) < \ > + fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal)) > +#define journal_min_cmp(l, r) \ > + (fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) > \ > + fifo_idx(&(c)->journal.pin, 
btree_current_write(r)->journal)) > > static void btree_flush_write(struct cache_set *c) > { > @@ -370,25 +376,35 @@ static void btree_flush_write(struct cache_set *c) >* Try to find the btree node with that references the oldest journal >* entry, best is our current candidate and is locked if non NULL: >*/ > - struct btree *b, *best; > - unsigned i; > + struct btree *b; > + int i; > > atomic_long_inc(&c->flush_write); > + > retry: > - best = NULL; > - > - for_each_cached_btree(b, c, i) > -
Re: [PATCH 2/2] bcache: fix high CPU occupancy during journal
LGTM on first read-- I'll read it again and test in test branch. Reviewed-by: Michael Lyle <ml...@lyle.org> On 01/26/2018 12:24 AM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui <tang.jun...@zte.com.cn> > > After long time small writing I/O running, we found the occupancy of CPU > is very high and I/O performance has been reduced by about half: > > [root@ceph151 internal]# top > top - 15:51:05 up 1 day,2:43, 4 users, load average: 16.89, 15.15, 16.53 > Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie > %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st > KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache > KiB Swap: 65667068 total, 65667068 free,0 used. 25136812 avail Mem > > PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND > 2023 root 20 0 0 0 0 S 55.1 0.0 0:04.42 kworker/11:191 > 14126 root 20 0 0 0 0 S 42.9 0.0 0:08.72 kworker/10:3 > 9292 root 20 0 0 0 0 S 30.4 0.0 1:10.99 kworker/6:1 > 8553 ceph 20 0 4242492 1.805g 18804 S 30.0 2.9 410:07.04 ceph-osd > 12287 root 20 0 0 0 0 S 26.7 0.0 0:28.13 kworker/7:85 > 31019 root 20 0 0 0 0 S 26.1 0.0 1:30.79 kworker/22:1 > 1787 root 20 0 0 0 0 R 25.7 0.0 5:18.45 kworker/8:7 > 32169 root 20 0 0 0 0 S 14.5 0.0 1:01.92 kworker/23:1 > 21476 root 20 0 0 0 0 S 13.9 0.0 0:05.09 kworker/1:54 > 2204 root 20 0 0 0 0 S 12.5 0.0 1:25.17 kworker/9:10 > 16994 root 20 0 0 0 0 S 12.2 0.0 0:06.27 kworker/5:106 > 15714 root 20 0 0 0 0 R 10.9 0.0 0:01.85 kworker/19:2 > 9661 ceph 20 0 4246876 1.731g 18800 S 10.6 2.8 403:00.80 ceph-osd > 11460 ceph 20 0 4164692 2.206g 18876 S 10.6 3.5 360:27.19 ceph-osd > 9960 root 20 0 0 0 0 S 10.2 0.0 0:02.75 kworker/2:139 > 11699 ceph 20 0 4169244 1.920g 18920 S 10.2 3.1 355:23.67 ceph-osd > 6843 ceph 20 0 4197632 1.810g 18900 S 9.6 2.9 380:08.30 ceph-osd > > The kernel work consumed a lot of CPU, and I found they are running journal > work, The journal is reclaiming source and flush btree node with surprising > frequency. 
> > Through further analysis, we found that in btree_flush_write(), we try to > get the btree node with the smallest fifo index to flush by traversing all the > btree nodes in c->bucket_hash; after we get it, since no lock protects > it, this btree node may have been written to the cache device by other work, > and if this occurred, we retry the traversal of c->bucket_hash and get > another btree node. When the problem occurred, the retry count was very high, > and we consumed a lot of CPU looking for an appropriate btree node. > > In this patch, we record the 128 btree nodes with the smallest fifo index > in a heap, and pop them one by one when we need to flush a btree node. It greatly > reduces the time for the loop to find the appropriate btree node, and also > reduces CPU occupancy. > > Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> > --- > drivers/md/bcache/bcache.h | 2 ++ > drivers/md/bcache/journal.c | 47 > ++--- > 2 files changed, 34 insertions(+), 15 deletions(-) > > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > index 0432e28..b343ba4 100644 > --- a/drivers/md/bcache/bcache.h > +++ b/drivers/md/bcache/bcache.h > @@ -669,6 +669,8 @@ struct cache_set { > > #define BUCKET_HASH_BITS 12 > struct hlist_head bucket_hash[1 << BUCKET_HASH_BITS]; > + > + DECLARE_HEAP(struct btree *, flush_btree); > }; > > struct bbio { > diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c > index 47fd0b8..f42d3ea 100644 > --- a/drivers/md/bcache/journal.c > +++ b/drivers/md/bcache/journal.c > @@ -363,6 +363,12 @@ int bch_journal_replay(struct cache_set *s, struct > list_head *list) > } > > /* Journalling */ > +#define journal_max_cmp(l, r) \ > + (fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) < \ > + fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal)) > +#define journal_min_cmp(l, r) \ > + (fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) > \ > + fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal)) > >
static void btree_flush_write(struct cache_set *c) > { > @@ -370,25 +376,35 @@ static void btree_flush_write(struct cache_set *c) >* Try to find the btree node with that references the oldest journal >* entry, best is our current candidate and i
Re: [PATCH 1/2] bcache: add journal statistic
LGTM except for formatting / an extra newline (will fix) -- in my test branch for possible 4.16 Reviewed-by: Michael Lyle <ml...@lyle.org> On 01/26/2018 12:23 AM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui <tang.jun...@zte.com.cn> > > Sometimes, the journal takes up a lot of CPU; we need statistics > to know what the journal is doing. So this patch provides > some journal statistics: > 1) reclaim: how many times the journal tries to reclaim resources, >usually when the journal buckets or/and the pins are exhausted. > 2) flush_write: how many times the journal tries to flush btree nodes >to the cache device, usually when the journal buckets are exhausted. > 3) retry_flush_write: how many times the journal retries flushing >the next btree node, usually when the previous btree node has been >flushed by another thread. > We expose these statistics via the sysfs interface. Through these statistics > we can fully see the status of the journal module when the CPU usage is too > high. > > Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> > --- > drivers/md/bcache/bcache.h | 5 + > drivers/md/bcache/journal.c | 5 + > drivers/md/bcache/sysfs.c | 15 +++ > 3 files changed, 25 insertions(+) > > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > index abd31e8..0432e28 100644 > --- a/drivers/md/bcache/bcache.h > +++ b/drivers/md/bcache/bcache.h > @@ -647,6 +647,11 @@ struct cache_set { > atomic_long_t writeback_keys_done; > atomic_long_t writeback_keys_failed; > > + > + atomic_long_t reclaim; > + atomic_long_t flush_write; > + atomic_long_t retry_flush_write; > + > enum{ > ON_ERROR_UNREGISTER, > ON_ERROR_PANIC, > diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c > index 02a98dd..47fd0b8 100644 > --- a/drivers/md/bcache/journal.c > +++ b/drivers/md/bcache/journal.c > @@ -372,6 +372,8 @@ static void btree_flush_write(struct cache_set *c) >*/ > struct btree *b, *best; > unsigned i; > + > + atomic_long_inc(&c->flush_write); > retry: > best = NULL; > > @@ -392,6 +394,7 @@ static void btree_flush_write(struct cache_set *c) > if (!btree_current_write(b)->journal) { > mutex_unlock(&b->write_lock); > /* We raced */ > + atomic_long_inc(&c->retry_flush_write); > goto retry; > } > > @@ -471,6 +474,8 @@ static void journal_reclaim(struct cache_set *c) > unsigned iter, n = 0; > atomic_t p; > > + atomic_long_inc(&c->reclaim); > + > while (!atomic_read(&fifo_front(&c->journal.pin))) > fifo_pop(&c->journal.pin, p); > > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c > index 234b2f5..0afbf1a 100644 > --- a/drivers/md/bcache/sysfs.c > +++ b/drivers/md/bcache/sysfs.c > @@ -65,6 +65,9 @@ > > read_attribute(state); > read_attribute(cache_read_races); > +read_attribute(reclaim); > +read_attribute(flush_write); > +read_attribute(retry_flush_write); > read_attribute(writeback_keys_done); > read_attribute(writeback_keys_failed); > read_attribute(io_errors); > @@ -543,6 +546,15 @@ static unsigned bch_average_key_size(struct cache_set *c) > sysfs_print(cache_read_races, > atomic_long_read(&c->cache_read_races)); > > + sysfs_print(reclaim, > + atomic_long_read(&c->reclaim)); > + > + sysfs_print(flush_write, > + atomic_long_read(&c->flush_write)); > + > + sysfs_print(retry_flush_write, > + atomic_long_read(&c->retry_flush_write)); > + > sysfs_print(writeback_keys_done, > atomic_long_read(&c->writeback_keys_done)); > sysfs_print(writeback_keys_failed, > @@ -729,6 +741,9 @@ static void bch_cache_set_internal_release(struct kobject *k) > > &sysfs_bset_tree_stats, > &sysfs_cache_read_races, > + &sysfs_reclaim, > + &sysfs_flush_write, > + &sysfs_retry_flush_write, > &sysfs_writeback_keys_done, > &sysfs_writeback_keys_failed, > >
Re: how to understand bcache
On Wed, Jan 24, 2018 at 1:34 AM, wrote: > you can also ask our dear maintainer Mike, who knows all about bcache, > but sometimes he is very busy. Normally you would hear nothing > from him for one week or two, sometimes maybe three, but never five, > since the merge window comes. ;) I didn't think I was this bad. I'll try to do better. Mike
Re: [PATCH v3 00/13] bcache: device failure handling improvement
Hey Coly, On Thu, Jan 25, 2018 at 8:56 PM, Coly Li wrote: > In order to make the failure handling simple and fast, I will not > distinguish, for each bcache device, whether it has dirty data on the cache set. Once > the cache set is dirty, all attached bcache devices will be stopped. Surely it shouldn't stop any writethrough/writearound ones though, right? > It seems when this option is enabled by users, it should work for > writeback and writethrough, and no side effect to writearound and none. Writethrough should be the same as writearound, shouldn't it? Thanks, Mike
[PATCH] bcache: closures: move control bits one bit right
Otherwise, architectures that do negated adds of atomics (e.g. s390) to do atomic_sub fail in closure_set_stopped. Signed-off-by: Michael Lyle <ml...@lyle.org> Cc: Kent Overstreet <kent.overstr...@gmail.com> Reported-by: kbuild test robot <l...@intel.com> --- drivers/md/bcache/closure.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h index 392a87cf1b92..3b9dfc9962ad 100644 --- a/drivers/md/bcache/closure.h +++ b/drivers/md/bcache/closure.h @@ -127,10 +127,10 @@ enum closure_state { * annotate where references are being transferred. */ - CLOSURE_BITS_START = (1U << 27), - CLOSURE_DESTRUCTOR = (1U << 27), - CLOSURE_WAITING = (1U << 29), - CLOSURE_RUNNING = (1U << 31), + CLOSURE_BITS_START = (1U << 26), + CLOSURE_DESTRUCTOR = (1U << 26), + CLOSURE_WAITING = (1U << 28), + CLOSURE_RUNNING = (1U << 30), }; #define CLOSURE_GUARD_MASK \ -- 2.14.1
[416 PATCH 12/13] bcache: fix misleading error message in bch_count_io_errors()
From: Coly Li <col...@suse.de> Bcache only does recoverable I/O for read operations, by calling cached_dev_read_error(). For write operations there is no I/O recovery for failed requests. But in bch_count_io_errors(), no matter whether the I/O is a read or a write, before the error counter reaches the I/O error limit, pr_err() always prints "IO error on %s, recovering". For write requests this information is misleading, because there is no I/O recovery at all. This patch adds a parameter 'is_read' to bch_count_io_errors(), and pr_err() only prints "recovering" when the bio direction is READ. Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Michael Lyle <ml...@lyle.org> Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn> --- drivers/md/bcache/bcache.h| 2 +- drivers/md/bcache/io.c| 13 + drivers/md/bcache/super.c | 4 +++- drivers/md/bcache/writeback.c | 4 +++- 4 files changed, 16 insertions(+), 7 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 9117da5f494b..5e2d4e80198e 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -862,7 +862,7 @@ static inline void wake_up_allocators(struct cache_set *c) /* Forward declarations */ -void bch_count_io_errors(struct cache *, blk_status_t, const char *); +void bch_count_io_errors(struct cache *, blk_status_t, int, const char *); void bch_bbio_count_io_errors(struct cache_set *, struct bio *, blk_status_t, const char *); void bch_bbio_endio(struct cache_set *, struct bio *, blk_status_t, diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c index fac97ec2d0e2..a783c5a41ff1 100644 --- a/drivers/md/bcache/io.c +++ b/drivers/md/bcache/io.c @@ -51,7 +51,10 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c, /* IO errors */ -void bch_count_io_errors(struct cache *ca, blk_status_t error, const char *m) +void bch_count_io_errors(struct cache *ca, +blk_status_t error, +int is_read, +const char *m) { /* * The halflife of an error is: @@ -94,8 +97,9 @@ void bch_count_io_errors(struct cache *ca, blk_status_t error, const char *m) errors >>= IO_ERROR_SHIFT; if (errors < ca->set->error_limit) - pr_err("%s: IO error on %s, recovering", - bdevname(ca->bdev, buf), m); + pr_err("%s: IO error on %s%s", + bdevname(ca->bdev, buf), m, + is_read ? ", recovering." : "."); else bch_cache_set_error(ca->set, "%s: too many IO errors %s", @@ -108,6 +112,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio, { struct bbio *b = container_of(bio, struct bbio, bio); struct cache *ca = PTR_CACHE(c, &b->key, 0); + int is_read = (bio_data_dir(bio) == READ ? 1 : 0); unsigned threshold = op_is_write(bio_op(bio)) ? c->congested_write_threshold_us @@ -129,7 +134,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio, atomic_inc(&ca->congested); } - bch_count_io_errors(ca, error, m); + bch_count_io_errors(ca, error, is_read, m); } void bch_bbio_endio(struct cache_set *c, struct bio *bio, diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index d13e4ccb30a0..133b81225ea9 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -274,7 +274,9 @@ static void write_super_endio(struct bio *bio) { struct cache *ca = bio->bi_private; - bch_count_io_errors(ca, bio->bi_status, "writing superblock"); + /* is_read = 0 */ + bch_count_io_errors(ca, bio->bi_status, 0, + "writing superblock"); closure_put(&ca->set->sb_write); } diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index f82ffb2e9b9b..31b0a292a619 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -244,8 +244,10 @@ static void read_dirty_endio(struct bio *bio) struct keybuf_key *w = bio->bi_private; struct dirty_io *io = w->private; + /* is_read = 1 */ bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0), - bio->bi_status, "reading dirty data from cache"); + bio->bi_status, 1, + "reading dirty data from cache"); dirty_endio(bio); } -- 2.14.1
[416 PATCH 13/13] bcache: fix writeback target calc on large devices
Bcache needs to scale the dirty data in the cache over the multiple backing disks in order to calculate writeback rates for each. The previous code did this by multiplying the target number of dirty sectors by the backing device size, and expected it to fit into a uint64_t; this blows up on relatively small backing devices. The new approach figures out the bdev's share in 16384ths of the overall cached data. This is chosen to cope well when bdevs drastically vary in size and to ensure that bcache can cross the petabyte boundary for each backing device. This has been improved based on Tang Junhui's feedback to ensure that every device gets a share of dirty data, no matter how small it is compared to the total backing pool. The existing mechanism is very limited; this is purely a bug fix to remove limits on volume size. However, there still needs to be change to make this "fair" over many volumes where some are idle. Reported-by: Jack Douglas <j...@douglastechnology.co.uk> Signed-off-by: Michael Lyle <ml...@lyle.org> Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn> --- drivers/md/bcache/writeback.c | 31 +++ drivers/md/bcache/writeback.h | 7 +++ 2 files changed, 34 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 31b0a292a619..51306a19ab03 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -18,17 +18,39 @@ #include /* Rate limiting */ - -static void __update_writeback_rate(struct cached_dev *dc) +static uint64_t __calc_target_rate(struct cached_dev *dc) { struct cache_set *c = dc->disk.c; + + /* +* This is the size of the cache, minus the amount used for +* flash-only devices +*/ uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size - bcache_flash_devs_sectors_dirty(c); + + /* +* Unfortunately there is no control of global dirty data. 
If the +* user states that they want 10% dirty data in the cache, and has, +* e.g., 5 backing volumes of equal size, we try and ensure each +* backing volume uses about 2% of the cache for dirty data. +*/ + uint32_t bdev_share = + div64_u64(bdev_sectors(dc->bdev) << WRITEBACK_SHARE_SHIFT, + c->cached_dev_sectors); + uint64_t cache_dirty_target = div_u64(cache_sectors * dc->writeback_percent, 100); - int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev), - c->cached_dev_sectors); + /* Ensure each backing dev gets at least one dirty share */ + if (bdev_share < 1) + bdev_share = 1; + + return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT; +} + +static void __update_writeback_rate(struct cached_dev *dc) +{ /* * PI controller: * Figures out the amount that should be written per second. @@ -49,6 +71,7 @@ static void __update_writeback_rate(struct cached_dev *dc) * This acts as a slow, long-term average that is not subject to * variations in usage like the p term. */ + int64_t target = __calc_target_rate(dc); int64_t dirty = bcache_dev_sectors_dirty(&dc->disk); int64_t error = dirty - target; int64_t proportional_scaled = diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h index f102b1f9bc51..66f1c527fa24 100644 --- a/drivers/md/bcache/writeback.h +++ b/drivers/md/bcache/writeback.h @@ -8,6 +8,13 @@ #define MAX_WRITEBACKS_IN_PASS 5 #define MAX_WRITESIZE_IN_PASS 5000 /* *512b */ +/* + * 14 (16384ths) is chosen here as something that each backing device + * should be a reasonable fraction of the share, and not to blow up + * until individual backing devices are a petabyte. + */ +#define WRITEBACK_SHARE_SHIFT 14 + static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d) { uint64_t i, ret = 0; -- 2.14.1
[416 PATCH 10/13] bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()
From: Zhai Zhaoxuan <kxuan...@gmail.com> The functions cached_dev_make_request() and flash_dev_make_request() call generic_start_io_acct() with (struct bcache_device)->disk when they start a closure. Then the function bio_complete() calls generic_end_io_acct() with (struct search)->orig_bio->bi_disk when the closure is done. Since `bi_disk` is not the bcache device, generic_end_io_acct() is called with the wrong device queue. This causes the "inflight" counter (in struct hd_struct) to keep increasing without decreasing. This patch fixes the problem by calling generic_end_io_acct() with (struct bcache_device)->disk. Signed-off-by: Zhai Zhaoxuan <kxuan...@gmail.com> Reviewed-by: Michael Lyle <ml...@lyle.org> Reviewed-by: Coly Li <col...@suse.de> Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn> --- drivers/md/bcache/request.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index ddd941056f3c..1a46b41dac70 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -633,8 +633,8 @@ static void request_endio(struct bio *bio) static void bio_complete(struct search *s) { if (s->orig_bio) { - struct request_queue *q = s->orig_bio->bi_disk->queue; - generic_end_io_acct(q, bio_data_dir(s->orig_bio), + generic_end_io_acct(s->d->disk->queue, + bio_data_dir(s->orig_bio), &s->d->disk->part0, s->start_time); trace_bcache_request_end(s->d, s->orig_bio); -- 2.14.1
[416 PATCH 02/13] bcache: stop writeback thread after detaching
From: Tang Junhui <tang.jun...@zte.com.cn> Currently, when a cached device is detached from the cache, the writeback thread is not stopped and the writeback_rate_update work is not canceled. For example, after the following command: echo 1 >/sys/block/sdb/bcache/detach you can still see the writeback thread. Then, when you attach the device to the cache again, bcache will create another writeback thread; for example, after the command below: echo ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach you will see 2 writeback threads. This patch stops the writeback thread and cancels the writeback_rate_update work when a cached device is detached from the cache. Compared with patch v1, this v2 patch moves the code down into the register lock for safety in case of any future changes, as Coly and Mike suggested. [edit by mlyle: commit log spelling/formatting] Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn> Reviewed-by: Michael Lyle <ml...@lyle.org> Signed-off-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/super.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 8399fe0651f2..553e841e897d 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -906,6 +906,12 @@ static void cached_dev_detach_finish(struct work_struct *w) mutex_lock(&bch_register_lock); + cancel_delayed_work_sync(&dc->writeback_rate_update); + if (!IS_ERR_OR_NULL(dc->writeback_thread)) { + kthread_stop(dc->writeback_thread); + dc->writeback_thread = NULL; + } + memset(&dc->sb.set_uuid, 0, 16); SET_BDEV_STATE(&dc->sb, BDEV_STATE_NONE); -- 2.14.1
[416 PATCH 11/13] bcache: reduce cache_set devices iteration by devices_max_used
From: Coly Li <col...@suse.de> The member 'devices' of struct cache_set is used to reference all bcache devices attached to this cache set. It is treated as an array of pointers, and the size of devices[] is indicated by the member nr_uuids of struct cache_set. nr_uuids is calculated in drivers/md/super.c:bch_cache_set_alloc(), bucket_bytes(c) / sizeof(struct uuid_entry) Bucket size is determined by the user space tool "make-bcache"; by default it is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So the default nr_uuids value is 4096 from the above calculation. Every time the bcache code iterates over the bcache devices of a cache set, all 4096 pointers are checked even if only 1 bcache device is attached to the cache set; that's a waste of time and unnecessary. This patch adds a member devices_max_used to struct cache_set. Its value is 1 + the maximum used index of devices[] in a cache set. When iterating over all valid bcache devices of a cache set, using c->devices_max_used in the for-loop avoids a lot of useless checking. Personally, my motivation for this patch is not performance; I use it in bcache debugging, where it helps me to narrow down the scope when checking the valid bcache devices of a cache set. 
Signed-off-by: Coly Li <col...@suse.de> Reviewed-by: Michael Lyle <ml...@lyle.org> Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn> --- drivers/md/bcache/bcache.h| 1 + drivers/md/bcache/btree.c | 2 +- drivers/md/bcache/super.c | 9 ++--- drivers/md/bcache/writeback.h | 2 +- 4 files changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 5f7b0b2513cc..9117da5f494b 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -497,6 +497,7 @@ struct cache_set { int caches_loaded; struct bcache_device**devices; + unsigneddevices_max_used; struct list_headcached_devs; uint64_tcached_dev_sectors; struct closure caching; diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 9e30713dbdb8..bf3a48aa9a9a 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -1679,7 +1679,7 @@ static void bch_btree_gc_finish(struct cache_set *c) /* don't reclaim buckets to which writeback keys point */ rcu_read_lock(); - for (i = 0; i < c->nr_uuids; i++) { + for (i = 0; i < c->devices_max_used; i++) { struct bcache_device *d = c->devices[i]; struct cached_dev *dc; struct keybuf_key *w, *n; diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 553e841e897d..d13e4ccb30a0 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -721,6 +721,9 @@ static void bcache_device_attach(struct bcache_device *d, struct cache_set *c, d->c = c; c->devices[id] = d; + if (id >= c->devices_max_used) + c->devices_max_used = id + 1; + closure_get(&c->caching); } @@ -1267,7 +1270,7 @@ static int flash_devs_run(struct cache_set *c) struct uuid_entry *u; for (u = c->uuids; -u < c->uuids + c->nr_uuids && !ret; +u < c->uuids + c->devices_max_used && !ret; u++) if (UUID_FLASH_ONLY(u)) ret = flash_dev_run(c, u); @@ -1433,7 +1436,7 @@ static void __cache_set_unregister(struct closure *cl) mutex_lock(&bch_register_lock); - for (i = 0; i < c->nr_uuids; i++) + for (i = 0; i < c->devices_max_used; i++) if (c->devices[i]) { if (!UUID_FLASH_ONLY(&c->uuids[i]) && test_bit(CACHE_SET_UNREGISTERING, &c->flags)) { ... } @@ -1496,7 +1499,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) c->bucket_bits = ilog2(sb->bucket_size); c->block_bits = ilog2(sb->block_size); c->nr_uuids = bucket_bytes(c) / sizeof(struct uuid_entry); - + c->devices_max_used = 0; c->btree_pages = bucket_pages(c); if (c->btree_pages > BTREE_MAX_PAGES) c->btree_pages = max_t(int, c->btree_pages / 4, diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h index 6d26927267f8..f102b1f9bc51 100644 --- a/drivers/md/bcache/writeback.h +++ b/drivers/md/bcache/writeback.h @@ -24,7 +24,7 @@ static inline uint64_t bcache_flash_devs_sectors_dirty(struct cache_set *c) mutex_lock(&bch_register_lock); - for (i = 0; i < c->nr_uuids; i++) { + for (i = 0; i < c->devices_max_used; i++) { struct bcache_device *d = c->devices[i]; if (!d || !UUID_FLASH_ONLY(&c->uuids[i])) -- 2.14.1
[416 PATCH 09/13] bcache: mark closure_sync() __sched
From: Kent Overstreet <kent.overstr...@gmail.com>

[edit by mlyle: include sched/debug.h to get __sched]

Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/closure.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index ca7ace6962a4..7f12920c14f7 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -8,6 +8,7 @@
 #include <linux/debugfs.h>
 #include <linux/module.h>
 #include <linux/seq_file.h>
+#include <linux/sched/debug.h>
 
 #include "closure.h"
 
@@ -107,7 +108,7 @@ static void closure_sync_fn(struct closure *cl)
 	wake_up_process(cl->s->task);
 }
 
-void __closure_sync(struct closure *cl)
+void __sched __closure_sync(struct closure *cl)
 {
 	struct closure_syncer s = { .task = current };
 
-- 
2.14.1
[416 PATCH 08/13] bcache: Fix, improve efficiency of closure_sync()
From: Kent Overstreet <kent.overstr...@gmail.com>

Eliminates cases where sync can race and fail to complete / get stuck. Removes many status flags and simplifies entering-and-exiting closure sleeping behaviors.

[mlyle: fixed conflicts due to changed return behavior in mainline. extended commit comment, and squashed down two commits that were mostly contradictory to get to this state. Changed __set_current_state to set_current_state per Jens' review comment]

Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/closure.c | 46 +-
 drivers/md/bcache/closure.h | 60 +
 2 files changed, 46 insertions(+), 60 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c index 1841d0359bac..ca7ace6962a4 100644 --- a/drivers/md/bcache/closure.c +++ b/drivers/md/bcache/closure.c @@ -18,10 +18,6 @@ static inline void closure_put_after_sub(struct closure *cl, int flags) BUG_ON(flags & CLOSURE_GUARD_MASK); BUG_ON(!r && (flags & ~CLOSURE_DESTRUCTOR)); - /* Must deliver precisely one wakeup */ - if (r == 1 && (flags & CLOSURE_SLEEPING)) - wake_up_process(cl->task); - if (!r) { if (cl->fn && !(flags & CLOSURE_DESTRUCTOR)) { atomic_set(>remaining, @@ -100,28 +96,34 @@ bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl) } EXPORT_SYMBOL(closure_wait); -/** - * closure_sync - sleep until a closure has nothing left to wait on - * - * Sleeps until the refcount hits 1 - the thread that's running the closure owns - * the last refcount.
- */ -void closure_sync(struct closure *cl) +struct closure_syncer { + struct task_struct *task; + int done; +}; + +static void closure_sync_fn(struct closure *cl) { - while (1) { - __closure_start_sleep(cl); - closure_set_ret_ip(cl); + cl->s->done = 1; + wake_up_process(cl->s->task); +} - if ((atomic_read(>remaining) & -CLOSURE_REMAINING_MASK) == 1) - break; +void __closure_sync(struct closure *cl) +{ + struct closure_syncer s = { .task = current }; + cl->s = + continue_at(cl, closure_sync_fn, NULL); + + while (1) { + set_current_state(TASK_UNINTERRUPTIBLE); + if (s.done) + break; schedule(); } - __closure_end_sleep(cl); + __set_current_state(TASK_RUNNING); } -EXPORT_SYMBOL(closure_sync); +EXPORT_SYMBOL(__closure_sync); #ifdef CONFIG_BCACHE_CLOSURES_DEBUG @@ -168,12 +170,10 @@ static int debug_seq_show(struct seq_file *f, void *data) cl, (void *) cl->ip, cl->fn, cl->parent, r & CLOSURE_REMAINING_MASK); - seq_printf(f, "%s%s%s%s\n", + seq_printf(f, "%s%s\n", test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(>work)) ? "Q" : "", - r & CLOSURE_RUNNING ? "R" : "", - r & CLOSURE_STACK? "S" : "", - r & CLOSURE_SLEEPING ? "Sl" : ""); + r & CLOSURE_RUNNING ? "R" : ""); if (r & CLOSURE_WAITING) seq_printf(f, " W %pF\n", diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h index ccfbea6f9f6b..392a87cf1b92 100644 --- a/drivers/md/bcache/closure.h +++ b/drivers/md/bcache/closure.h @@ -103,6 +103,7 @@ */ struct closure; +struct closure_syncer; typedef void (closure_fn) (struct closure *); struct closure_waitlist { @@ -115,10 +116,6 @@ enum closure_state { * the thread that owns the closure, and cleared by the thread that's * waking up the closure. * -* CLOSURE_SLEEPING: Must be set before a thread uses a closure to sleep -* - indicates that cl->task is valid and closure_put() may wake it up. -* Only set or cleared by the thread that owns the closure. 
-* * The rest are for debugging and don't affect behaviour: * * CLOSURE_RUNNING: Set when a closure is running (i.e. by @@ -128,22 +125,16 @@ enum closure_state { * continue_at() and closure_return() clear it for you, if you're doing * something unusual you can use closure_set_dead() which also helps * annotate where references are being transferred. -* -* CLOSURE_STA
[416 PATCH 05/13] bcache: fix wrong return value in bch_debug_init()
From: Tang Junhui <tang.jun...@zte.com.cn>

In bch_debug_init(), ret is always 0, so the return value is useless. Change it to return 0 on success after calling debugfs_create_dir(), and a non-zero value otherwise.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bcache.h| 6 ---
 drivers/md/bcache/debug.c | 5 +-
 drivers/md/bcache/writeback.c | 120 +-
 drivers/md/bcache/writeback.h | 3 ++
 4 files changed, 87 insertions(+), 47 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 843877e017e1..1784e50eb857 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -323,12 +323,6 @@ struct cached_dev { struct bch_ratelimitwriteback_rate; struct delayed_work writeback_rate_update; - /* -* Internal to the writeback code, so read_dirty() can keep track of -* where it's at. -*/ - sector_tlast_read; - /* Limit number of writeback bios in flight */ struct semaphorein_flight; struct task_struct *writeback_thread;

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 879ab21074c6..af89408befe8 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -251,8 +251,7 @@ void bch_debug_exit(void) int __init bch_debug_init(struct kobject *kobj) { - int ret = 0; - debug = debugfs_create_dir("bcache", NULL); - return ret; + + return IS_ERR_OR_NULL(debug); }

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 1ac2af6128b1..479095987f22 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -237,7 +237,9 @@ static void read_dirty_submit(struct closure *cl) static void read_dirty(struct cached_dev *dc) { unsigned delay = 0; - struct keybuf_key *w; + struct keybuf_key *next, *keys[MAX_WRITEBACKS_IN_PASS], *w; + size_t size; + int nk, i; struct dirty_io *io; struct closure cl; @@ -248,45 +250,87 @@ static void read_dirty(struct cached_dev *dc) * mempools.
*/ - while (!kthread_should_stop()) { - - w = bch_keybuf_next(>writeback_keys); - if (!w) - break; - - BUG_ON(ptr_stale(dc->disk.c, >key, 0)); - - if (KEY_START(>key) != dc->last_read || - jiffies_to_msecs(delay) > 50) - while (!kthread_should_stop() && delay) - delay = schedule_timeout_interruptible(delay); - - dc->last_read = KEY_OFFSET(>key); - - io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec) -* DIV_ROUND_UP(KEY_SIZE(>key), PAGE_SECTORS), -GFP_KERNEL); - if (!io) - goto err; - - w->private = io; - io->dc = dc; - - dirty_init(w); - bio_set_op_attrs(>bio, REQ_OP_READ, 0); - io->bio.bi_iter.bi_sector = PTR_OFFSET(>key, 0); - bio_set_dev(>bio, PTR_CACHE(dc->disk.c, >key, 0)->bdev); - io->bio.bi_end_io = read_dirty_endio; - - if (bch_bio_alloc_pages(>bio, GFP_KERNEL)) - goto err_free; - - trace_bcache_writeback(>key); + next = bch_keybuf_next(>writeback_keys); + + while (!kthread_should_stop() && next) { + size = 0; + nk = 0; + + do { + BUG_ON(ptr_stale(dc->disk.c, >key, 0)); + + /* +* Don't combine too many operations, even if they +* are all small. +*/ + if (nk >= MAX_WRITEBACKS_IN_PASS) + break; + + /* +* If the current operation is very large, don't +* further combine operations. +*/ + if (size >= MAX_WRITESIZE_IN_PASS) + break; + + /* +* Operations are only eligible to be combined +* if they are contiguous. +* +* TODO: add a heuristic willing to fire a +* certain amount of non-contiguous IO per pass, +* so that we can benefit from backing device +* command queueing. +*/ + if ((nk != 0) && bkey_cmp([nk-1]->key, +
[416 PATCH 04/13] bcache: segregate flash only volume write streams
From: Tang Junhui <tang.jun...@zte.com.cn>

In a scenario with some flash only volumes and some cached devices, when many tasks request these devices in writeback mode, the write IOs may fall into the same bucket as below:

| cached data | flash data | cached data | cached data | flash data |

Then, after writeback of these cached devices, the bucket would look like:

bucket: | free | flash data | free | free | flash data |

So there is much free space in this bucket, but since data of the flash only volumes still exists, this bucket cannot be reclaimed, which wastes bucket space.

In this patch, we segregate flash only volume write streams from cached devices, so data from flash only volumes and cached devices is stored in different buckets.

Compared to the v1 patch, this patch does not add an additional open bucket list; it makes a best effort to segregate flash only volume write streams from cached devices. Sectors of flash only volumes may still be mixed with dirty sectors of a cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Signed-off-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/alloc.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c index a0cc1bc6d884..6cc6c0f9c3a9 100644 --- a/drivers/md/bcache/alloc.c +++ b/drivers/md/bcache/alloc.c @@ -525,15 +525,21 @@ struct open_bucket { /* * We keep multiple buckets open for writes, and try to segregate different - * write streams for better cache utilization: first we look for a bucket where - * the last write to it was sequential with the current write, and failing that - * we look for a bucket that was last used by the same task.
+ * write streams for better cache utilization: first we try to segregate flash + * only volume write streams from cached devices, secondly we look for a bucket + * where the last write to it was sequential with the current write, and + * failing that we look for a bucket that was last used by the same task. * * The ideas is if you've got multiple tasks pulling data into the cache at the * same time, you'll get better cache utilization if you try to segregate their * data and preserve locality. * - * For example, say you've starting Firefox at the same time you're copying a + * For example, dirty sectors of flash only volume is not reclaimable, if their + * dirty sectors mixed with dirty sectors of cached device, such buckets will + * be marked as dirty and won't be reclaimed, though the dirty data of cached + * device have been written back to backend device. + * + * And say you've starting Firefox at the same time you're copying a * bunch of files. Firefox will likely end up being fairly hot and stay in the * cache awhile, but the data you copied might not be; if you wrote all that * data to the same buckets it'd get invalidated at the same time. @@ -550,7 +556,10 @@ static struct open_bucket *pick_data_bucket(struct cache_set *c, struct open_bucket *ret, *ret_task = NULL; list_for_each_entry_reverse(ret, >data_buckets, list) - if (!bkey_cmp(>key, search)) + if (UUID_FLASH_ONLY(>uuids[KEY_INODE(>key)]) != + UUID_FLASH_ONLY(>uuids[KEY_INODE(search)])) + continue; + else if (!bkey_cmp(>key, search)) goto found; else if (ret->last_write_point == write_point) ret_task = ret; -- 2.14.1
[416 PATCH 07/13] bcache: allow quick writeback when backing idle
If the control system would wait for at least half a second, and there's been no reqs hitting the backing disk for awhile: use an alternate mode where we have at most one contiguous set of writebacks in flight at a time. (But don't otherwise delay). If front-end IO appears, it will still be quick, as it will only have to contend with one real operation in flight. But otherwise, we'll be sending data to the backing disk as quickly as it can accept it (with one op at a time). Signed-off-by: Michael Lyle <ml...@lyle.org> Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn> Acked-by: Coly Li <col...@suse.de> --- drivers/md/bcache/bcache.h| 7 +++ drivers/md/bcache/request.c | 1 + drivers/md/bcache/writeback.c | 21 + 3 files changed, 29 insertions(+) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 3be0fcc19b1f..5f7b0b2513cc 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -320,6 +320,13 @@ struct cached_dev { */ atomic_thas_dirty; + /* +* Set to zero by things that touch the backing volume-- except +* writeback. Incremented by writeback. Used to determine when to +* accelerate idle writeback. 
+*/ + atomic_tbacking_idle; + struct bch_ratelimitwriteback_rate; struct delayed_work writeback_rate_update; diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 52b4ce24f9e2..ddd941056f3c 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -996,6 +996,7 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q, struct cached_dev *dc = container_of(d, struct cached_dev, disk); int rw = bio_data_dir(bio); + atomic_set(>backing_idle, 0); generic_start_io_acct(q, rw, bio_sectors(bio), >disk->part0); bio_set_dev(bio, dc->bdev); diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 6e1d2fde43df..f82ffb2e9b9b 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -356,6 +356,27 @@ static void read_dirty(struct cached_dev *dc) delay = writeback_delay(dc, size); + /* If the control system would wait for at least half a +* second, and there's been no reqs hitting the backing disk +* for awhile: use an alternate mode where we have at most +* one contiguous set of writebacks in flight at a time. If +* someone wants to do IO it will be quick, as it will only +* have to contend with one operation in flight, and we'll +* be round-tripping data to the backing disk as quickly as +* it can accept it. +*/ + if (delay >= HZ / 2) { + /* 3 means at least 1.5 seconds, up to 7.5 if we +* have slowed way down. +*/ + if (atomic_inc_return(>backing_idle) >= 3) { + /* Wait for current I/Os to finish */ + closure_sync(); + /* And immediately launch a new set. */ + delay = 0; + } + } + while (!kthread_should_stop() && delay) { schedule_timeout_interruptible(delay); delay = writeback_delay(dc, 0); -- 2.14.1
[416 PATCH 06/13] bcache: writeback: properly order backing device IO
Writeback keys are presently iterated and dispatched for writeback in order of the logical block address on the backing device. Multiple keys may be read from the cache device in parallel and then written back (especially when there is contiguous I/O). However, there was no guarantee with the existing code that the writes would be issued in LBA order, as the reads from the cache device are often re-ordered. In turn, when writing back quickly, the backing disk often has to seek backwards, which slows writeback and increases utilization.

This patch introduces an ordering mechanism that guarantees that the original order of issue is maintained for the write portion of the I/O. Performance for writeback is significantly improved when there are multiple contiguous keys or high writeback rates.

Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
Tested-by: Tang Junhui <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/bcache.h| 8
 drivers/md/bcache/writeback.c | 29 +
 2 files changed, 37 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 1784e50eb857..3be0fcc19b1f 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -330,6 +330,14 @@ struct cached_dev { struct keybuf writeback_keys; + /* +* Order the write-half of writeback operations strongly in dispatch +* order. (Maintain LBA order; don't allow reads completing out of +* order to re-order the writes...)
+*/ + struct closure_waitlist writeback_ordering_wait; + atomic_twriteback_sequence_next; + /* For tracking sequential IO */ #define RECENT_IO_BITS 7 #define RECENT_IO (1 << RECENT_IO_BITS) diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 479095987f22..6e1d2fde43df 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -116,6 +116,7 @@ static unsigned writeback_delay(struct cached_dev *dc, unsigned sectors) struct dirty_io { struct closure cl; struct cached_dev *dc; + uint16_tsequence; struct bio bio; }; @@ -194,6 +195,27 @@ static void write_dirty(struct closure *cl) { struct dirty_io *io = container_of(cl, struct dirty_io, cl); struct keybuf_key *w = io->bio.bi_private; + struct cached_dev *dc = io->dc; + + uint16_t next_sequence; + + if (atomic_read(>writeback_sequence_next) != io->sequence) { + /* Not our turn to write; wait for a write to complete */ + closure_wait(>writeback_ordering_wait, cl); + + if (atomic_read(>writeback_sequence_next) == io->sequence) { + /* +* Edge case-- it happened in indeterminate order +* relative to when we were added to wait list.. +*/ + closure_wake_up(>writeback_ordering_wait); + } + + continue_at(cl, write_dirty, io->dc->writeback_write_wq); + return; + } + + next_sequence = io->sequence + 1; /* * IO errors are signalled using the dirty bit on the key. 
@@ -211,6 +233,9 @@ static void write_dirty(struct closure *cl) closure_bio_submit(>bio, cl); } + atomic_set(>writeback_sequence_next, next_sequence); + closure_wake_up(>writeback_ordering_wait); + continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq); } @@ -242,7 +267,10 @@ static void read_dirty(struct cached_dev *dc) int nk, i; struct dirty_io *io; struct closure cl; + uint16_t sequence = 0; + BUG_ON(!llist_empty(>writeback_ordering_wait.list)); + atomic_set(>writeback_sequence_next, sequence); closure_init_stack(); /* @@ -303,6 +331,7 @@ static void read_dirty(struct cached_dev *dc) w->private = io; io->dc = dc; + io->sequence= sequence++; dirty_init(w); bio_set_op_attrs(>bio, REQ_OP_READ, 0); -- 2.14.1
[416 PATCH 01/13] bcache: ret IOERR when read meets metadata error
From: Rui Hua <huarui@gmail.com>

A read request may hit an error while searching the btree, but the error was not handled in cache_lookup(), so this kind of metadata failure did not go through cached_dev_read_error(), and the upper layer received bi_status=0.

In this patch we detect the metadata error from the return value of bch_btree_map_keys(); two potential paths give rise to the error:

1. Because the btree is not entirely cached in memory, we may get an error when reading a btree node from the cache device (see bch_btree_node_get()); the likely errnos are -EIO and -ENOMEM.

2. When a read miss happens, bch_btree_insert_check_key() is called to insert a "replace_key" into the btree (see cached_dev_cache_miss(); this is just preparatory work before inserting the missed data into the cache device). A failure can also happen in this situation; the likely errno is -ENOMEM.

bch_btree_map_keys() returns MAP_DONE in the normal scenario, but we will get either -EIO or -ENOMEM in the above two cases. If that happens, we should NOT recover data from the backing device (when the cache device is dirty) because we don't know whether the bkeys covered by the read request are all clean. And since s->iop.status still has its initial value (0) before we submit s->bio.bio, we set it to BLK_STS_IOERR so the request goes into cached_dev_read_error(), where the error can finally be passed to the upper layer, or recovered by rereading from the backing device.
[edit by mlyle: patch formatting, word-wrap, comment spelling, commit log format] Signed-off-by: Hua Rui <huarui@gmail.com> Reviewed-by: Michael Lyle <ml...@lyle.org> Signed-off-by: Michael Lyle <ml...@lyle.org> --- drivers/md/bcache/request.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index c493fb947dc9..52b4ce24f9e2 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -576,6 +576,7 @@ static void cache_lookup(struct closure *cl) { struct search *s = container_of(cl, struct search, iop.cl); struct bio *bio = >bio.bio; + struct cached_dev *dc; int ret; bch_btree_op_init(>op, -1); @@ -588,6 +589,27 @@ static void cache_lookup(struct closure *cl) return; } + /* +* We might meet err when searching the btree, If that happens, we will +* get negative ret, in this scenario we should not recover data from +* backing device (when cache device is dirty) because we don't know +* whether bkeys the read request covered are all clean. +* +* And after that happened, s->iop.status is still its initial value +* before we submit s->bio.bio +*/ + if (ret < 0) { + BUG_ON(ret == -EINTR); + if (s->d && s->d->c && + !UUID_FLASH_ONLY(>d->c->uuids[s->d->id])) { + dc = container_of(s->d, struct cached_dev, disk); + if (dc && atomic_read(>has_dirty)) + s->recoverable = false; + } + if (!s->iop.status) + s->iop.status = BLK_STS_IOERR; + } + closure_return(cl); } -- 2.14.1
[416 PATCH 00/13] Bcache changes for 4.16
Jens,

Please pick up the following reviewed changes for 4.16. There are some small cleanliness changes, a few minor bug fixes (some in preparation for larger work), and ongoing work on writeback performance:

 11 files changed, 303 insertions(+), 133 deletions(-)

Coly Li (2):
  bcache: reduce cache_set devices iteration by devices_max_used
  bcache: fix misleading error message in bch_count_io_errors()

Kent Overstreet (2):
  bcache: Fix, improve efficiency of closure_sync()
  bcache: mark closure_sync() __sched

Michael Lyle (3):
  bcache: writeback: properly order backing device IO
  bcache: allow quick writeback when backing idle
  bcache: fix writeback target calc on large devices

Rui Hua (1):
  bcache: ret IOERR when read meets metadata error

Tang Junhui (3):
  bcache: stop writeback thread after detaching
  bcache: segregate flash only volume write streams
  bcache: fix wrong return value in bch_debug_init()

Vasyl Gomonovych (1):
  bcache: Use PTR_ERR_OR_ZERO()

Zhai Zhaoxuan (1):
  bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()

Thanks,

Mike
[416 PATCH 03/13] bcache: Use PTR_ERR_OR_ZERO()
From: Vasyl Gomonovych <gomonov...@gmail.com>

Fix ptr_ret.cocci warnings:
drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Signed-off-by: Vasyl Gomonovych <gomonov...@gmail.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/btree.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ebb1874218e7..9e30713dbdb8 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1804,10 +1804,7 @@ static int bch_gc_thread(void *arg)
 int bch_gc_thread_start(struct cache_set *c)
 {
 	c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
-	if (IS_ERR(c->gc_thread))
-		return PTR_ERR(c->gc_thread);
-
-	return 0;
+	return PTR_ERR_OR_ZERO(c->gc_thread);
 }
 
 /* Initial partial gc */
-- 
2.14.1