Re: [PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Michael Lyle
On 04/19/2018 11:10 AM, Jens Axboe wrote:
> On 4/19/18 11:59 AM, Michael Lyle wrote:
>> Too much to do with other projects.  I've enjoyed working with everyone
>> here, and hope to occasionally contribute on bcache.

Jens---

> It's been a pleasure having someone maintain it upstream, so this is
> pretty sad.

I've really enjoyed working with everyone here.  The queue keeps
stacking up and I feel pretty bad when people need to wait a couple
weeks for reviews and assistance.

> Do you have anyone in mind for taking over the upstream part? The
> community seems fairly active, and there are regular contributors.
> 
> Coly and Tang Junhui come to mind.

I think Coly would be a great maintainer.

Mike


Re: extra part in bcache patch commit 539d39eb2708

2018-04-19 Thread Michael Lyle
Hi everyone--

On Wed, Apr 18, 2018 at 10:17 PM, Coly Li  wrote:
> Hi Michael and Jens
>
> When I do back port of bcache patches, I find commit 539d39eb2708
> ("bcache: fix wrong return value in bch_debug_init()") has extra part
> from the original patch which Junhui Tanng posted.

Sorry, I must have messed up a rebase.  The patch it was combined with is here:
https://www.spinics.net/lists/linux-bcache/msg05464.html  and was
properly signed off and reviewed.

Mike


[PATCH] MAINTAINERS: Remove me as maintainer of bcache

2018-04-19 Thread Michael Lyle
Too much to do with other projects.  I've enjoyed working with everyone
here, and hope to occasionally contribute on bcache.

Signed-off-by: Michael Lyle <ml...@lyle.org>
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index f9bf4651db41..9fd861906832 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2617,7 +2617,6 @@ S:Maintained
 F: drivers/net/hamradio/baycom*
 
 BCACHE (BLOCK LAYER CACHE)
-M: Michael Lyle <ml...@lyle.org>
 M: Kent Overstreet <kent.overstr...@gmail.com>
 L: linux-bca...@vger.kernel.org
 W: http://bcache.evilpiepirate.org
-- 
2.17.0



Re: bcache and hibernation

2018-04-05 Thread Michael Lyle
On Thu, Apr 5, 2018 at 12:51 PM, Nikolaus Rath  wrote:
> Hi Michael,
>
> Could you explain why this isn't a problem with writethrough? It seems
> to me that the trouble happens when the hibernation image is *read*, so
> why does it matter what kind of write caching is used?

With writethrough you can set up your loader to read it directly from
the backing device-- i.e. you don't need the cache, and there are at
least some valid configurations; with writeback some of the extents
may be on the cache device, so...

That said, it's not really great to put swap/hibernate on a cache
device... the workloads don't usually benefit much from tiering (since
they tend to be write-once-read-never or write-once-read-once).

>> I am unaware of a mechanism to prohibit this in the kernel-- to say that
>> a given type of block provider can't be involved in a resume operation.
>> Most documentation for hibernation explicitly cautions about the btrfs
>> situation, but use of bcache is less common and as a result generally
>> isn't covered.
>
> Could you maybe add a warning to Documentation/bcache.txt? I think this
> would have saved me.

Yah, I can look at that.

>
> Best,
> -Nikolaus

Mike


Re: bcache and hibernation

2018-04-05 Thread Michael Lyle
Hi Nikolaus (and everyone else),

Sorry I've been slow in responding.  I probably need to step down as
bcache maintainer because so many other things have competed for my time
lately and I've fallen behind on both patches and mailing list.

On 04/05/2018 01:51 AM, Nikolaus Rath wrote:
> Is there a way to prevent this from happening? Could e.g. the kernel detect 
> that the swap device is (indirectly) on bcache and refuse to hibernate? Or 
> is there a way to do a "true" read-only mount of a bcache volume so that one 
> can safely resume from it?

I think you're correct.  If you're using bcache in writeback mode, it is
not safe to hibernate there, because some of the blocks involved in the
resume can end up in cache (and dependency issues, like you mention).
There are similar cautions/problems with btrfs.

I am unaware of a mechanism to prohibit this in the kernel-- to say that
a given type of block provider can't be involved in a resume operation.
Most documentation for hibernation explicitly cautions about the btrfs
situation, but use of bcache is less common and as a result generally
isn't covered.

> Best,
> -Nikolaus
Mike


Re: [for-4.17 20/20] bcache: Fix a compiler warning in bcache_device_init()

2018-03-21 Thread Michael Lyle
On 03/18/2018 06:32 PM, Coly Li wrote:
> On 19/03/2018 8:36 AM, Michael Lyle wrote:
>> From: Bart Van Assche <bart.vanass...@wdc.com>
>>
>> Avoid that building with W=1 triggers the following compiler warning:
>>
>> drivers/md/bcache/super.c:776:20: warning: comparison is always false due to 
>> limited range of data type [-Wtype-limits]
>>   d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
>> ^
>>
>> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
>> Reviewed-by: Michael Lyle <ml...@lyle.org>
> 
> There is a missing Reviewed-by: Coly Li <col...@suse.de> from me :-)

Hi Coly--- sorry that I lost these.

Mike


[for-4.17 16/20] bcache: Fix kernel-doc warnings

2018-03-18 Thread Michael Lyle
From: Bart Van Assche <bart.vanass...@wdc.com>

Avoid that building with W=1 triggers warnings about the kernel-doc
headers.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/btree.c   |  2 +-
 drivers/md/bcache/closure.c |  8 
 drivers/md/bcache/request.c |  1 +
 drivers/md/bcache/util.c| 18 --
 4 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 143ed5a758e7..17936b2dc7d6 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -962,7 +962,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct 
btree_op *op,
return b;
 }
 
-/**
+/*
  * bch_btree_node_get - find a btree node in the cache and lock it, reading it
  * in from disk if necessary.
  *
diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index c0949c9f843b..0e14969182c6 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -46,7 +46,7 @@ void closure_sub(struct closure *cl, int v)
 }
 EXPORT_SYMBOL(closure_sub);
 
-/**
+/*
  * closure_put - decrement a closure's refcount
  */
 void closure_put(struct closure *cl)
@@ -55,7 +55,7 @@ void closure_put(struct closure *cl)
 }
 EXPORT_SYMBOL(closure_put);
 
-/**
+/*
  * closure_wake_up - wake up all closures on a wait list, without memory 
barrier
  */
 void __closure_wake_up(struct closure_waitlist *wait_list)
@@ -79,9 +79,9 @@ EXPORT_SYMBOL(__closure_wake_up);
 
 /**
  * closure_wait - add a closure to a waitlist
- *
- * @waitlist will own a ref on @cl, which will be released when
+ * @waitlist: will own a ref on @cl, which will be released when
  * closure_wake_up() is called on @waitlist.
+ * @cl: closure pointer.
  *
  */
 bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl)
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 5a82237c7025..a65e3365eeb9 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -296,6 +296,7 @@ static void bch_data_insert_start(struct closure *cl)
 
 /**
  * bch_data_insert - stick some data in the cache
+ * @cl: closure pointer.
  *
  * This is the starting point for any data to end up in a cache device; it 
could
  * be from a normal write, or a writeback write, or a write to a flash only
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 6198041f0ee2..74febd5230df 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -82,10 +82,9 @@ STRTO_H(strtoll, long long)
 STRTO_H(strtoull, unsigned long long)
 
 /**
- * bch_hprint() - formats @v to human readable string for sysfs.
- *
- * @v - signed 64 bit integer
- * @buf - the (at least 8 byte) buffer to format the result into.
+ * bch_hprint - formats @v to human readable string for sysfs.
+ * @buf: the (at least 8 byte) buffer to format the result into.
+ * @v: signed 64 bit integer
  *
  * Returns the number of bytes used by format.
  */
@@ -225,13 +224,12 @@ void bch_time_stats_update(struct time_stats *stats, 
uint64_t start_time)
 }
 
 /**
- * bch_next_delay() - increment @d by the amount of work done, and return how
- * long to delay until the next time to do some work.
+ * bch_next_delay() - update ratelimiting statistics and calculate next delay
+ * @d: the struct bch_ratelimit to update
+ * @done: the amount of work done, in arbitrary units
  *
- * @d - the struct bch_ratelimit to update
- * @done - the amount of work done, in arbitrary units
- *
- * Returns the amount of time to delay by, in jiffies
+ * Increment @d by the amount of work done, and return how long to delay in
+ * jiffies until the next time to do some work.
  */
 uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
 {
-- 
2.14.1



[for-4.17 03/20] bcache: stop dc->writeback_rate_update properly

2018-03-18 Thread Michael Lyle
From: Coly Li <col...@suse.de>

struct delayed_work writeback_rate_update in struct cached_dev is a delayed
worker that periodically calls update_writeback_rate() (the interval is
defined by dc->writeback_rate_update_seconds).

When a metadata I/O error happens on the cache device, the bcache error
handling routine bch_cache_set_error() will call bch_cache_set_unregister()
to retire the whole cache set. On the unregister code path, this delayed
work is stopped by calling
cancel_delayed_work_sync(&dc->writeback_rate_update).

dc->writeback_rate_update is a delayed work different from others in
bcache: in its routine update_writeback_rate(), the delayed work re-arms
itself. That means that even after cancel_delayed_work_sync() returns, this
delayed work can still be executed after the several seconds defined by
dc->writeback_rate_update_seconds.

The problem is, after cancel_delayed_work_sync() returns, the cache set
unregister code path will continue and release the memory of struct
cache_set. When the delayed work is then scheduled to run,
__update_writeback_rate() will reference the already released cache_set
memory and trigger a NULL pointer dereference fault.

This patch introduces two more bcache device flags:
- BCACHE_DEV_WB_RUNNING
  bit set:   bcache device is in writeback mode and running; it is OK for
dc->writeback_rate_update to re-arm itself.
  bit clear: bcache device is trying to stop dc->writeback_rate_update;
this delayed work should not re-arm itself and should quit.
- BCACHE_DEV_RATE_DW_RUNNING
  bit set:   routine update_writeback_rate() is executing.
  bit clear: routine update_writeback_rate() has quit.

This patch also adds a function cancel_writeback_rate_update_dwork() which
waits for dc->writeback_rate_update to quit before canceling it with
cancel_delayed_work_sync(). In order to avoid a deadlock caused by an
unexpected quit of dc->writeback_rate_update, after time_out seconds this
function will give up and continue to call cancel_delayed_work_sync().

And here I explain how this patch properly stops the self re-armed delayed
work with the above flags.

update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning
and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling
cancel_writeback_rate_update_dwork(), clear flag BCACHE_DEV_WB_RUNNING.

Before calling cancel_delayed_work_sync(), wait until flag
BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling
cancel_delayed_work_sync(), dc->writeback_rate_update must either already
be re-armed, or have quit after seeing BCACHE_DEV_WB_RUNNING cleared. In
both cases the delayed work routine update_writeback_rate() won't be
executed after cancel_delayed_work_sync() returns.

Inside update_writeback_rate(), flag BCACHE_DEV_WB_RUNNING is checked
before calling schedule_delayed_work(). If this flag is cleared, it means
someone is about to stop the delayed work. Because flag
BCACHE_DEV_RATE_DW_RUNNING is already set and cancel_delayed_work_sync()
has to wait for this flag to be cleared, we don't need to worry about a
race condition here.

If update_writeback_rate() is scheduled to run after checking
BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync()
in cancel_writeback_rate_update_dwork(), it is also safe, because at this
moment BCACHE_DEV_WB_RUNNING has been cleared with a memory barrier. As I
mentioned previously, update_writeback_rate() will see that
BCACHE_DEV_WB_RUNNING is clear and quit immediately.

Because update_writeback_rate() has more dependencies on struct cache_set
memory, dc->writeback_rate_update is not a simple self re-arming delayed
work. After trying many different methods (e.g. holding dc->count, or
using locks), this is the only way I found that properly stops the
dc->writeback_rate_update delayed work.
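The two-flag handshake described above can be sketched in plain C. This is a hypothetical single-threaded userspace model, not the kernel code: the names mirror the patch's flags, but the real delayed-work machinery, SMP memory ordering, and the time_out fallback are not modeled here.

```c
/* Hypothetical model of the BCACHE_DEV_WB_RUNNING /
 * BCACHE_DEV_RATE_DW_RUNNING handshake; illustrative names only. */
#include <assert.h>
#include <stdbool.h>

enum { DEV_WB_RUNNING = 1 << 0, DEV_RATE_DW_RUNNING = 1 << 1 };

struct dev_model {
    unsigned flags;
    int rearmed;   /* counts schedule_delayed_work() calls in the model */
};

/* Models update_writeback_rate(): set RATE_DW_RUNNING, re-arm only while
 * WB_RUNNING is still set, then clear RATE_DW_RUNNING. */
static void update_writeback_rate_model(struct dev_model *d)
{
    d->flags |= DEV_RATE_DW_RUNNING;
    if (d->flags & DEV_WB_RUNNING)
        d->rearmed++;            /* would call schedule_delayed_work() */
    d->flags &= ~DEV_RATE_DW_RUNNING;
}

/* Models cancel_writeback_rate_update_dwork(): clear WB_RUNNING first,
 * then wait for RATE_DW_RUNNING to drop before the final
 * cancel_delayed_work_sync(). Single-threaded, so no real waiting. */
static bool cancel_dwork_model(struct dev_model *d)
{
    d->flags &= ~DEV_WB_RUNNING;
    return (d->flags & DEV_RATE_DW_RUNNING) == 0;
}
```

A run of the model shows the property the patch relies on: once WB_RUNNING is cleared, a straggling execution of the work routine no longer re-arms itself.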

Changelog:
v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING
to bit index, for test_bit().
v2: Try to fix the race issue which is pointed out by Junhui.
v1: The initial version for review

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Junhui Tang <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Michael Lyle <ml...@lyle.org>
Cc: Hannes Reinecke <h...@suse.com>
---
 drivers/md/bcache/bcache.h|  9 +
 drivers/md/bcache/super.c | 38 ++
 drivers/md/bcache/sysfs.c |  3 ++-
 drivers/md/bcache/writeback.c | 29 -
 4 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 12e5197f186c..b5ddb848cd31 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -258,10 +258,11 @@ struct bcache_device {
struct gendisk  *disk;
 
unsigned long   flags;
-#define BCACHE_DEV_CLOSING 0
-#define BCACHE_DEV_DETACHING   1
-#define BCACHE_DEV_UNLINK_DONE 2
-
+#define BCACHE_DEV_CLOSING 0
+#define BCACHE_DEV_

[for-4.17 20/20] bcache: Fix a compiler warning in bcache_device_init()

2018-03-18 Thread Michael Lyle
From: Bart Van Assche <bart.vanass...@wdc.com>

Avoid that building with W=1 triggers the following compiler warning:

drivers/md/bcache/super.c:776:20: warning: comparison is always false due to 
limited range of data type [-Wtype-limits]
  d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
^

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/super.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 640ff8072ed8..d90d9e59ca00 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -778,6 +778,8 @@ static int bcache_device_init(struct bcache_device *d, 
unsigned block_size,
  sector_t sectors)
 {
struct request_queue *q;
+   const size_t max_stripes = min_t(size_t, INT_MAX,
+SIZE_MAX / sizeof(atomic_t));
size_t n;
int idx;
 
@@ -786,9 +788,7 @@ static int bcache_device_init(struct bcache_device *d, 
unsigned block_size,
 
d->nr_stripes = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
 
-   if (!d->nr_stripes ||
-   d->nr_stripes > INT_MAX ||
-   d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
+   if (!d->nr_stripes || d->nr_stripes > max_stripes) {
pr_err("nr_stripes too large or invalid: %u (start sector 
beyond end of disk?)",
(unsigned)d->nr_stripes);
return -ENOMEM;
-- 
2.14.1
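The rewritten bound can be checked in isolation. Below is a hypothetical userspace sketch of the clamp: min_t collapses the two comparisons (one of which was always false on 64-bit) into a single unsigned bound. `atomic_t` is stood in by `int`, and the function name is illustrative, not the kernel's.

```c
/* Models the new bcache_device_init() check: a single bound replaces the
 * pair of comparisons that triggered -Wtype-limits. */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool nr_stripes_valid(size_t nr_stripes)
{
    /* min_t(size_t, INT_MAX, SIZE_MAX / sizeof(atomic_t)) */
    const size_t max_stripes = (size_t)INT_MAX < SIZE_MAX / sizeof(int)
                                   ? (size_t)INT_MAX
                                   : SIZE_MAX / sizeof(int);

    return nr_stripes != 0 && nr_stripes <= max_stripes;
}
```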



[for-4.17 19/20] bcache: Reduce the number of sparse complaints about lock imbalances

2018-03-18 Thread Michael Lyle
From: Bart Van Assche <bart.vanass...@wdc.com>

Add more annotations for sparse to inform it about which functions do
not have the same number of spin_lock() and spin_unlock() calls.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/journal.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index acd0e5c074dd..18f1b5239620 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -594,6 +594,7 @@ static void journal_write_done(struct closure *cl)
 }
 
 static void journal_write_unlock(struct closure *cl)
   __releases(&c->journal.lock)
 {
struct cache_set *c = container_of(cl, struct cache_set, journal.io);
 
@@ -705,6 +706,7 @@ static void journal_try_write(struct cache_set *c)
 
 static struct journal_write *journal_wait_for_write(struct cache_set *c,
unsigned nkeys)
   __acquires(&c->journal.lock)
 {
size_t sectors;
struct closure cl;
-- 
2.14.1



[for-4.17 15/20] bcache: Annotate switch fall-through

2018-03-18 Thread Michael Lyle
From: Bart Van Assche <bart.vanass...@wdc.com>

This patch avoids that building with W=1 triggers complaints about
switch fall-throughs.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/util.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index a23cd6a14b74..6198041f0ee2 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -32,20 +32,27 @@ int bch_ ## name ## _h(const char *cp, type *res)   
\
case 'y':   \
case 'z':   \
u++;\
+   /* fall through */  \
case 'e':   \
u++;\
+   /* fall through */  \
case 'p':   \
u++;\
+   /* fall through */  \
case 't':   \
u++;\
+   /* fall through */  \
case 'g':   \
u++;\
+   /* fall through */  \
case 'm':   \
u++;\
+   /* fall through */  \
case 'k':   \
u++;\
if (e++ == cp)  \
return -EINVAL; \
+   /* fall through */  \
case '\n':  \
case '\0':  \
if (*e == '\n') \
-- 
2.14.1



[for-4.17 02/20] bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set

2018-03-18 Thread Michael Lyle
From: Coly Li <col...@suse.de>

In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
cached_dev_get() is called when creating dc->writeback_thread, and
cached_dev_put() is called when exiting dc->writeback_thread. This
modification works well unless people detach the bcache device manually by
'echo 1 > /sys/block/bcache/bcache/detach'
because this sysfs interface only calls bch_cached_dev_detach(), which
wakes up dc->writeback_thread but does not stop it. The reason is that
before patch "bcache: fix cached_dev->count usage for
bch_cache_set_error()", inside bch_writeback_thread(), if the cache was
not dirty after writeback, cached_dev_put() was called there. And in
cached_dev_make_request(), when a new write request turned the cache from
clean to dirty, cached_dev_get() was called there. Since we no longer
operate on dc->count in these locations, the refcount dc->count cannot be
dropped after the cache becomes clean, and cached_dev_detach_finish()
won't be called to detach the bcache device.

This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
set inside bch_writeback_thread(). If this bit is set and the cache is
clean (no existing writeback_keys), break the while-loop, call
cached_dev_put() and quit the writeback thread.

Please note that if the cache is still dirty, the writeback thread should
continue to perform writeback even when BCACHE_DEV_DETACHING is set; this
is the original design of manual detach.

It is safe to do the following check without locking; let me explain why.
+   if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
+   (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {

If the kernel thread does not sleep and continues to run because the
conditions are not updated in time on the running CPU core, it just
consumes more CPU cycles and does no harm. This should-sleep-but-run case
is safe here. We just focus on the should-run-but-sleep condition, which
means the writeback thread goes to sleep by mistake while it should
continue to run.
1, First of all, no matter whether the writeback thread is hung or not,
   kthread_stop() from cached_dev_detach_finish() will wake it up and
   terminate it by making kthread_should_stop() return true. And in normal
   run time, the bit at index BCACHE_DEV_DETACHING is always cleared, so
   the condition
	!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)
   is always true and can be ignored as a constant value.
2, If one of the following conditions is true, the writeback thread should
   go to sleep:
   "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running"
   each of them independently controls whether the writeback thread should
   sleep; let's analyse them one by one.
2.1 condition "!atomic_read(&dc->has_dirty)"
   If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will
   call bch_writeback_queue() immediately or call bch_writeback_add() which
   indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
   wake_up_process(dc->writeback_thread) is called. It sets the writeback
   thread's task state to TASK_RUNNING and, following an implicit memory
   barrier, tries to wake up the writeback thread.
   In the writeback thread, the task state is set to TASK_INTERRUPTIBLE
   before doing the condition check. If the other CPU core sets the
   TASK_RUNNING state after the writeback thread sets TASK_INTERRUPTIBLE,
   the writeback thread will be scheduled to run very soon because its
   state is not TASK_INTERRUPTIBLE. If the other CPU core sets the
   TASK_RUNNING state before the writeback thread sets TASK_INTERRUPTIBLE,
   the implicit memory barrier of wake_up_process() will make sure the
   modification of dc->has_dirty on the other CPU core is updated and
   observed on the CPU core of the writeback thread. Therefore the
   condition check will correctly be false, and the writeback code will
   continue without sleeping.
2.2 condition "!dc->writeback_running"
   dc->writeback_running can be changed via a sysfs file, and every time it
   is modified, a bch_writeback_queue() call always follows. So the change
   is always observed on the CPU core of the writeback thread. If
   dc->writeback_running is changed from 0 to 1 on another CPU core, this
   condition check will observe the modification and allow the writeback
   thread to continue to run without sleeping.
Now we can see that even without locking protection, the multiple-condition
check is safe here; no deadlock or process hang-up will happen.
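The lockless condition itself is a pure predicate, which makes the sleep-or-run decision easy to state on its own. A hypothetical sketch (the function name is illustrative, not the kernel's):

```c
/* Models the check in bch_writeback_thread():
 *   sleep iff !BCACHE_DEV_DETACHING && (!has_dirty || !writeback_running)
 * When BCACHE_DEV_DETACHING is set the thread stays awake so it can
 * notice the detach, drop its refcount and quit. */
#include <assert.h>
#include <stdbool.h>

static bool writeback_should_sleep(bool detaching, bool has_dirty,
                                   bool writeback_running)
{
    return !detaching && (!has_dirty || !writeback_running);
}
```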

I composed a separate patch because the patch "bcache: fix
cached_dev->count usage for bch_cache_set_error()" already got a
"Reviewed-by:" from Hannes Reinecke. Also, this fix is not trivial and is
good as a separate patch.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Hannes Reinecke <h...@suse.com>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/writeback.c 

[for-4.17 04/20] bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags

2018-03-18 Thread Michael Lyle
From: Coly Li <col...@suse.de>

When too many I/Os fail on the cache device, bch_cache_set_error() is
called in the error handling code path to retire the whole problematic
cache set. If new I/O requests continue to come in and take the refcount
dc->count, the cache set won't be retired immediately, which is a problem.

Furthermore, several kernel threads and self-armed kernel workers may
still be running after bch_cache_set_error() is called. It needs to wait
quite a while for them to stop, or they won't stop at all. They also
prevent the cache set from being retired.

The solution in this patch is to add a per-cache-set flag to disable I/O
requests on this cache set and all attached backing devices. Then newly
arriving I/O requests can be rejected in *_make_request() before taking a
refcount, and kernel threads and self-armed kernel workers can stop very
fast when the flag bit CACHE_SET_IO_DISABLE is set.

Because bcache also does internal I/O for writeback, garbage collection,
bucket allocation and journaling, this kind of I/O should be disabled
after bch_cache_set_error() is called as well. So closure_bio_submit() is
modified to check whether CACHE_SET_IO_DISABLE is set on cache_set->flags.
If set, closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and
return; generic_make_request() won't be called.

A sysfs interface is also added to set or clear the CACHE_SET_IO_DISABLE
bit in cache_set->flags, to disable or enable cache set I/O for debugging.
It is helpful for triggering more corner-case issues on a failed cache
device.
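The gating described above amounts to an early-out before submission. Below is a hypothetical userspace sketch of the shape of the check; the `*_model` names and the numeric error value are illustrative stand-ins (the real code uses test_bit(), BLK_STS_IOERR, and then ends the bio rather than just returning a status).

```c
/* Models the CACHE_SET_IO_DISABLE gate in closure_bio_submit(). */
#include <assert.h>
#include <stdbool.h>

#define CACHE_SET_IO_DISABLE_MODEL 3    /* bit index, as in the patch */
#define BLK_STS_IOERR_MODEL        10   /* stand-in for BLK_STS_IOERR */

struct cache_set_model { unsigned long flags; };
struct bio_model       { int bi_status; bool submitted; };

/* Refuse both external and internal I/O once the cache set is retiring. */
static void closure_bio_submit_model(struct cache_set_model *c,
                                     struct bio_model *bio)
{
    if (c->flags & (1UL << CACHE_SET_IO_DISABLE_MODEL)) {
        bio->bi_status = BLK_STS_IOERR_MODEL; /* would then end the bio */
        return;                               /* generic_make_request() skipped */
    }
    bio->submitted = true;                    /* would call generic_make_request() */
}
```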

Changelog
v4, add wait_for_kthread_stop(), and call it before exits writeback and gc
kernel threads.
v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index.
remove "bcache: " prefix when printing out kernel message.
v2, more changes by previous review,
- Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
- Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop; this
  is reported and inspired from the original patch of Pavel Vazharov.
v1, initial version.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
Cc: Michael Lyle <ml...@lyle.org>
Cc: Pavel Vazharov <frea...@gmail.com>
---
 drivers/md/bcache/alloc.c |  3 ++-
 drivers/md/bcache/bcache.h| 33 +
 drivers/md/bcache/btree.c | 11 ---
 drivers/md/bcache/io.c|  2 +-
 drivers/md/bcache/journal.c   |  4 ++--
 drivers/md/bcache/request.c   | 26 +++---
 drivers/md/bcache/super.c |  6 +-
 drivers/md/bcache/sysfs.c | 18 ++
 drivers/md/bcache/util.h  |  6 --
 drivers/md/bcache/writeback.c | 37 -
 10 files changed, 116 insertions(+), 30 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 458e1d38577d..004cc3cc6123 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -287,7 +287,8 @@ do {
\
break;  \
\
mutex_unlock(&(ca)->set->bucket_lock);  \
-   if (kthread_should_stop()) {\
+   if (kthread_should_stop() ||\
+   test_bit(CACHE_SET_IO_DISABLE, &ca->set->flags)) {  \
set_current_state(TASK_RUNNING);\
return 0;   \
}   \
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index b5ddb848cd31..8a0327581d62 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -188,6 +188,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bset.h"
 #include "util.h"
@@ -475,10 +476,15 @@ struct gc_stat {
  *
  * CACHE_SET_RUNNING means all cache devices have been registered and journal
  * replay is complete.
+ *
+ * CACHE_SET_IO_DISABLE is set when bcache is stopping the whold cache set, all
+ * external and internal I/O should be denied when this flag is set.
+ *
  */
 #define CACHE_SET_UNREGISTERING0
 #defineCACHE_SET_STOPPING  1
 #defineCACHE_SET_RUNNING   2
+#define CACHE_SET_IO_DISABLE   3
 
 struct cache_set {
struct closure  cl;
@@ -868,6 +874,33 @@ static inline void wake_up_allocators(struct cache_set *c)
wake_up_process(ca->alloc_thread);
 }
 
+static inline void closure_bio_submit(struct cache_set *c,
+ struct bio *bio,
+ 

[for-4.17 05/20] bcache: add stop_when_cache_set_failed option to backing device

2018-03-18 Thread Michael Lyle
From: Coly Li <col...@suse.de>

When there are too many I/O errors on the cache device, the current bcache
code will retire the whole cache set and detach all bcache devices. But
the detached bcache devices are not stopped, which is problematic when
bcache is in writeback mode.

If the retired cache set has dirty data of backing devices, continuing to
write to the bcache device will write to the backing device directly. If
the LBA of a write request has a dirty version cached on the cache device,
then next time the cache device is re-registered and the backing device
re-attached to it, the stale dirty data on the cache device will be
written to the backing device and overwrite the latest directly-written
data. This situation causes serious data corruption.

But we cannot simply stop all attached bcache devices when the cache set
is broken or disconnected. Consider, for example, using bcache to
accelerate the performance of an email service. In such a workload, if the
cache device is broken but no dirty data is lost, keeping the bcache
device alive and permitting the email service to continue to access user
data might be a better solution for the cache device failure.

Nix <n...@esperi.org.uk> points out the issue and provides the above
example to explain why it might be necessary not to stop the bcache device
for a broken cache device. Pavel Goran <via-bca...@pvgoran.name> provides
a brilliant suggestion to provide "always" and "auto" options for the
per-cached-device sysfs file stop_when_cache_set_failed. If the cache set
is retiring and the backing device has no dirty data on the cache, it
should be safe to keep the bcache device alive. In this case, if
stop_when_cache_set_failed is set to "auto", the device failure handling
code will not stop this bcache device, and applications can keep accessing
the backing device through the now-unattached bcache device.
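The "auto"/"always" policy reduces to a small decision function. A hypothetical sketch (the enum names mirror the patch but the `_MODEL` suffix marks them as illustrative; the real failure handling also has to determine the dirty-data state under the proper locks):

```c
/* Models the stop_when_cache_set_failed decision for a cached device. */
#include <assert.h>
#include <stdbool.h>

enum stop_on_failure_model {
    BCH_STOP_AUTO_MODEL = 0,   /* stop only when dirty data would be stranded */
    BCH_STOP_ALWAYS_MODEL,     /* stop the bcache device unconditionally */
};

/* Should the bcache device be stopped when its cache set is retired? */
static bool stop_on_cache_set_failure(enum stop_on_failure_model mode,
                                      bool has_dirty_data)
{
    if (mode == BCH_STOP_ALWAYS_MODEL)
        return true;
    /* "auto": with no dirty data it is safe to keep serving the backing dev */
    return has_dirty_data;
}
```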

Changelog:
[mlyle: edited to not break string constants across lines]
v3: fix typos pointed out by Nix.
v2: change option values of stop_when_cache_set_failed from 1/0 to
"auto"/"always".
v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1
(always stop).

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Signed-off-by: Michael Lyle <ml...@lyle.org>
Cc: Nix <n...@esperi.org.uk>
Cc: Pavel Goran <via-bca...@pvgoran.name>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
Cc: Hannes Reinecke <h...@suse.com>
---
 drivers/md/bcache/bcache.h |  9 ++
 drivers/md/bcache/super.c  | 78 --
 drivers/md/bcache/sysfs.c  | 17 ++
 3 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 8a0327581d62..5e9f3610c6fd 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -288,6 +288,12 @@ struct io {
sector_tlast;
 };
 
+enum stop_on_failure {
+   BCH_CACHED_DEV_STOP_AUTO = 0,
+   BCH_CACHED_DEV_STOP_ALWAYS,
+   BCH_CACHED_DEV_STOP_MODE_MAX,
+};
+
 struct cached_dev {
struct list_headlist;
struct bcache_devicedisk;
@@ -380,6 +386,8 @@ struct cached_dev {
unsignedwriteback_rate_i_term_inverse;
unsignedwriteback_rate_p_term_inverse;
unsignedwriteback_rate_minimum;
+
+   enum stop_on_failurestop_when_cache_set_failed;
 };
 
 enum alloc_reserve {
@@ -939,6 +947,7 @@ void bch_write_bdev_super(struct cached_dev *, struct 
closure *);
 
 extern struct workqueue_struct *bcache_wq;
 extern const char * const bch_cache_modes[];
+extern const char * const bch_stop_on_failure_modes[];
 extern struct mutex bch_register_lock;
 extern struct list_head bch_cache_sets;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index fea283e6389e..dab9f59a1a2f 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -47,6 +47,14 @@ const char * const bch_cache_modes[] = {
NULL
 };
 
+/* Default is -1; we skip past it for stop_when_cache_set_failed */
+const char * const bch_stop_on_failure_modes[] = {
+   "default",
+   "auto",
+   "always",
+   NULL
+};
+
 static struct kobject *bcache_kobj;
 struct mutex bch_register_lock;
 LIST_HEAD(bch_cache_sets);
@@ -1199,6 +1207,9 @@ static int cached_dev_init(struct cached_dev *dc, 
unsigned block_size)
max(dc->disk.disk->queue->backing_dev_info->ra_pages,
q->backing_dev_info->ra_pages);
 
+   /* default to auto */
+   dc->stop_when_cache_set_failed = BCH_CACHED_DEV_STOP_AUTO;
+
bch_cached_dev_request_init(dc);
bch_cached_dev_writeback_init(dc);
return 0;
@@ -1475,25 +1486,72 @@ static void cache_set_flush(struct closure *cl)
closure_return(cl);
 }
 
+/*
+ * This function is only called when CACHE_SE

[for-4.17 11/20] bcache: add backing_request_endio() for bi_end_io

2018-03-18 Thread Michael Lyle
From: Coly Li <col...@suse.de>

In order to catch I/O errors of the backing device, a separate bi_end_io
callback is required. Then a per-backing-device counter can record the
number of I/O errors and retire the backing device if the counter reaches
a per-backing-device I/O error limit.

This patch adds backing_request_endio() to the bcache backing device I/O
code path, as a preparation for more complicated backing device failure
handling. So far there is no real code logic change; I make this change a
separate patch to make sure it is stable and reliable for further work.

Changelog:
v2: Fix code comment typos, remove a redundant bch_writeback_add() line
added in the v4 patch set.
v1: indeed this is newly added in this patch set.

[mlyle: truncated commit subject]

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
Cc: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/request.c   | 91 ---
 drivers/md/bcache/super.c |  1 +
 drivers/md/bcache/writeback.c |  1 +
 3 files changed, 78 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 5c8ae69c8502..b4a5768afbe9 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl)
}
 
op->insert_data_done = true;
+   /* get in bch_data_insert() */
bio_put(bio);
 out:
continue_at(cl, bch_data_insert_keys, op->wq);
@@ -630,6 +631,38 @@ static void request_endio(struct bio *bio)
closure_put(cl);
 }
 
+static void backing_request_endio(struct bio *bio)
+{
+   struct closure *cl = bio->bi_private;
+
+   if (bio->bi_status) {
+   struct search *s = container_of(cl, struct search, cl);
+   /*
+* If a bio has REQ_PREFLUSH for writeback mode, it is
+* specially assembled in cached_dev_write() for a non-zero
+* write request which has REQ_PREFLUSH. We don't set
+* s->iop.status by this failure, the status will be decided
+* by result of bch_data_insert() operation.
+*/
+   if (unlikely(s->iop.writeback &&
+		     bio->bi_opf & REQ_PREFLUSH)) {
+   char buf[BDEVNAME_SIZE];
+
+   bio_devname(bio, buf);
+   pr_err("Can't flush %s: returned bi_status %i",
+   buf, bio->bi_status);
+   } else {
+   /* set to orig_bio->bi_status in bio_complete() */
+   s->iop.status = bio->bi_status;
+   }
+   s->recoverable = false;
+   /* should count I/O error for backing device here */
+   }
+
+   bio_put(bio);
+   closure_put(cl);
+}
+
 static void bio_complete(struct search *s)
 {
if (s->orig_bio) {
@@ -644,13 +677,21 @@ static void bio_complete(struct search *s)
}
 }
 
-static void do_bio_hook(struct search *s, struct bio *orig_bio)
+static void do_bio_hook(struct search *s,
+   struct bio *orig_bio,
+   bio_end_io_t *end_io_fn)
 {
	struct bio *bio = &s->bio.bio;
 
bio_init(bio, NULL, 0);
__bio_clone_fast(bio, orig_bio);
-   bio->bi_end_io  = request_endio;
+   /*
+* bi_end_io can be set separately somewhere else, e.g. the
+* variants in,
+* - cache_bio->bi_end_io from cached_dev_cache_miss()
+* - n->bi_end_io from cache_lookup_fn()
+*/
+   bio->bi_end_io  = end_io_fn;
	bio->bi_private = &s->cl;
 
bio_cnt_set(bio, 3);
@@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio,
s = mempool_alloc(d->c->search, GFP_NOIO);
 
	closure_init(&s->cl, NULL);
-   do_bio_hook(s, bio);
+   do_bio_hook(s, bio, request_endio);
 
s->orig_bio = bio;
s->cache_miss   = NULL;
@@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl)
trace_bcache_read_retry(s->orig_bio);
 
s->iop.status = 0;
-   do_bio_hook(s, s->orig_bio);
+   do_bio_hook(s, s->orig_bio, backing_request_endio);
 
/* XXX: invalidate cache */
 
+   /* I/O request sent to backing device */
closure_bio_submit(s->iop.c, bio, cl);
}
 
@@ -859,7 +901,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
bio_copy_dev(cache_bio, miss);
cache_bio->bi_iter.bi_size  = s->insert_bio_sectors << 9;
 
-   cache_bio->

[for-4.17 06/20] bcache: fix inaccurate io state for detached bcache devices

2018-03-18 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

When we run I/O on a detached device and run iostat to show the I/O
status, it normally looks like below (some fields omitted):
Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd... 15.89 0.531.820.202.23   1.81  52.30
bcache0... 15.89   115.420.000.000.00   2.40  69.60
but after the I/O has stopped, there are still very big avgqu-sz and
%util values, as below:
Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
bcache0   ...  0   5326.320.000.000.00   0.00 100.10

The reason for this issue is that only generic_start_io_acct() is
called, and generic_end_io_acct() is never called, for a detached
device in cached_dev_make_request(). See the code:
//start generic_start_io_acct()
generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
if (cached_dev_get(dc)) {
//will callback generic_end_io_acct()
}
else {
//will not call generic_end_io_acct()
}

This patch calls generic_end_io_acct() at the end of I/O for detached
devices, so the I/O state is shown correctly.

(Modified to use GFP_NOIO in kzalloc() by Coly Li)

Changelog:
v2: fix typo.
v1: the initial version.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/request.c | 58 +++--
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 7aca308bee5b..5c8ae69c8502 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -986,6 +986,55 @@ static void cached_dev_nodata(struct closure *cl)
continue_at(cl, cached_dev_bio_complete, NULL);
 }
 
+struct detached_dev_io_private {
+   struct bcache_device*d;
+   unsigned long   start_time;
+   bio_end_io_t*bi_end_io;
+   void*bi_private;
+};
+
+static void detached_dev_end_io(struct bio *bio)
+{
+   struct detached_dev_io_private *ddip;
+
+   ddip = bio->bi_private;
+   bio->bi_end_io = ddip->bi_end_io;
+   bio->bi_private = ddip->bi_private;
+
+   generic_end_io_acct(ddip->d->disk->queue,
+   bio_data_dir(bio),
+   &ddip->d->disk->part0, ddip->start_time);
+
+   kfree(ddip);
+
+   bio->bi_end_io(bio);
+}
+
+static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
+{
+   struct detached_dev_io_private *ddip;
+   struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+
+   /*
+* no need to call closure_get(>disk.cl),
+* because upper layer had already opened bcache device,
+* which would call closure_get(>disk.cl)
+*/
+   ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
+   ddip->d = d;
+   ddip->start_time = jiffies;
+   ddip->bi_end_io = bio->bi_end_io;
+   ddip->bi_private = bio->bi_private;
+   bio->bi_end_io = detached_dev_end_io;
+   bio->bi_private = ddip;
+
+   if ((bio_op(bio) == REQ_OP_DISCARD) &&
+   !blk_queue_discard(bdev_get_queue(dc->bdev)))
+   bio->bi_end_io(bio);
+   else
+   generic_make_request(bio);
+}
+
 /* Cached devices - read & write stuff */
 
 static blk_qc_t cached_dev_make_request(struct request_queue *q,
@@ -1028,13 +1077,8 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
else
cached_dev_read(dc, s);
}
-   } else {
-   if ((bio_op(bio) == REQ_OP_DISCARD) &&
-   !blk_queue_discard(bdev_get_queue(dc->bdev)))
-   bio_endio(bio);
-   else
-   generic_make_request(bio);
-   }
+   } else
+   detached_dev_do_request(d, bio);
 
return BLK_QC_T_NONE;
 }
-- 
2.14.1



[for-4.17 07/20] bcache: fix incorrect sysfs output value of strip size

2018-03-18 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

Stripe size is shown as zero when there is no stripe in the backing device:
[root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size
0.0k

Actually it should be 1 TB (1 << 31 sectors), but in the sysfs
interface stripe_size is converted from sectors to bytes by shifting
left by 9 bits, so the 32-bit variable overflows.

This patch casts the variable to a 64-bit type before shifting.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 414129f7c49f..8c3fd05db87a 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -181,7 +181,7 @@ SHOW(__bch_cached_dev)
sysfs_hprint(dirty_data,
	     bcache_dev_sectors_dirty(&dc->disk) << 9);
 
-   sysfs_hprint(stripe_size,   dc->disk.stripe_size << 9);
+   sysfs_hprint(stripe_size,   ((uint64_t)dc->disk.stripe_size) << 9);
var_printf(partial_stripes_expensive,   "%u");
 
var_hprint(sequential_cutoff);
-- 
2.14.1



[for-4.17 00/20] First 20 commits for bcache

2018-03-18 Thread Michael Lyle
Hi Jens---

Here's a series of patches for 4.17.  All have been reviewed and tested.
There's more in my review queue, but they're all things that are
"trickier" and will take a bit more work to convince myself are safe.

All have received simple creation/attach/detach/reboot scenario testing
and will continue to be tested.  Please apply at your leisure.

Thanks,

Mike

--
Bart Van Assche (8):
  bcache: Fix indentation
  bcache: Add __printf annotation to __bch_check_keys()
  bcache: Annotate switch fall-through
  bcache: Fix kernel-doc warnings
  bcache: Remove an unused variable
  bcache: Suppress more warnings about set-but-not-used variables
  bcache: Reduce the number of sparse complaints about lock imbalances
  bcache: Fix a compiler warning in bcache_device_init()

Chengguang Xu (1):
  bcache: move closure debug file into debug directory

Coly Li (7):
  bcache: fix cached_dev->count usage for bch_cache_set_error()
  bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
  bcache: stop dc->writeback_rate_update properly
  bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
  bcache: add stop_when_cache_set_failed option to backing device
  bcache: add backing_request_endio() for bi_end_io
  bcache: add io_disable to struct cached_dev

Tang Junhui (4):
  bcache: fix inaccurate io state for detached bcache devices
  bcache: fix incorrect sysfs output value of strip size
  bcache: fix error return value in memory shrink
  bcache: fix using of loop variable in memory shrink

 drivers/md/bcache/alloc.c |   3 +-
 drivers/md/bcache/bcache.h|  57 -
 drivers/md/bcache/bset.c  |   4 +-
 drivers/md/bcache/bset.h  |   5 +-
 drivers/md/bcache/btree.c |  26 +++---
 drivers/md/bcache/closure.c   |  17 ++--
 drivers/md/bcache/closure.h   |   5 +-
 drivers/md/bcache/debug.c |  14 ++--
 drivers/md/bcache/extents.c   |   2 -
 drivers/md/bcache/io.c|  16 +++-
 drivers/md/bcache/journal.c   |   8 +-
 drivers/md/bcache/request.c   | 184 +++---
 drivers/md/bcache/super.c | 154 ++-
 drivers/md/bcache/sysfs.c |  55 -
 drivers/md/bcache/util.c  |  25 +++---
 drivers/md/bcache/util.h  |   6 --
 drivers/md/bcache/writeback.c |  92 ++---
 drivers/md/bcache/writeback.h |   4 +-
 18 files changed, 552 insertions(+), 125 deletions(-)

-- 
2.14.1



[for-4.17 17/20] bcache: Remove an unused variable

2018-03-18 Thread Michael Lyle
From: Bart Van Assche <bart.vanass...@wdc.com>

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/extents.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/md/bcache/extents.c b/drivers/md/bcache/extents.c
index f9d391711595..c334e461 100644
--- a/drivers/md/bcache/extents.c
+++ b/drivers/md/bcache/extents.c
@@ -534,7 +534,6 @@ static bool bch_extent_bad_expensive(struct btree *b, const struct bkey *k,
 static bool bch_extent_bad(struct btree_keys *bk, const struct bkey *k)
 {
struct btree *b = container_of(bk, struct btree, keys);
-   struct bucket *g;
unsigned i, stale;
 
if (!KEY_PTRS(k) ||
@@ -549,7 +548,6 @@ static bool bch_extent_bad(struct btree_keys *bk, const struct bkey *k)
return false;
 
for (i = 0; i < KEY_PTRS(k); i++) {
-   g = PTR_BUCKET(b->c, k, i);
stale = ptr_stale(b->c, k, i);
 
btree_bug_on(stale > 96, b,
-- 
2.14.1



[for-4.17 14/20] bcache: Add __printf annotation to __bch_check_keys()

2018-03-18 Thread Michael Lyle
From: Bart Van Assche <bart.vanass...@wdc.com>

Make it possible for the compiler to verify the consistency of the
format string passed to __bch_check_keys() and the arguments that
should be formatted according to that format string.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bset.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/bset.h b/drivers/md/bcache/bset.h
index fa506c1aa524..0c24280f3b98 100644
--- a/drivers/md/bcache/bset.h
+++ b/drivers/md/bcache/bset.h
@@ -531,14 +531,15 @@ int __bch_keylist_realloc(struct keylist *, unsigned);
 #ifdef CONFIG_BCACHE_DEBUG
 
 int __bch_count_data(struct btree_keys *);
-void __bch_check_keys(struct btree_keys *, const char *, ...);
+void __printf(2, 3) __bch_check_keys(struct btree_keys *, const char *, ...);
 void bch_dump_bset(struct btree_keys *, struct bset *, unsigned);
 void bch_dump_bucket(struct btree_keys *);
 
 #else
 
 static inline int __bch_count_data(struct btree_keys *b) { return -1; }
-static inline void __bch_check_keys(struct btree_keys *b, const char *fmt, ...) {}
+static inline void __printf(2, 3)
+   __bch_check_keys(struct btree_keys *b, const char *fmt, ...) {}
 static inline void bch_dump_bucket(struct btree_keys *b) {}
 void bch_dump_bset(struct btree_keys *, struct bset *, unsigned);
 
-- 
2.14.1



[for-4.17 18/20] bcache: Suppress more warnings about set-but-not-used variables

2018-03-18 Thread Michael Lyle
From: Bart Van Assche <bart.vanass...@wdc.com>

This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bset.c| 4 ++--
 drivers/md/bcache/journal.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index e56d3ecdbfcb..579c696a5fe0 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -1072,7 +1072,7 @@ EXPORT_SYMBOL(bch_btree_iter_init);
 static inline struct bkey *__bch_btree_iter_next(struct btree_iter *iter,
 btree_iter_cmp_fn *cmp)
 {
-   struct btree_iter_set unused;
+   struct btree_iter_set b __maybe_unused;
struct bkey *ret = NULL;
 
if (!btree_iter_end(iter)) {
@@ -1087,7 +1087,7 @@ static inline struct bkey *__bch_btree_iter_next(struct btree_iter *iter,
}
 
if (iter->data->k == iter->data->end)
-   heap_pop(iter, unused, cmp);
+   heap_pop(iter, b, cmp);
else
heap_sift(iter, 0, cmp);
}
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index c94085f400a4..acd0e5c074dd 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -493,7 +493,7 @@ static void journal_reclaim(struct cache_set *c)
struct cache *ca;
uint64_t last_seq;
unsigned iter, n = 0;
-   atomic_t p;
+   atomic_t p __maybe_unused;
 
	atomic_long_inc(&c->reclaim);
 
-- 
2.14.1



[for-4.17 08/20] bcache: fix error return value in memory shrink

2018-03-18 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

In bch_mca_scan(), the return value should not be the number of freed btree
nodes, but the number of pages of freed btree nodes.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/btree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 39cc8a549091..b2d4899f48d5 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -719,7 +719,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
}
 out:
	mutex_unlock(&c->bucket_lock);
-   return freed;
+   return freed * c->btree_pages;
 }
 
 static unsigned long bch_mca_count(struct shrinker *shrink,
-- 
2.14.1



[for-4.17 09/20] bcache: fix using of loop variable in memory shrink

2018-03-18 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

In bch_mca_scan(), there is some confusion and a logical error in the
use of the loop variables. This patch clarifies them as:
1) nr: the number of btree nodes left to scan, which decreases after a
btree node is scanned, and should never go below 0;
2) i: the number of btree nodes scanned so far, counting both
btree_cache_freeable and btree_cache, which should not exceed
btree_cache_used;
3) freed: the number of btree nodes freed.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/btree.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index b2d4899f48d5..d64aff0b8abc 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -665,6 +665,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
struct btree *b, *t;
unsigned long i, nr = sc->nr_to_scan;
unsigned long freed = 0;
+   unsigned int btree_cache_used;
 
if (c->shrinker_disabled)
return SHRINK_STOP;
@@ -689,9 +690,10 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
nr = min_t(unsigned long, nr, mca_can_free(c));
 
i = 0;
+   btree_cache_used = c->btree_cache_used;
	list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
-   if (freed >= nr)
-   break;
+   if (nr <= 0)
+   goto out;
 
if (++i > 3 &&
!mca_reap(b, 0, false)) {
@@ -699,9 +701,10 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
rw_unlock(true, b);
freed++;
}
+   nr--;
}
 
-   for (i = 0; (nr--) && i < c->btree_cache_used; i++) {
+   for (;  (nr--) && i < btree_cache_used; i++) {
	if (list_empty(&c->btree_cache))
goto out;
 
-- 
2.14.1



[for-4.17 10/20] bcache: move closure debug file into debug directory

2018-03-18 Thread Michael Lyle
From: Chengguang Xu <cgxu...@gmx.com>

In the current code the closure debug file is outside of the debug
directory, and there is no removal of the closure debug file when the
module is unloaded, so creating it fails when the module is reloaded.

This patch moves the closure debug file into the "bcache" debug
directory so that the file can get deleted properly.

Signed-off-by: Chengguang Xu <cgxu...@gmx.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/closure.c |  9 +
 drivers/md/bcache/closure.h |  5 +++--
 drivers/md/bcache/debug.c   | 14 +++---
 drivers/md/bcache/super.c   |  3 +--
 4 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 7f12920c14f7..c0949c9f843b 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -157,7 +157,7 @@ void closure_debug_destroy(struct closure *cl)
 }
 EXPORT_SYMBOL(closure_debug_destroy);
 
-static struct dentry *debug;
+static struct dentry *closure_debug;
 
 static int debug_seq_show(struct seq_file *f, void *data)
 {
@@ -199,11 +199,12 @@ static const struct file_operations debug_ops = {
	.release	= single_release
 };
 
-void __init closure_debug_init(void)
+int __init closure_debug_init(void)
 {
-   debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
+   closure_debug = debugfs_create_file("closures",
+   0400, bcache_debug, NULL, &debug_ops);
+   return IS_ERR_OR_NULL(closure_debug);
 }
-
 #endif
 
 MODULE_AUTHOR("Kent Overstreet <koverstr...@google.com>");
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 3b9dfc9962ad..71427eb5fdae 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -105,6 +105,7 @@
 struct closure;
 struct closure_syncer;
 typedef void (closure_fn) (struct closure *);
+extern struct dentry *bcache_debug;
 
 struct closure_waitlist {
struct llist_head   list;
@@ -185,13 +186,13 @@ static inline void closure_sync(struct closure *cl)
 
 #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
 
-void closure_debug_init(void);
+int closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
+static inline int closure_debug_init(void) { return 0; }
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index af89408befe8..028f7b386e01 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -17,7 +17,7 @@
 #include 
 #include 
 
-static struct dentry *debug;
+struct dentry *bcache_debug;
 
 #ifdef CONFIG_BCACHE_DEBUG
 
@@ -232,11 +232,11 @@ static const struct file_operations cache_set_debug_ops = {
 
 void bch_debug_init_cache_set(struct cache_set *c)
 {
-   if (!IS_ERR_OR_NULL(debug)) {
+   if (!IS_ERR_OR_NULL(bcache_debug)) {
char name[50];
snprintf(name, 50, "bcache-%pU", c->sb.set_uuid);
 
-   c->debug = debugfs_create_file(name, 0400, debug, c,
+   c->debug = debugfs_create_file(name, 0400, bcache_debug, c,
   &cache_set_debug_ops);
}
 }
@@ -245,13 +245,13 @@ void bch_debug_init_cache_set(struct cache_set *c)
 
 void bch_debug_exit(void)
 {
-   if (!IS_ERR_OR_NULL(debug))
-   debugfs_remove_recursive(debug);
+   if (!IS_ERR_OR_NULL(bcache_debug))
+   debugfs_remove_recursive(bcache_debug);
 }
 
 int __init bch_debug_init(struct kobject *kobj)
 {
-   debug = debugfs_create_dir("bcache", NULL);
+   bcache_debug = debugfs_create_dir("bcache", NULL);
 
-   return IS_ERR_OR_NULL(debug);
+   return IS_ERR_OR_NULL(bcache_debug);
 }
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index dab9f59a1a2f..26b7a7d4df15 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2239,7 +2239,6 @@ static int __init bcache_init(void)
	mutex_init(&bch_register_lock);
	init_waitqueue_head(&unregister_wait);
	register_reboot_notifier(&reboot);
-   closure_debug_init();
 
bcache_major = register_blkdev(0, "bcache");
if (bcache_major < 0) {
@@ -2251,7 +2250,7 @@ static int __init bcache_init(void)
if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) ||
!(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
bch_request_init() ||
-   bch_debug_init(bcache_kobj) ||
+   bch_debug_init(bcache_kobj) || closure_debug_init() ||
sysfs_create_files(bcache_kobj, files))
goto err;
 
-- 
2.14.1



[for-4.17 01/20] bcache: fix cached_dev->count usage for bch_cache_set_error()

2018-03-18 Thread Michael Lyle
From: Coly Li <col...@suse.de>

When bcache metadata I/O fails, bcache will call bch_cache_set_error()
to retire the whole cache set. The expected behavior to retire a cache
set is to unregister the cache set, and unregister all backing device
attached to this cache set, then remove sysfs entries of the cache set
and all attached backing devices, finally release memory of structs
cache_set, cache, cached_dev and bcache_device.

In my testing, when a journal I/O failure is triggered by a
disconnected cache device, sometimes the cache set cannot be retired,
and its sysfs entry /sys/fs/bcache/ still exists and the backing
device also references it. This is not the expected behavior.

When metadata I/O fails, the call sequence to retire the whole cache set is:
bch_cache_set_error()
  bch_cache_set_unregister()
  bch_cache_set_stop()
    __cache_set_unregister()   <- called as a callback by calling
                                  closure_queue(&c->caching)
    cache_set_flush()          <- called as a callback when the refcount
                                  of cache_set->caching is 0
    cache_set_free()           <- called as a callback when the refcount
                                  of cache_set->cl is 0
    bch_cache_set_release()    <- called as a callback when the refcount
                                  of cache_set->kobj is 0

I find that if the kernel thread bch_writeback_thread() quits its
while-loop when kthread_should_stop() is true and searched_full_index
is false, the closure callback cache_set_flush() set by continue_at()
will never be called. The result is that bcache fails to retire the
whole cache set.

cache_set_flush() will be called when the refcount of closure
c->caching is 0, and in function bcache_device_detach() the refcount of
closure c->caching is released to 0 by closure_put(). In the metadata
error code path, function bcache_device_detach() is called by
cached_dev_detach_finish(), a callback routine invoked when
cached_dev->count is 0. That refcount is decreased by cached_dev_put().

The above dependency indicates that cache_set_flush() will be called
when the refcount of cache_set->cl is 0, and the refcount of
cache_set->cl reaches 0 when the refcount of cached_dev->count is 0.

The reason why sometimes cached_dev->count is not 0 (when metadata I/O
fails and bch_cache_set_error() is called) is that in
bch_writeback_thread() the refcount of cached_dev is not decreased
properly.

In bch_writeback_thread(), cached_dev_put() is called only when
searched_full_index is true and cached_dev->writeback_keys is empty,
i.e. there is no dirty data on the cache. Most of the time this is
correct, but when bch_writeback_thread() quits the while-loop while the
cache is still dirty, the current code forgets to call cached_dev_put()
before the kernel thread exits. This is why sometimes cache_set_flush()
is not executed and the cache set fails to be retired.

The reason to call cached_dev_put() in bch_writeback_rate() is that
when the cache device changes from clean to dirty, cached_dev_get() is
called to make sure that neither the backing nor the cache device is
released during writeback operations.

Adding following code in bch_writeback_thread() does not work,
   static int bch_writeback_thread(void *arg)
}

+   if (atomic_read(&dc->has_dirty))
+   cached_dev_put()
+
return 0;
 }
because the writeback kernel thread can be woken up and started via the
sysfs entry:
echo 1 > /sys/block/bcache/bcache/writeback_running
It is difficult to check whether the backing device is dirty without
races and extra locking, so the above modification would introduce a
potential refcount underflow in some conditions.

The correct fix is to take the cached dev refcount when creating the
kernel thread, and to put it before the kernel thread exits. Then
bcache does not need to take a cached dev refcount when the cache turns
from clean to dirty, or to put one when the cache turns from dirty to
clean. The writeback kernel thread is always safe to reference data
structures from the cache set, cache and cached device (because a
refcount of the cached device is already taken for it), and no matter
whether the kernel thread is stopped by I/O errors or a system reboot,
cached_dev->count can always be used correctly.

The patch is simple, but understanding how it works is quite complicated.

Changelog:
v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes.
v1: initial version for review.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Michael Lyle <ml...@lyle.org>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/super.c |  1 -
 drivers/md/bcache/writeback.c | 11 ---
 drivers/md/bcache/writeback.h |  2 --
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f

[for-4.17 12/20] bcache: add io_disable to struct cached_dev

2018-03-18 Thread Michael Lyle
From: Coly Li <col...@suse.de>

If a bcache device is configured in writeback mode, the current code
does not handle write I/O errors on the backing device properly.

In writeback mode, a write request is written to the cache device and
later flushed to the backing device. If I/O fails when writing from the
cache device to the backing device, the bcache code just ignores the
error, and the upper layer is NOT notified that the backing device is
broken.

This patch tries to handle backing device failure the same way cache
device failure is handled:
- Add an error counter 'io_errors' and an error limit 'error_limit' to
  struct cached_dev. Also add io_disable to struct cached_dev to
  disable I/Os on the problematic backing device.
- When an I/O error happens on the backing device, increase the
  io_errors counter, and if io_errors reaches error_limit, set
  cached_dev->io_disable to true and stop the bcache device.

The result is that if the backing device is broken or disconnected, and
I/O errors reach its error limit, the backing device will be disabled
and the associated bcache device will be removed from the system.

Changelog:
v2: remove "bcache: " prefix in pr_error(), and use correct name string to
print out bcache device gendisk name.
v1: indeed this is new added in v2 patch set.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Michael Lyle <ml...@lyle.org>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/bcache.h  |  6 ++
 drivers/md/bcache/io.c  | 14 ++
 drivers/md/bcache/request.c | 14 --
 drivers/md/bcache/super.c   | 21 +
 drivers/md/bcache/sysfs.c   | 15 ++-
 5 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5e9f3610c6fd..d338b7086013 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -367,6 +367,7 @@ struct cached_dev {
	unsigned		sequential_cutoff;
	unsigned		readahead;

+	unsigned		io_disable:1;
	unsigned		verify:1;
	unsigned		bypass_torture_test:1;

@@ -388,6 +389,9 @@ struct cached_dev {
	unsigned		writeback_rate_minimum;

	enum stop_on_failure	stop_when_cache_set_failed;
+#define DEFAULT_CACHED_DEV_ERROR_LIMIT	64
+	atomic_t		io_errors;
+	unsigned		error_limit;
 };
 
 enum alloc_reserve {
@@ -911,6 +915,7 @@ static inline void wait_for_kthread_stop(void)
 
 /* Forward declarations */
 
+void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio);
 void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
 void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
  blk_status_t, const char *);
@@ -938,6 +943,7 @@ int bch_bucket_alloc_set(struct cache_set *, unsigned,
 struct bkey *, int, bool);
 bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned,
   unsigned, unsigned, bool);
+bool bch_cached_dev_error(struct cached_dev *dc);
 
 __printf(2, 3)
 bool bch_cache_set_error(struct cache_set *, const char *, ...);
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index 8013ecbcdbda..7fac97ae036e 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -50,6 +50,20 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
 }
 
 /* IO errors */
+void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
+{
+   char buf[BDEVNAME_SIZE];
+   unsigned errors;
+
+   WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
+
+   errors = atomic_add_return(1, &dc->io_errors);
+   if (errors < dc->error_limit)
+   pr_err("%s: IO error on backing device, unrecoverable",
+   bio_devname(bio, buf));
+   else
+   bch_cached_dev_error(dc);
+}
 
 void bch_count_io_errors(struct cache *ca,
 blk_status_t error,
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index b4a5768afbe9..5a82237c7025 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -637,6 +637,8 @@ static void backing_request_endio(struct bio *bio)
 
if (bio->bi_status) {
struct search *s = container_of(cl, struct search, cl);
+   struct cached_dev *dc = container_of(s->d,
+struct cached_dev, disk);
/*
 * If a bio has REQ_PREFLUSH for writeback mode, it is
 * speically assembled in cached_dev_write() for a non-zero
@@ -657,6 +659,7 @@ static void backing_request_endio(struct bio *bio)
}
s->

[for-4.17 13/20] bcache: Fix indentation

2018-03-18 Thread Michael Lyle
From: Bart Van Assche <bart.vanass...@wdc.com>

This patch avoids that smatch complains about inconsistent indentation.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Coly Li <col...@suse.de>
---
 drivers/md/bcache/btree.c | 2 +-
 drivers/md/bcache/writeback.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index d64aff0b8abc..143ed5a758e7 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2178,7 +2178,7 @@ int bch_btree_insert_check_key(struct btree *b, struct btree_op *op,
 
if (b->key.ptr[0] != btree_ptr ||
b->seq != seq + 1) {
-   op->lock = b->level;
+   op->lock = b->level;
goto out;
}
}
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 0bba8f1c6cdf..610fb01de629 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -39,7 +39,7 @@ static inline uint64_t bcache_flash_devs_sectors_dirty(struct cache_set *c)
 
	if (!d || !UUID_FLASH_ONLY(&d->uuids[i]))
continue;
-  ret += bcache_dev_sectors_dirty(d);
+   ret += bcache_dev_sectors_dirty(d);
}
 
	mutex_unlock(&bch_register_lock);
-- 
2.14.1



Re: [PATCH 07/16] bcache: Reduce the number of sparse complaints about lock imbalances

2018-03-16 Thread Michael Lyle
On 03/15/2018 08:08 AM, Bart Van Assche wrote:
> Add more annotations for sparse to inform it about which functions do
> not have the same number of spin_lock() and spin_unlock() calls.
> 
> Signed-off-by: Bart Van Assche 

LGTM, applying.  -mpl


Re: [PATCH 05/16] bcache: Remove an unused variable

2018-03-16 Thread Michael Lyle
On 03/15/2018 08:08 AM, Bart Van Assche wrote:
> Signed-off-by: Bart Van Assche 

LGTM, applying.  (I don't see a problem with empty body).


Re: [PATCH 03/16] bcache: Annotate switch fall-through

2018-03-16 Thread Michael Lyle
On 03/15/2018 08:08 AM, Bart Van Assche wrote:
> This patch avoids that building with W=1 triggers complaints about
> switch fall-throughs.
> 
> Signed-off-by: Bart Van Assche 

LGTM, applying.


Re: [PATCH 04/16] bcache: Fix kernel-doc warnings

2018-03-16 Thread Michael Lyle
On 03/15/2018 08:08 AM, Bart Van Assche wrote:
> Avoid that building with W=1 triggers warnings about the kernel-doc
> headers.
> 
> Signed-off-by: Bart Van Assche 

LGTM, applying.


Re: [PATCH 02/16] bcache: Add __printf annotation to __bch_check_keys()

2018-03-16 Thread Michael Lyle
On Thu, Mar 15, 2018 at 8:08 AM, Bart Van Assche  wrote:
> Make it possible for the compiler to verify the consistency of the
> format string passed to __bch_check_keys() and the arguments that
> should be formatted according to that format string.
>
> Signed-off-by: Bart Van Assche 

LGTM, applying.


Re: [PATCH 01/16] bcache: Fix indentation

2018-03-16 Thread Michael Lyle
On Thu, Mar 15, 2018 at 8:07 AM, Bart Van Assche  wrote:
> This patch avoids smatch complaints about inconsistent indentation.
>
> Signed-off-by: Bart Van Assche 
LGTM, applying


Re: [PATCH 08/16] bcache: Fix a compiler warning in bcache_device_init()

2018-03-16 Thread Michael Lyle
On Thu, Mar 15, 2018 at 8:08 AM, Bart Van Assche  wrote:
> Avoid that building with W=1 triggers the following compiler warning:
>
> drivers/md/bcache/super.c:776:20: warning: comparison is always false due to 
> limited range of data type [-Wtype-limits]
>   d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
> ^
>
> Signed-off-by: Bart Van Assche 

LGTM, applying.


Re: [PATCH v7 8/9] bcache: add io_disable to struct cached_dev

2018-03-14 Thread Michael Lyle
On 02/27/2018 08:55 AM, Coly Li wrote:
> If a bcache device is configured in writeback mode, the current code does
> not handle write I/O errors on the backing device properly.
> 
> In writeback mode, a write request is written to the cache device and
> later flushed to the backing device. If an I/O fails when writing from
> the cache device to the backing device, the bcache code just ignores the
> error and the upper layer is NOT notified that the backing device is broken.

lgtm, applied


Re: [PATCH v7 8/9] bcache: add io_disable to struct cached_dev

2018-03-14 Thread Michael Lyle
LGTM, applying.

On 02/27/2018 08:55 AM, Coly Li wrote:
> If a bcache device is configured in writeback mode, the current code does
> not handle write I/O errors on the backing device properly.
> 
> In writeback mode, a write request is written to the cache device and
> later flushed to the backing device. If an I/O fails when writing from
> the cache device to the backing device, the bcache code just ignores the
> error and the upper layer is NOT notified that the backing device is broken.
> 
> This patch tries to handle backing device failure the same way cache
> device failure is handled:
> - Add an error counter 'io_errors' and an error limit 'error_limit' in
>   struct cached_dev. Add another io_disable to struct cached_dev to
>   disable I/Os on the problematic backing device.
> - When an I/O error happens on the backing device, increase the io_errors
>   counter. And if io_errors reaches error_limit, set dc->io_disable to
>   true, and stop the bcache device.
> 
> The result is, if the backing device is broken or disconnected, and I/O
> errors reach its error limit, the backing device will be disabled and the
> associated bcache device will be removed from the system.
> 
> Changelog:
> v2: remove "bcache: " prefix in pr_error(), and use correct name string to
> print out bcache device gendisk name.
> v1: indeed this is new added in v2 patch set.
> 
> Signed-off-by: Coly Li <col...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> Cc: Michael Lyle <ml...@lyle.org>
> Cc: Junhui Tang <tang.jun...@zte.com.cn>
> ---
>  drivers/md/bcache/bcache.h  |  6 ++
>  drivers/md/bcache/io.c  | 14 ++
>  drivers/md/bcache/request.c | 14 --
>  drivers/md/bcache/super.c   | 21 +
>  drivers/md/bcache/sysfs.c   | 15 ++-
>  5 files changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 5e9f3610c6fd..d338b7086013 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -367,6 +367,7 @@ struct cached_dev {
>   unsigned                sequential_cutoff;
>   unsigned                readahead;
>  
> + unsigned                io_disable:1;
>   unsigned                verify:1;
>   unsigned                bypass_torture_test:1;
>  
> @@ -388,6 +389,9 @@ struct cached_dev {
>   unsigned                writeback_rate_minimum;
>  
>   enum stop_on_failure    stop_when_cache_set_failed;
> +#define DEFAULT_CACHED_DEV_ERROR_LIMIT   64
> + atomic_t                io_errors;
> + unsigned                error_limit;
>  };
>  
>  enum alloc_reserve {
> @@ -911,6 +915,7 @@ static inline void wait_for_kthread_stop(void)
>  
>  /* Forward declarations */
>  
> +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio);
>  void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
>  void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
> blk_status_t, const char *);
> @@ -938,6 +943,7 @@ int bch_bucket_alloc_set(struct cache_set *, unsigned,
>struct bkey *, int, bool);
>  bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned,
>  unsigned, unsigned, bool);
> +bool bch_cached_dev_error(struct cached_dev *dc);
>  
>  __printf(2, 3)
>  bool bch_cache_set_error(struct cache_set *, const char *, ...);
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index 8013ecbcdbda..7fac97ae036e 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -50,6 +50,20 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
>  }
>  
>  /* IO errors */
> +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
> +{
> + char buf[BDEVNAME_SIZE];
> + unsigned errors;
> +
> + WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
> +
> + errors = atomic_add_return(1, &dc->io_errors);
> + if (errors < dc->error_limit)
> + pr_err("%s: IO error on backing device, unrecoverable",
> + bio_devname(bio, buf));
> + else
> + bch_cached_dev_error(dc);
> +}
>  
>  void bch_count_io_errors(struct cache *ca,
>blk_status_t error,
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 0c517dd806a5..d7a463e0250e 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -637,6 +637,8 @@ static void backing_request_endio(struct bio *bio)
>  
>   if (bio->bi_status) {
>   struct search *s = container_o
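
The accounting scheme Coly describes above, stripped to its essentials, can be sketched in userspace C. Everything here (struct and function names included) is illustrative, using C11 atomics in place of the kernel's atomic_t:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sketch of the io_errors / error_limit / io_disable scheme
 * described above; this is NOT the kernel code. */
struct backing_dev {
    atomic_uint io_errors;   /* counts backing-device I/O errors */
    unsigned    error_limit; /* threshold, e.g. 64 in the patch */
    atomic_bool io_disable;  /* once set, further I/O is refused */
};

/* Count one backing-device I/O error; disable the device when the error
 * count reaches the limit.  Returns true if the device is now disabled. */
static bool count_backing_io_error(struct backing_dev *dc)
{
    unsigned errors = atomic_fetch_add(&dc->io_errors, 1) + 1;

    if (errors >= dc->error_limit)
        atomic_store(&dc->io_disable, true);
    return atomic_load(&dc->io_disable);
}
```

Once io_disable is observed true, the submission path can bail out early, which is what allows the bcache device to be stopped cleanly.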

Re: [PATCH v7 7/9] bcache: add backing_request_endio() for bi_end_io of attached backing device I/O

2018-03-14 Thread Michael Lyle
LGTM, applying

On 02/27/2018 08:55 AM, Coly Li wrote:
> In order to catch I/O errors on the backing device, a separate bi_end_io
> callback is required. Then a per-backing-device counter can record the
> number of I/O errors and retire the backing device if the counter reaches
> a per-backing-device I/O error limit.
> 
> This patch adds backing_request_endio() to the bcache backing device I/O
> code path, as a preparation for more complicated backing device failure
> handling. So far there is no real code logic change; I make this change a
> separate patch to make sure it is stable and reliable for further work.
> 
> Changelog:
> v2: Fix code comments typo, remove a redundant bch_writeback_add() line
> added in v4 patch set.
> v1: indeed this is new added in this patch set.
> 
> Signed-off-by: Coly Li <col...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> Cc: Junhui Tang <tang.jun...@zte.com.cn>
> Cc: Michael Lyle <ml...@lyle.org>
> ---
>  drivers/md/bcache/request.c   | 93 
> +++
>  drivers/md/bcache/super.c |  1 +
>  drivers/md/bcache/writeback.c |  1 +
>  3 files changed, 79 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 279c9266bf50..0c517dd806a5 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl)
>   }
>  
>   op->insert_data_done = true;
> + /* get in bch_data_insert() */
>   bio_put(bio);
>  out:
>   continue_at(cl, bch_data_insert_keys, op->wq);
> @@ -630,6 +631,38 @@ static void request_endio(struct bio *bio)
>   closure_put(cl);
>  }
>  
> +static void backing_request_endio(struct bio *bio)
> +{
> + struct closure *cl = bio->bi_private;
> +
> + if (bio->bi_status) {
> + struct search *s = container_of(cl, struct search, cl);
> + /*
> +  * If a bio has REQ_PREFLUSH for writeback mode, it is
> +  * specially assembled in cached_dev_write() for a non-zero
> +  * write request which has REQ_PREFLUSH. we don't set
> +  * s->iop.status by this failure, the status will be decided
> +  * by result of bch_data_insert() operation.
> +  */
> + if (unlikely(s->iop.writeback &&
> +  bio->bi_opf & REQ_PREFLUSH)) {
> + char buf[BDEVNAME_SIZE];
> +
> + bio_devname(bio, buf);
> + pr_err("Can't flush %s: returned bi_status %i",
> + buf, bio->bi_status);
> + } else {
> + /* set to orig_bio->bi_status in bio_complete() */
> + s->iop.status = bio->bi_status;
> + }
> + s->recoverable = false;
> + /* should count I/O error for backing device here */
> + }
> +
> + bio_put(bio);
> + closure_put(cl);
> +}
> +
>  static void bio_complete(struct search *s)
>  {
>   if (s->orig_bio) {
> @@ -644,13 +677,21 @@ static void bio_complete(struct search *s)
>   }
>  }
>  
> -static void do_bio_hook(struct search *s, struct bio *orig_bio)
> +static void do_bio_hook(struct search *s,
> + struct bio *orig_bio,
> + bio_end_io_t *end_io_fn)
>  {
>   struct bio *bio = &s->bio.bio;
>  
>   bio_init(bio, NULL, 0);
>   __bio_clone_fast(bio, orig_bio);
> - bio->bi_end_io  = request_endio;
> + /*
> +  * bi_end_io can be set separately somewhere else, e.g. the
> +  * variants in,
> +  * - cache_bio->bi_end_io from cached_dev_cache_miss()
> +  * - n->bi_end_io from cache_lookup_fn()
> +  */
> + bio->bi_end_io  = end_io_fn;
>   bio->bi_private = >cl;
>  
>   bio_cnt_set(bio, 3);
> @@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio,
>   s = mempool_alloc(d->c->search, GFP_NOIO);
>  
>   closure_init(>cl, NULL);
> - do_bio_hook(s, bio);
> + do_bio_hook(s, bio, request_endio);
>  
>   s->orig_bio = bio;
>   s->cache_miss   = NULL;
> @@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl)
>   trace_bcache_read_retry(s->orig_bio);
>  
>   s->iop.status = 0;
> - do_bio_hook(s, s->orig_bio);
> + do_bio_hook(s, s->orig_bio, backing_request_endio);
>  

Re: [PATCH v2] bcache: move closure debug file into debug direcotry

2018-03-07 Thread Michael Lyle
On 03/06/2018 07:23 PM, Chengguang Xu wrote:
> In the current code the closure debug file is outside of the debug
> directory, and when unloading the module the closure debug file is not
> removed, so it causes a creation error when trying to reload the module.
> 
> This patch moves the closure debug file into the "bcache" debug directory
> so that the file can get deleted properly.
> 
> Signed-off-by: Chengguang Xu 

Thanks!  This looks much better and is applied.

Mike


Re: [PATCH v2] bcache: move closure debug file into debug direcotry

2018-03-06 Thread Michael Lyle
Sorry- I had to pull/unapply this actually.

On 03/04/2018 11:40 PM, Chengguang Xu wrote:
> -static struct dentry *debug;
> +struct dentry *debug;

This conflicts with other symbols called "debug" and doesn't compile.
Please be sure that your patch set compiles before submitting.

Mike

>  
>  #ifdef CONFIG_BCACHE_DEBUG
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 1a9fdab..b784292 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2133,7 +2133,6 @@ static int __init bcache_init(void)
>   mutex_init(&bch_register_lock);
>   init_waitqueue_head(&unregister_wait);
>   register_reboot_notifier(&reboot);
> - closure_debug_init();
>  
>   bcache_major = register_blkdev(0, "bcache");
>   if (bcache_major < 0) {
> @@ -2145,7 +2144,7 @@ static int __init bcache_init(void)
>   if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) ||
>   !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
>   bch_request_init() ||
> - bch_debug_init(bcache_kobj) ||
> + bch_debug_init(bcache_kobj) || closure_debug_init() ||
>   sysfs_create_files(bcache_kobj, files))
>   goto err;
>  
> 



Re: Some bcache patches need review

2018-03-05 Thread Michael Lyle
Hey Tang Junhui--

You're right.  I'm sorry, these had slipped through the cracks / I had
lost track of them.  I also have the rest of Coly's patchset to work
through.  I was able to apply the first two, and will continue to work
on my for-next branch.

Mike


On 03/05/2018 02:39 AM, tang.jun...@zte.com.cn wrote:
> Hello Mike
> 
> I sent some patches a while ago, with no response.
> 
> The patches below are very simple; can you or anybody else review them?
> [PATCH] bcache: fix incorrect sysfs output value of strip size
> [PATCH] bcache: fix error return value in memory shrink
> [PATCH] bcache: fix error count in memory shrink
> 
> The patches below are simple too, and important for me and also important
> for bcache; can you or anybody else review them?
> [PATCH] bcache: finish incremental GC
> [PATCH] bcache: calculate the number of incremental GC nodes according to the 
> total of btree nodes
> 
> Thanks,
> Tang Junhui
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



Re: [PATCH] bcache: fix error count in memory shrink

2018-03-05 Thread Michael Lyle
Hi Tang Junhui---

I'm not really sure about this one.  It changes the semantics of the
amount of work done-- nr_to_scan now means number of things to free
instead of the number to check.

If the system is under severe memory pressure, and most of the cache is
essential/actively used, this could greatly increase the amount of work
done.

As to the i < btree_cache_used, it's arguably a bug, but it only reduces
the amount of work done if the cache is very, very small.

Mike

On 01/30/2018 06:46 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> In bch_mca_scan(), nr btree nodes are expected to be shrunk, so the
> for(;;) loop needs to satisfy the condition freed < nr.
> And since c->btree_cache_used always decreases after mca_data_free() is
> called in the for(;;) loop, we need an auto variable to record
> c->btree_cache_used before the for(;;) loop.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/btree.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81e8dc3..7e9ef3d 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -664,6 +664,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
>   struct btree *b, *t;
>   unsigned long i, nr = sc->nr_to_scan;
>   unsigned long freed = 0;
> + unsigned int btree_cache_used;
>  
>   if (c->shrinker_disabled)
>   return SHRINK_STOP;
> @@ -700,7 +701,8 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
>   }
>   }
>  
> - for (i = 0; (nr--) && i < c->btree_cache_used; i++) {
> + btree_cache_used = c->btree_cache_used;
> + for (i = 0; freed < nr && i < btree_cache_used; i++) {
>   if (list_empty(&c->btree_cache))
>   goto out;
>  
> 
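
The semantic difference under discussion can be shown in isolation. The sketch below is hypothetical userspace C, not the kernel shrinker: `scan_checked()` treats `nr` as a bound on the nodes examined (the old behaviour), while `scan_freed()` treats it as the number of nodes that must be freed (the patch's behaviour), so under memory pressure it may walk much more of the cache:

```c
#include <assert.h>
#include <stddef.h>

/* Old semantics: 'nr' (sc->nr_to_scan) bounds the number of nodes checked. */
static unsigned long scan_checked(const int *freeable, size_t cache_used,
                                  unsigned long nr)
{
    unsigned long freed = 0;

    for (size_t i = 0; nr-- && i < cache_used; i++)
        if (freeable[i])
            freed++;
    return freed;
}

/* Patched semantics: 'nr' bounds the number of nodes actually freed, so
 * the loop keeps walking past non-freeable nodes until it succeeds 'nr'
 * times or runs out of cache. */
static unsigned long scan_freed(const int *freeable, size_t cache_used,
                                unsigned long nr)
{
    unsigned long freed = 0;

    for (size_t i = 0; freed < nr && i < cache_used; i++)
        if (freeable[i])
            freed++;
    return freed;
}
```

With a mostly in-use cache, the first variant frees nothing after checking `nr` nodes, while the second keeps scanning until it finds `nr` freeable ones.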



Re: [PATCH] bcache: fix error return value in memory shrink

2018-03-05 Thread Michael Lyle
LGTM, applied

On 01/30/2018 07:30 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> In bch_mca_scan(), the return value should not be the number of freed btree
> nodes, but the number of pages of freed btree nodes.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/btree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81e8dc3..a5fbccc 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -718,7 +718,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
>   }
>  out:
>   mutex_unlock(&c->bucket_lock);
> - return freed;
> + return freed * c->btree_pages;
>  }
>  
>  static unsigned long bch_mca_count(struct shrinker *shrink,
> 



Re: [PATCH] bcache: fix incorrect sysfs output value of strip size

2018-03-05 Thread Michael Lyle
LGTM, applied.

On 02/10/2018 06:30 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Stripe size is shown as zero when there is no stripe in the backing device:
> [root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size
> 0.0k
> 
> Actually it should be 1T bytes (1 << 31 sectors), but in the sysfs
> interface stripe_size is converted from sectors to bytes by shifting
> left 9 bits, so the 32-bit variable overflows.
> 
> This patch changes the variable to a 64-bit type before shifting.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/sysfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index b418409..240cbec 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -170,7 +170,7 @@
>   sysfs_hprint(dirty_data,
>    bcache_dev_sectors_dirty(&dc->disk) << 9);
>  
> - sysfs_hprint(stripe_size,   dc->disk.stripe_size << 9);
> + sysfs_hprint(stripe_size, ((uint64_t)dc->disk.stripe_size) << 9);
>   var_printf(partial_stripes_expensive,   "%u");
>  
>   var_hprint(sequential_cutoff);
> 
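
The overflow Tang Junhui describes is easy to reproduce outside the kernel. A minimal sketch mirroring the before/after expressions (the function names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* stripe_size << 9 evaluated in 32-bit arithmetic wraps to zero for a
 * stripe size of 1 << 31 sectors, which is what sysfs printed as "0.0k". */
static uint64_t stripe_bytes_buggy(uint32_t stripe_size)
{
    return stripe_size << 9;            /* 32-bit shift, high bits lost */
}

/* Widening to 64 bits before the shift preserves the value. */
static uint64_t stripe_bytes_fixed(uint32_t stripe_size)
{
    return ((uint64_t)stripe_size) << 9;
}
```

For a stripe size of 1 << 31 sectors, the buggy form yields 0 while the fixed form yields 1 << 40 bytes (1 TiB).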



Re: [PATCH] bcache: don't attach backing with duplicate UUID

2018-03-05 Thread Michael Lyle
Hi Tang Junhui---

Thanks for your review.  I just sent it upstream (with your change) to Jens.

Mike

On 03/04/2018 05:07 PM, tang.jun...@zte.com.cn wrote:
> Hello Mike
> 
> I sent the email from my personal mailbox (110950...@qq.com), but it may
> have failed, so I am resending this email from my office mailbox. Below is
> the mail content I sent previously.
> 
> 
> 
> I am Tang Junhui(tang.jun...@zte.com.cn), This email comes from my
> personal mailbox, since I am not in office today.
> 
>>> From: Tang Junhui 
>>>
>>> Hello, Mike
>>>
>>> This patch looks good, but has some conflicts with this patch:
>>> bcache: fix for data collapse after re-attaching an attached device
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=73ac105be390c1de42a2f21643c9778a5e002930
>>> Could you modify your fix base on the previous patch?
> 
>> That doesn't make sense.  This patch was generated from a current tree
>> where it's applied on top of that: (It's based on next when it should
>> really be based on Linus's tree, but it doesn't matter for patch
>> application because there's no changes in next right now to bcache that
>> aren't in Linus's tree).
> 
> Originally, I did not mean merge conflicts, but logical code conflicts,
> since the previous patch adds a new input parameter set_uuid to
> bch_cached_dev_attach(): if set_uuid is not NULL, we use set_uuid as the
> cache set uuid; otherwise, we use dc->sb.set_uuid as the cache set uuid.
> 
> But now I have read your patch again, and realize that you did not use
> dc->sb.set_uuid, but dc->sb.uuid, to judge whether the device is a
> duplicate backing device, so it's OK for me.
> 
>> May I add your reviewed-by so I can send this (and your fix) upstream?
> Reviewed-by: Tang Junhui 
> 
> Thanks 
> Tang Junhui
> 
> Thanks
> Tang Junhui
> --
> 



[PATCH 1/2] bcache: fix crashes in duplicate cache device register

2018-03-05 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

The kernel crashed when registering a duplicate cache device; the call trace
is below:
[  417.643790] CPU: 1 PID: 16886 Comm: bcache-register Tainted: G   W  OE 4.15.5-amd64-preempt-sysrq-20171018 #2
[  417.643861] Hardware name: LENOVO 20ERCTO1WW/20ERCTO1WW, BIOS N1DET41W (1.15 ) 12/31/2015
[  417.643870] RIP: 0010:bdevname+0x13/0x1e
[  417.643876] RSP: 0018:a3aa9138fd38 EFLAGS: 00010282
[  417.643884] RAX:  RBX: 8c8f2f2f8000 RCX: d6701f8c7edf
[  417.643890] RDX: a3aa9138fd88 RSI: a3aa9138fd88 RDI: 0000
[  417.643895] RBP: a3aa9138fde0 R08: a3aa9138fae8 R09: 0001850e
[  417.643901] R10: 8c8eed34b271 R11: 8c8eed34b250 R12: 0000
[  417.643906] R13: d6701f78f940 R14: 8c8f38f8 R15: 8c8ea7d9
[  417.643913] FS:  7fde7e66f500() GS:8c8f6144() knlGS:
[  417.643919] CS:  0010 DS:  ES:  CR0: 80050033
[  417.643925] CR2: 0314 CR3: 0007e6fa0001 CR4: 003606e0
[  417.643931] DR0:  DR1:  DR2: 0000
[  417.643938] DR3:  DR6: fffe0ff0 DR7: 00000400
[  417.643946] Call Trace:
[  417.643978]  register_bcache+0x1117/0x1270 [bcache]
[  417.643994]  ? slab_pre_alloc_hook+0x15/0x3c
[  417.644001]  ? slab_post_alloc_hook.isra.44+0xa/0x1a
[  417.644013]  ? kernfs_fop_write+0xf6/0x138
[  417.644020]  kernfs_fop_write+0xf6/0x138
[  417.644031]  __vfs_write+0x31/0xcc
[  417.644043]  ? current_kernel_time64+0x10/0x36
[  417.644115]  ? __audit_syscall_entry+0xbf/0xe3
[  417.644124]  vfs_write+0xa5/0xe2
[  417.644133]  SyS_write+0x5c/0x9f
[  417.644144]  do_syscall_64+0x72/0x81
[  417.644161]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  417.644169] RIP: 0033:0x7fde7e1c1974
[  417.644175] RSP: 002b:7fff13009a38 EFLAGS: 0246 ORIG_RAX: 0001
[  417.644183] RAX: ffda RBX: 01658280 RCX: 7fde7e1c1974
[  417.644188] RDX: 000a RSI: 01658280 RDI: 0001
[  417.644193] RBP: 000a R08: 0003 R09: 0077
[  417.644198] R10: 089e R11: 0246 R12: 0001
[  417.644203] R13: 000a R14: 7fff R15: 
[  417.644213] Code: c7 c2 83 6f ee 98 be 20 00 00 00 48 89 df e8 6c 27 3b 00 48 89 d8 5b c3 0f 1f 44 00 00 48 8b 47 70 48 89 f2 48 8b bf 80 00 00 00 <8b> b0 14 03 00 00 e9 73 ff ff ff 0f 1f 44 00 00 48 8b 47 40 39
[  417.644302] RIP: bdevname+0x13/0x1e RSP: a3aa9138fd38
[  417.644306] CR2: 0314

When registering a duplicate cache device in register_cache(), after a
failure in register_cache_set(), bch_cache_release() is called, then bdev
is freed, so bdevname(bdev, name) crashed the kernel.

Since bch_cache_release() frees bdev, in this patch we make sure bdev is
freed if register_cache() fails, and do not free bdev again in
register_bcache() when register_cache() fails.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reported-by: Marc MERLIN <m...@merlins.org>
Tested-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: <sta...@vger.kernel.org>
---
 drivers/md/bcache/super.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 312895788036..9c141a8aaacc 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1204,7 +1204,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
return;
 err:
-   pr_notice("error opening %s: %s", bdevname(bdev, name), err);
+   pr_notice("error %s: %s", bdevname(bdev, name), err);
	bcache_device_stop(&dc->disk);
 }
 
@@ -1883,6 +1883,8 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
const char *err = NULL; /* must be set for any error case */
int ret = 0;
 
+   bdevname(bdev, name);
+
	memcpy(&ca->sb, sb, sizeof(struct cache_sb));
ca->bdev = bdev;
ca->bdev->bd_holder = ca;
@@ -1891,11 +1893,12 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
	bio_first_bvec_all(&ca->sb_bio)->bv_page = sb_page;
get_page(sb_page);
 
-   if (blk_queue_discard(bdev_get_queue(ca->bdev)))
+   if (blk_queue_discard(bdev_get_queue(bdev)))
	ca->discard = CACHE_DISCARD(&ca->sb);
 
ret = cache_alloc(ca);
if (ret != 0) {
+   blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
if (ret == -ENOMEM)
err = "cache_alloc(): -ENOMEM";
else
@@ -1918,14 +1921,14 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
goto out;
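
The rule behind this fix generalizes: snapshot any identifying information out of an object before an error path that may free it. A generic userspace sketch (none of these names are bcache code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A toy device object.  register_dev() mirrors the shape of the fixed
 * register_cache(): the name is copied out up front, so even when the
 * error path frees the object, the caller can still report which device
 * failed without touching freed memory. */
struct toy_dev {
    char name[32];
};

static int register_dev(struct toy_dev *d, int simulate_failure,
                        char *name_out)
{
    strcpy(name_out, d->name);  /* like the early bdevname() call */
    if (simulate_failure) {
        free(d);                /* error path owns and frees the object */
        return -1;              /* caller must use name_out, not d */
    }
    return 0;
}
```

After a failed registration the caller reports using the saved name, never the (possibly freed) object itself.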
  

[PATCH 2/2] bcache: don't attach backing with duplicate UUID

2018-03-05 Thread Michael Lyle
This can happen e.g. during disk cloning.

This is an incomplete fix: it does not catch duplicate UUIDs earlier
when things are still unattached.  It does not unregister the device.
Further changes to cope better with this are planned but conflict with
Coly's ongoing improvements to handling device errors.  In the meantime,
one can manually stop the device after this has happened.

Attempts to attach a duplicate device result in:

[  136.372404] loop: module loaded
[  136.424461] bcache: register_bdev() registered backing device loop0
[  136.424464] bcache: bch_cached_dev_attach() Tried to attach loop0 but duplicate UUID already attached

My test procedure is:

  dd if=/dev/sdb1 of=imgfile bs=1024 count=262144
  losetup -f imgfile

Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
Cc: <sta...@vger.kernel.org>
---
 drivers/md/bcache/super.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 9c141a8aaacc..5cace6892958 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -963,6 +963,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
uint32_t rtime = cpu_to_le32(get_seconds());
struct uuid_entry *u;
char buf[BDEVNAME_SIZE];
+   struct cached_dev *exist_dc, *t;
 
bdevname(dc->bdev, buf);
 
@@ -987,6 +988,16 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
return -EINVAL;
}
 
+   /* Check whether already attached */
+   list_for_each_entry_safe(exist_dc, t, &c->cached_devs, list) {
+   if (!memcmp(dc->sb.uuid, exist_dc->sb.uuid, 16)) {
+   pr_err("Tried to attach %s but duplicate UUID already attached",
+   buf);
+
+   return -EINVAL;
+   }
+   }
+
u = uuid_find(c, dc->sb.uuid);
 
if (u &&
-- 
2.14.1
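
The duplicate check itself reduces to a linear scan comparing 16-byte UUIDs, the same memcmp() test the patch performs over c->cached_devs. A standalone sketch with illustrative names:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UUID_LEN 16

/* Return nonzero if 'candidate' matches any of the 'n' already-attached
 * UUIDs; a simplified stand-in for walking the cached_devs list. */
static int uuid_already_attached(const unsigned char attached[][UUID_LEN],
                                 size_t n,
                                 const unsigned char *candidate)
{
    for (size_t i = 0; i < n; i++)
        if (!memcmp(attached[i], candidate, UUID_LEN))
            return 1;
    return 0;
}
```

A backing device whose UUID is already on the list is rejected with -EINVAL, as in the patch above.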



[PATCH 0/2] Duplicate UUID fixes for 4.16

2018-03-05 Thread Michael Lyle
Jens--

Sorry for a couple of last-minute changes.  Marc Merlin noticed an issue
with disk imaging where if multiple devices were present with the same
bcache UUID that a kernel panic could result.  Tang Junhui fixed this.
I found a related data corruption issue with duplicate backing devices.

Both are minimal, reviewed, and tested.  As always, thanks for your help.

Mike


Re: [PATCH] bcache: don't attach backing with duplicate UUID

2018-03-02 Thread Michael Lyle
On 03/02/2018 06:55 AM, Michael Lyle wrote:
> Hi Tang Junhui---
> 
> On 03/01/2018 09:45 PM, tang.jun...@zte.com.cn wrote:
>> From: Tang Junhui <tang.jun...@zte.com.cn>
>>
>> Hello, Mike
>>
>> This patch looks good, but has some conflicts with this patch:
>> bcache: fix for data collapse after re-attaching an attached device
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=73ac105be390c1de42a2f21643c9778a5e002930
>> Could you modify your fix base on the previous patch?
> 
> That doesn't make sense.  This patch was generated from a current tree
> where it's applied on top of that: (It's based on next when it should
> really be based on Linus's tree, but it doesn't matter for patch
> application because there's no changes in next right now to bcache that
> aren't in Linus's tree).
> 
> https://github.com/mlyle/linux/commit/502c38831b42100ac5d3319c792a66c7e01357ae
> 
> May I add your reviewed-by so I can send this (and your fix) upstream?
> 
> Mike

Just to confirm this all works---

mlyle@midnight:~/trees/linux$ git checkout linus/master
Note: checking out 'linus/master'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b 

HEAD is now at 5d60e057d127... Merge tag 'drm-fixes-for-v4.16-rc4' of
git://people.freedesktop.org/~airlied/linux
mlyle@midnight:~/trees/linux$ git am
0001-bcache-fix-crashes-in-duplicate-cache-device-registe.patch
Applying: bcache: fix crashes in duplicate cache device register
mlyle@midnight:~/trees/linux$ git am
0001-bcache-don-t-attach-backing-with-duplicate-UUID.patch
Applying: bcache: don't attach backing with duplicate UUID

Mike


Re: [PATCH] bcache: don't attach backing with duplicate UUID

2018-03-02 Thread Michael Lyle
Hi Tang Junhui---

On 03/01/2018 09:45 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Hello, Mike
> 
> This patch looks good, but has some conflicts with this patch:
> bcache: fix for data collapse after re-attaching an attached device
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=73ac105be390c1de42a2f21643c9778a5e002930
> Could you modify your fix base on the previous patch?

That doesn't make sense.  This patch was generated from a current tree
where it's applied on top of that: (It's based on next when it should
really be based on Linus's tree, but it doesn't matter for patch
application because there's no changes in next right now to bcache that
aren't in Linus's tree).

https://github.com/mlyle/linux/commit/502c38831b42100ac5d3319c792a66c7e01357ae

May I add your reviewed-by so I can send this (and your fix) upstream?

Mike


[PATCH] bcache: don't attach backing with duplicate UUID

2018-03-01 Thread Michael Lyle
This can happen e.g. during disk cloning.

This is an incomplete fix: it does not catch duplicate UUIDs earlier
when things are still unattached.  It does not unregister the device.
Further changes to cope better with this are planned but conflict with
Coly's ongoing improvements to handling device errors.  In the meantime,
one can manually stop the device after this has happened.

Attempts to attach a duplicate device result in:

[  136.372404] loop: module loaded
[  136.424461] bcache: register_bdev() registered backing device loop0
[  136.424464] bcache: bch_cached_dev_attach() Tried to attach loop0 but duplicate UUID already attached

My test procedure is:

  dd if=/dev/sdb1 of=imgfile bs=1024 count=262144
  losetup -f imgfile

Signed-off-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/super.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 9c141a8aaacc..5cace6892958 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -963,6 +963,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
uint32_t rtime = cpu_to_le32(get_seconds());
struct uuid_entry *u;
char buf[BDEVNAME_SIZE];
+   struct cached_dev *exist_dc, *t;
 
bdevname(dc->bdev, buf);
 
@@ -987,6 +988,16 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
return -EINVAL;
}
 
+   /* Check whether already attached */
+   list_for_each_entry_safe(exist_dc, t, &c->cached_devs, list) {
+   if (!memcmp(dc->sb.uuid, exist_dc->sb.uuid, 16)) {
+   pr_err("Tried to attach %s but duplicate UUID already attached",
+   buf);
+
+   return -EINVAL;
+   }
+   }
+
u = uuid_find(c, dc->sb.uuid);
 
if (u &&
-- 
2.14.1



Re: [PATCH 06/11] bcache: Use the blk_queue_flag_{set,clear}() functions

2018-02-28 Thread Michael Lyle
LGTM

On Wed, Feb 28, 2018 at 11:28 AM, Bart Van Assche
<bart.vanass...@wdc.com> wrote:
> Use the blk_queue_flag_{set,clear}() functions instead of open-coding
> these.
>
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Michael Lyle <ml...@lyle.org>
> Cc: Kent Overstreet <kent.overstr...@gmail.com>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Hannes Reinecke <h...@suse.de>
> Cc: Johannes Thumshirn <jthumsh...@suse.de>
> Cc: Ming Lei <ming@redhat.com>
> ---
>  drivers/md/bcache/super.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 312895788036..047e30cb93dc 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -833,9 +833,9 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
> q->limits.io_min= block_size;
> q->limits.logical_block_size= block_size;
> q->limits.physical_block_size   = block_size;
> -   set_bit(QUEUE_FLAG_NONROT,  &d->disk->queue->queue_flags);
> -   clear_bit(QUEUE_FLAG_ADD_RANDOM, &d->disk->queue->queue_flags);
> -   set_bit(QUEUE_FLAG_DISCARD, &d->disk->queue->queue_flags);
> +   blk_queue_flag_set(QUEUE_FLAG_NONROT, d->disk->queue);
> +   blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, d->disk->queue);
> +   blk_queue_flag_set(QUEUE_FLAG_DISCARD, d->disk->queue);
>
> blk_queue_write_cache(q, true, true);
>
> --
> 2.16.2
>


Re: [PATCH v7 0/9] bcache: device failure handling improvement

2018-02-27 Thread Michael Lyle
On 02/27/2018 10:33 AM, Michael Lyle wrote:
> On 02/27/2018 10:29 AM, Michael Lyle wrote:
>> Hi Coly Li--
>>
>> On 02/27/2018 08:55 AM, Coly Li wrote:
>>> Hi maintainers and folks,
>>>
>>> This patch set tries to improve bcache device failure handling, includes
>>> cache device and backing device failures.
>>
>> I have applied 1, 2, 4 & 6 from this series to my 4.17 bcache-for-next
>> for testing.
> 
> Correction: I meant 1, 2, 5 & 6.

Sorry everyone for churn.  Present applied state is 1-6:

#6 6a2fb6357eae bcache: fix inaccurate io state for detached bcache devices
#5 6aee8a57d141 bcache: add stop_when_cache_set_failed option to backing
device
#4 3a9c638b1950 bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
#3 1b355007b5ae bcache: stop dc->writeback_rate_update properly
#2 70120583e4b5 bcache: quit dc->writeback_thread when
BCACHE_DEV_DETACHING is set
#1 b8f557d64115 bcache: fix cached_dev->count usage for
bch_cache_set_error()

There are some checkpatch warnings relating to non-preferred style
(splitting strings to hit the 80-character limit) but I will fix them up
myself later.  This subset is through review and passes initial testing
(will be testing more / writing a test suite).

Mike


Re: [PATCH v7 3/9] bcache: stop dc->writeback_rate_update properly

2018-02-27 Thread Michael Lyle
OK, I have convinced myself this is safe.

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/27/2018 08:55 AM, Coly Li wrote:
> struct delayed_work writeback_rate_update in struct cached_dev is a delayed
> work item which calls update_writeback_rate() periodically (the interval is
> defined by dc->writeback_rate_update_seconds).
> 
> When a metadata I/O error happens on cache device, bcache error handling
> routine bch_cache_set_error() will call bch_cache_set_unregister() to
> retire whole cache set. On the unregister code path, this delayed work is
> stopped by calling cancel_delayed_work_sync(&dc->writeback_rate_update).
> 
> dc->writeback_rate_update is a delayed work unlike the others in bcache:
> in its routine update_writeback_rate(), the delayed work re-arms
> itself. That means when cancel_delayed_work_sync() returns, this delayed
> work can still be executed after several seconds defined by
> dc->writeback_rate_update_seconds.
> 
> The problem is, after cancel_delayed_work_sync() returns, the cache set
> unregister code path will continue and release memory of struct cache set.
> Then the delayed work is scheduled to run, __update_writeback_rate()
> will reference the already released cache_set memory, and trigger a NULL
> pointer dereference fault.
> 
> This patch introduces two more bcache device flags,
> - BCACHE_DEV_WB_RUNNING
>   bit set:  bcache device is in writeback mode and running, it is OK for
> dc->writeback_rate_update to re-arm itself.
>   bit clear: bcache device is trying to stop dc->writeback_rate_update,
> this delayed work should not re-arm itself and quit.
> - BCACHE_DEV_RATE_DW_RUNNING
>   bit set:  routine update_writeback_rate() is executing.
>   bit clear: routine update_writeback_rate() quits.
> 
> This patch also adds a function cancel_writeback_rate_update_dwork() to
> wait for dc->writeback_rate_update quits before cancel it by calling
> cancel_delayed_work_sync(). In order to avoid a deadlock by unexpected
> quit dc->writeback_rate_update, after time_out seconds this function will
> give up and continue to call cancel_delayed_work_sync().
> 
> And here I explain how this patch stops self re-armed delayed work properly
> with the above stuffs.
> 
> update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning
> and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling
> cancel_writeback_rate_update_dwork() clear flag BCACHE_DEV_WB_RUNNING.
> 
> Before calling cancel_delayed_work_sync(), wait until flag
> BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling
> cancel_delayed_work_sync(), dc->writeback_rate_update must be already re-
> armed, or quit by seeing BCACHE_DEV_WB_RUNNING cleared. In both cases
> delayed work routine update_writeback_rate() won't be executed after
> cancel_delayed_work_sync() returns.
> 
> Inside update_writeback_rate() before calling schedule_delayed_work(), flag
> BCACHE_DEV_WB_RUNNING is checked before. If this flag is cleared, it means
> someone is about to stop the delayed work. Because flag
> BCACHE_DEV_RATE_DW_RUNNING is set already and cancel_delayed_work_sync()
> has to wait for this flag to be cleared, we don't need to worry about race
> condition here.
> 
> If update_writeback_rate() is scheduled to run after checking
> BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync()
> in cancel_writeback_rate_update_dwork(), it is also safe. Because at this
> moment BCACHE_DEV_WB_RUNNING is cleared with memory barrier. As I mentioned
> previously, update_writeback_rate() will see BCACHE_DEV_WB_RUNNING is clear
> and quit immediately.
> 
> Because there are more dependencies inside update_writeback_rate() on struct
> cache_set memory, dc->writeback_rate_update is not a simple self re-arm
> delayed work. After trying many different methods (e.g. hold dc->count, or
> use locks), this is the only way I can find which works to properly stop
> dc->writeback_rate_update delayed work.
> 
> Changelog:
> v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING
> to bit index, for test_bit().
> v2: Try to fix the race issue which is pointed out by Junhui.
> v1: The initial version for review
> 
> Signed-off-by: Coly Li <col...@suse.de>
> Reviewed-by: Junhui Tang <tang.jun...@zte.com.cn>
> Cc: Michael Lyle <ml...@lyle.org>
> Cc: Hannes Reinecke <h...@suse.com>
> ---
>  drivers/md/bcache/bcache.h|  9 +
>  drivers/md/bcache/super.c | 39 +++
>  drivers/md/bcache/sysfs.c |  3 ++-
>  drivers/md/bcache/writeback.c | 29 -
>  4 files changed, 70 insertions(+), 10 deletions(-)

Re: [PATCH v7 0/9] bcache: device failure handling improvement

2018-02-27 Thread Michael Lyle
On 02/27/2018 10:29 AM, Michael Lyle wrote:
> Hi Coly Li--
> 
> On 02/27/2018 08:55 AM, Coly Li wrote:
>> Hi maintainers and folks,
>>
>> This patch set tries to improve bcache device failure handling, includes
>> cache device and backing device failures.
> 
> I have applied 1, 2, 4 & 6 from this series to my 4.17 bcache-for-next
> for testing.

Correction: I meant 1, 2, 5 & 6.

> 
> Mike
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



Re: [PATCH v7 0/9] bcache: device failure handling improvement

2018-02-27 Thread Michael Lyle
Hi Coly Li--

On 02/27/2018 08:55 AM, Coly Li wrote:
> Hi maintainers and folks,
> 
> This patch set tries to improve bcache device failure handling, includes
> cache device and backing device failures.

I have applied 1, 2, 4 & 6 from this series to my 4.17 bcache-for-next
for testing.

Mike


Re: [PATCH v7 9/9] bcache: stop bcache device when backing device is offline

2018-02-27 Thread Michael Lyle
Hi Coly Li--

Just a couple of questions.

On 02/27/2018 08:55 AM, Coly Li wrote:
> +#define BACKING_DEV_OFFLINE_TIMEOUT 5

I think you wanted this to be 30 (per commit message)-- was this turned
down for testing or deliberate?

> +static int cached_dev_status_update(void *arg)
> +{
> + struct cached_dev *dc = arg;
> + struct request_queue *q;
> + char buf[BDEVNAME_SIZE];
> +
> + /*
> +  * If this delayed worker is stopping outside, directly quit here.
> +  * dc->io_disable might be set via sysfs interface, so check it
> +  * here too.
> +  */
> + while (!kthread_should_stop() && !dc->io_disable) {
> + q = bdev_get_queue(dc->bdev);
> + if (blk_queue_dying(q))

I am not sure-- is this the correct test to know if the bdev is offline?
 It's very sparsely used outside of the core block system.  (Is there
any scenario where the queue comes back?  Can the device be "offline"
and still have a live queue?  Another approach might be to set
io_disable when there's an extended period without a successful IO, and
it might be possible to do this with atomics without a thread).

This approach can work if you're certain that blk_queue_dying is an
appropriate test for all failure scenarios (I just don't know).

Mike


Re: [PATCH v7 4/9] bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags

2018-02-27 Thread Michael Lyle
Hi Coly Li--

On 02/27/2018 08:55 AM, Coly Li wrote:
> When too many I/Os failed on cache device, bch_cache_set_error() is called
> in the error handling code path to retire whole problematic cache set. If
> new I/O requests continue to come and take refcount dc->count, the cache
> set won't be retired immediately, this is a problem.
> 
> Further more, there are several kernel thread and self-armed kernel work
> may still running after bch_cache_set_error() is called. It needs to wait
> quite a while for them to stop, or they won't stop at all. They also
> prevent the cache set from being retired.

It's too bad this is necessary-- I wish the IO layer could latch error
for us in some kind of meaningful way instead of us having to do it
ourselves (and for filesystems, etc, having to each do similar things to
prevent just continuously hitting IO timeouts).  That said, the code
looks good.

Reviewed-by: Michael Lyle <ml...@lyle.org>


Re: [PATCH v7 1/9] bcache: fix cached_dev->count usage for bch_cache_set_error()

2018-02-27 Thread Michael Lyle
Hi Coly Li---

Thanks for this.  I've been uncomfortable with the interaction between
the dirty status and the refcount (even aside from this issue), and I
believe you've resolved it.  I'm sorry for the slow review-- it's taken
me some time to convince myself that this is safe.

I'm getting closer to be able to apply these for 4.17.

Reviewed-by: Michael Lyle <ml...@lyle.org>

Mike

On 02/27/2018 08:55 AM, Coly Li wrote:
> When bcache metadata I/O fails, bcache will call bch_cache_set_error()
> to retire the whole cache set. The expected behavior to retire a cache
> set is to unregister the cache set, and unregister all backing device
> attached to this cache set, then remove sysfs entries of the cache set
> and all attached backing devices, finally release memory of structs
> cache_set, cache, cached_dev and bcache_device.
> 
> In my testing when journal I/O failure triggered by disconnected cache
> device, sometimes the cache set cannot be retired, and its sysfs
> entry /sys/fs/bcache/ still exists and the backing device also
> references it. This is not expected behavior.
> 
> When metadata I/O fails, the call sequence to retire the whole cache set is,
> bch_cache_set_error()
> bch_cache_set_unregister()
> bch_cache_set_stop()
> __cache_set_unregister() <- called as callback by calling
> closure_queue(&c->caching)
> cache_set_flush()<- called as a callback when refcount
> of cache_set->caching is 0
> cache_set_free() <- called as a callback when refcount
> of catch_set->cl is 0
> bch_cache_set_release()  <- called as a callback when refcount
> of catch_set->kobj is 0
> 
> I find if kernel thread bch_writeback_thread() quits while-loop when
> kthread_should_stop() is true and searched_full_index is false, closure
> callback cache_set_flush() set by continue_at() will never be called. The
> result is, bcache fails to retire whole cache set.
> 
> cache_set_flush() will be called when refcount of closure c->caching is 0,
> and in function bcache_device_detach() refcount of closure c->caching is
> released to 0 by closure_put(). In metadata error code path, function
> bcache_device_detach() is called by cached_dev_detach_finish(). This is a
> callback routine being called when cached_dev->count is 0. This refcount
> is decreased by cached_dev_put().
> 
> The above dependence indicates, cache_set_flush() will be called when
> refcount of cache_set->cl is 0, and refcount of cache_set->cl to be 0
> when refcount of cache_dev->count is 0.
> 
> The reason why sometimes cache_dev->count is not 0 (when metadata I/O fails
> and bch_cache_set_error() called) is, in bch_writeback_thread(), refcount
> of cache_dev is not decreased properly.
> 
> In bch_writeback_thread(), cached_dev_put() is called only when
> searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a
> there is no dirty data on cache. In most of run time it is correct, but
> when bch_writeback_thread() quits the while-loop while cache is still
> dirty, current code forget to call cached_dev_put() before this kernel
> thread exits. This is why sometimes cache_set_flush() is not executed and
> cache set fails to be retired.
> 
> The reason to call cached_dev_put() in bch_writeback_rate() is, when the
> cache device changes from clean to dirty, cached_dev_get() is called, to
> make sure during writeback operations both backing and cache devices
> won't be released.
> 
> Adding following code in bch_writeback_thread() does not work,
>static int bch_writeback_thread(void *arg)
> }
> 
> +   if (atomic_read(&dc->has_dirty))
> +   cached_dev_put()
> +
> return 0;
>  }
> because the writeback kernel thread can be woken up and started via sysfs entry:
> echo 1 > /sys/block/bcache/bcache/writeback_running
> It is difficult to check whether backing device is dirty without race and
> extra lock. So the above modification will introduce potential refcount
> underflow in some conditions.
> 
> The correct fix is, to take cached dev refcount when creating the kernel
> thread, and put it before the kernel thread exits. Then bcache does not
> need to take a cached dev refcount when cache turns from clean to dirty,
> or to put a cached dev refcount when cache turns from dirty to clean. The
> writeback kernel thread is always safe to reference data structure from
> cache set, cache and cached device (because a refcount of cache device is
> taken for it already), and no matter the k

[PATCH 1/2] bcache: correct flash only vols (check all uuids)

2018-02-27 Thread Michael Lyle
From: Coly Li <col...@suse.de>

Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by
devices_max_used") adds c->devices_max_used to reduce iteration of
c->uuids elements, this value is updated in bcache_device_attach().

But for flash only volume, when calling flash_devs_run(), the function
bcache_device_attach() is not called yet and c->devices_max_used is not
updated. The unexpected result is, the flash only volume won't be run
by flash_devs_run().

This patch fixes the issue by iterate all c->uuids elements in
flash_devs_run(). c->devices_max_used will be updated properly when
bcache_device_attach() gets called.

[mlyle: commit subject edited for character limit]

Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by devices_max_used")
Reported-by: Tang Junhui <tang.jun...@zte.com.cn>
Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 312895788036..4d1d8dfb2d2a 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1274,7 +1274,7 @@ static int flash_devs_run(struct cache_set *c)
struct uuid_entry *u;
 
for (u = c->uuids;
-    u < c->uuids + c->devices_max_used && !ret;
+    u < c->uuids + c->nr_uuids && !ret;
 u++)
if (UUID_FLASH_ONLY(u))
ret = flash_dev_run(c, u);
-- 
2.14.1



[PATCH 0/2] Bcache fixes for 4.16

2018-02-27 Thread Michael Lyle
Hi Jens,

Please pick up these two critical fixes to bcache by Tang Junhui.
They're both one-liners and have been reviewed and tested.

The first corrects a regression when flash-only volumes are present
that was introduced in 4.16-RC1.  The second adjusts bio refcount
and completion behavior to work with md RAID5 backing.

Thanks,

Mike


[PATCH 2/2] bcache: fix kcrashes with fio in RAID5 backend dev

2018-02-27 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

Kernel crashed when running fio on a RAID5 backend bcache device; the call
trace is below:
[  440.012034] kernel BUG at block/blk-ioc.c:146!
[  440.012696] invalid opcode:  [#1] SMP NOPTI
[  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
[  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16
/2015
[  440.028615] RIP: 0010:put_io_context+0x8b/0x90
[  440.029246] RSP: 0018:a8c882b43af8 EFLAGS: 00010246
[  440.029990] RAX:  RBX: a8c88294fca0 RCX: 00
0f4240
[  440.031006] RDX: 0004 RSI: 0286 RDI: a8c882
94fca0
[  440.032030] RBP: a8c882b43b10 R08: 0003 R09: 949cb8
0c1700
[  440.033206] R10: 0104 R11: b71c R12: 000
01000
[  440.034222] R13:  R14: 949cad84db70 R15: 949cb11
bd1e0
[  440.035239] FS:  () GS:949cba28() knlGS:

[  440.060190] CS:  0010 DS:  ES:  CR0: 80050033
[  440.084967] CR2: 7ff0493ef000 CR3: 0002f1e0a002 CR4: 001
606e0
[  440.110498] Call Trace:
[  440.135443]  bio_disassociate_task+0x1b/0x60
[  440.160355]  bio_free+0x1b/0x60
[  440.184666]  bio_put+0x23/0x30
[  440.208272]  search_free+0x23/0x40 [bcache]
[  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
[  440.254468]  closure_put+0xb6/0xd0 [bcache]
[  440.277087]  request_endio+0x30/0x40 [bcache]
[  440.298703]  bio_endio+0xa1/0x120
[  440.319644]  handle_stripe+0x418/0x2270 [raid456]
[  440.340614]  ? load_balance+0x17b/0x9c0
[  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
[  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
[  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
[  440.419193]  ? schedule+0x36/0x80
[  440.437932]  ? schedule_timeout+0x1d2/0x2f0
[  440.456136]  md_thread+0x122/0x150
[  440.473687]  ? wait_woken+0x80/0x80
[  440.491411]  kthread+0x102/0x140
[  440.508636]  ? find_pers+0x70/0x70
[  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
[  440.541791]  ret_from_fork+0x35/0x40
[  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2
48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b
0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
[  440.610020] RIP: put_io_context+0x8b/0x90 RSP: a8c882b43af8
[  440.628575] ---[ end trace a1fd79d85643a73e ]--

All the crashes happened when a bypass IO came in; in such a scenario
s->iop.bio points to s->orig_bio. In search_free(), it finishes
s->orig_bio by calling bio_complete(), and after that s->iop.bio becomes
invalid, so the kernel would crash when calling bio_put(). Maybe it's the
upper layer's fault, since the bio should not be freed before we call
bio_put(), but we had better call bio_put() first before calling
bio_complete() to notify the upper layer that this bio has ended.

This patch moves bio_complete() under bio_put() to avoid kernel crash.

[mlyle: fixed commit subject for character limits]

Reported-by: Matthias Ferdinand <bca...@mfedv.net>
Tested-by: Matthias Ferdinand <bca...@mfedv.net>
Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 1a46b41dac70..6422846b546e 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -659,11 +659,11 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio)
 static void search_free(struct closure *cl)
 {
struct search *s = container_of(cl, struct search, cl);
-   bio_complete(s);
 
if (s->iop.bio)
bio_put(s->iop.bio);
 
+   bio_complete(s);
closure_debug_destroy(cl);
mempool_free(s, s->d->c->search);
 }
-- 
2.14.1



Re: [PATCH] bcache: check all uuid elements when start flash only volumes

2018-02-26 Thread Michael Lyle
Reviewed-by: Michael Lyle <ml...@lyle.org>

I have this and the other fix in the queue for testing tomorrow and will
pass along a changeset to Jens if all looks good.  Thanks!

Mike

On 02/26/2018 08:15 AM, Coly Li wrote:
> On 27/02/2018 12:10 AM, Coly Li wrote:
>> Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by
>> devices_max_used") adds c->devices_max_used to reduce iteration of
>> c->uuids elements, this value is updated in bcache_device_attach().
>>
>> But for flash only volume, when calling flash_devs_run(), the function
>> bcache_device_attach() is not called yet and c->devices_max_used is not
>> updated. The unexpected result is, the flash only volume won't be run
>> by flash_devs_run().
>>
>> This patch fixes the issue by iterate all c->uuids elements in
>> flash_devs_run(). c->devices_max_used will be updated properly when
>> bcache_device_attach() gets called.
>>
>> Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by 
>> devices_max_used")
>> Reported-by: Tang Junhui <tang.jun...@zte.com.cn>
>> Signed-off-by: Coly Li <col...@suse.de>
>> Cc: Michael Lyle <ml...@lyle.org>
>> ---
>>  drivers/md/bcache/super.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 312895788036..4d1d8dfb2d2a 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1274,7 +1274,7 @@ static int flash_devs_run(struct cache_set *c)
>>  struct uuid_entry *u;
>>  
>>  for (u = c->uuids;
>> - u < c->uuids + c->devices_max_used && !ret;
>> + u < c->uuids + c->nr_uuids && !ret;
>>   u++)
>>  if (UUID_FLASH_ONLY(u))
>>  ret = flash_dev_run(c, u);
>>
> 
> Hi Jens and Michael,
> 
> Could you please have a look and pick this fix into 4.16 ? Commit
> 2831231d4c3f ("bcache: reduce cache_set devices iteration by
> devices_max_used") is just merged in 4.16-rc1, so I don't CC stable for
> this fix.
> 
> And I still need a peer code reviewer on this patch, any Reviewed-by is
> welcome.
> 
> Thanks to Junhui Tang for reporting this issue today.
> 
> Coly Li
> 
> 



[PATCH 7/8] bcache: return attach error when no cache set exist

2018-02-07 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

I attached a back-end device to a cache set, but the cache set was not
registered yet; the back-end device did not attach successfully, and no
error was returned:
[root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
[root]#

In sysfs_attach(), the return value "v" is initialized to "size" at the
beginning, and if no cache set exists in bch_cache_sets, the value of "v"
never changes and is returned to sysfs; sysfs regards it as success
since "size" is a positive number.

This patch fixes this issue by assigning "v" with "-ENOENT" in the
initialization.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/sysfs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 4a6a697e1680..a7552f509f50 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -198,7 +198,7 @@ STORE(__cached_dev)
 {
struct cached_dev *dc = container_of(kobj, struct cached_dev,
 disk.kobj);
-   ssize_t v = size;
+   ssize_t v;
struct cache_set *c;
struct kobj_uevent_env *env;
 
@@ -275,6 +275,7 @@ STORE(__cached_dev)
if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16)
return -EINVAL;
 
+   v = -ENOENT;
list_for_each_entry(c, &bch_cache_sets, list) {
v = bch_cached_dev_attach(dc, c);
if (!v)
@@ -282,7 +283,7 @@ STORE(__cached_dev)
}
 
pr_err("Can't attach %s: cache set not found", buf);
-   size = v;
+   return v;
}
 
if (attr == &sysfs_detach && dc->disk.c)
-- 
2.14.1



[PATCH 5/8] bcache: fix for allocator and register thread race

2018-02-07 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

After a long time running random small-IO writes,
I rebooted the machine, and after the machine powered on,
I found bcache got stuck; the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[] closure_sync+0x25/0x90 [bcache]
[] bch_journal+0x118/0x2b0 [bcache]
[] bch_journal_meta+0x47/0x70 [bcache]
[] bch_prio_write+0x237/0x340 [bcache]
[] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[] kthread+0xcf/0xe0
[] ret_from_fork+0x58/0x90
[] 0x
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[] bch_btree_insert+0xf1/0x170 [bcache]
[] bch_journal_replay+0x13f/0x230 [bcache]
[] run_cache_set+0x79a/0x7c2 [bcache]
[] register_bcache+0xd48/0x1310 [bcache]
[] kobj_attr_store+0xf/0x20
[] sysfs_write_file+0xc6/0x140
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.

I rebooted the machine several times; the issue always
exists on this machine.

I debugged the code, and found the call trace as below:
register_bcache()
   ==>run_cache_set()
  ==>bch_journal_replay()
 ==>bch_btree_insert()
==>__bch_btree_map_nodes()
   ==>btree_insert_fn()
  ==>btree_split() //node need split
 ==>btree_check_reserve()
In btree_check_reserve(), it will check if there are enough buckets
of RESERVE_BTREE type; since the allocator thread has not started working yet,
no buckets of RESERVE_BTREE type are allocated, so the register thread
waits on c->btree_cache_wait, and goes to sleep.

Then the allocator thread was initialized; the call trace is below:
bch_allocator_thread()
==>bch_prio_write()
   ==>bch_journal_meta()
  ==>bch_journal()
 ==>journal_wait_for_write()
In journal_wait_for_write(), it will check if the journal is full by
journal_full(), but the long time of random small-IO writing
caused the exhaustion of journal buckets (journal.blocks_free=0).
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c->journal.wait until the btree node writing is
over or there is already some journal bucket space; then the
allocator thread goes to sleep. But in btree_flush_write(), since
bch_journal_replay() is not finished, no btree nodes have journal
(the condition "if (btree_current_write(b)->journal)" is never satisfied),
so we get no btree node to flush, no journal bucket is released,
and the allocator sleeps all the time.

Through the above analysis, we can see that:
1) The register thread waits for the allocator thread to allocate buckets of
   RESERVE_BTREE type;
2) The allocator thread waits for the register thread to replay the journal, so
   it can flush btree nodes and get a journal bucket;
   then they both get stuck waiting for each other.

Hua Rui provided a patch for me that allocates some buckets of
RESERVE_BTREE type in advance, so the register thread can get a bucket
when a btree node splits and does not need to wait for the allocator
thread. I tested it: it had an effect, and the register thread ran a step
further, but finally still got stuck. The reason is that only 8 buckets
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree node splits, only 4 buckets of RESERVE_BTREE type were left;
then btree_check_reserve() was not satisfied anymore, so it went to sleep
again, and at the same time the allocator thread did not flush enough btree
nodes to release a journal bucket, so they both got stuck again.

So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how much is enough?  By experience and test, I think it should be
as much as journal buckets. Then I modify the code as this patch,
and test in the machine, and it works.

This patch is modified based on Hua Rui's patch, and allocates more buckets
of RESERVE_BTREE type in advance, to avoid the register thread and allocator
thread waiting for each other.

[patch v2] ca->sb.njournal_buckets would be 0 in the first time after
cache creation, and no journal exists, so just 8 btree buckets is OK.

Signed-off-by: Hua Rui <huarui@gmail.com>
Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/btree.c |  9 ++---
 drivers/md/bcache/super.c | 13 -
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index bf3a48aa9a9a..fad9fe8817eb 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1869,14 +1869,17 @@ void bch_initial_gc_finish(struct cache_set *c)
 */
for_each_cache(ca, c, i) {
for_each_bucket(b, ca) {
-   if (fifo_full(&ca->free[RESERVE_PRIO]))
+   if (fifo_full(&ca->free[RESERVE_PRIO]) &&
+

[PATCH 8/8] bcache: fix for data collapse after re-attaching an attached device

2018-02-07 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

back-end device sdm has already attached a cache_set with ID
f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with
another cache set, and it returns with an error:
[root]# cd /sys/block/sdm/bcache
[root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
-bash: echo: write error: Invalid argument

After that, execute a command to modify the label of bcache
device:
[root]# echo data_disk1 > label

Then we reboot the system, when the system power on, the back-end
device can not attach to cache_set, a messages show in the log:
Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
bch_cached_dev_attach() couldn't find uuid for sdm in set

In sysfs_attach(), dc->sb.set_uuid was assigned the value
input through sysfs, no matter whether bch_cached_dev_attach()
succeeded or not. For example, if the back-end
device has already attached to an cache set, bch_cached_dev_attach()
would fail, but dc->sb.set_uuid was changed. Then modify the
label of bcache device, it will call bch_write_bdev_super(),
which would write the dc->sb.set_uuid to the super block, so we
record a wrong cache set ID in the super block, after the system
reboot, the cache set couldn't find the uuid of the back-end
device, so the bcache device couldn't exist and use any more.

In this patch, we don't assign the cache set ID to dc->sb.set_uuid
in sysfs_attach() directly, but pass it into bch_cached_dev_attach(),
and assign dc->sb.set_uuid the cache set ID after the back-end
device has attached to the cache set successfully.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/super.c  | 10 ++
 drivers/md/bcache/sysfs.c  |  6 --
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index b8c2e1bef1f1..12e5197f186c 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -924,7 +924,7 @@ void bcache_write_super(struct cache_set *);
 
 int bch_flash_dev_create(struct cache_set *c, uint64_t size);
 
-int bch_cached_dev_attach(struct cached_dev *, struct cache_set *);
+int bch_cached_dev_attach(struct cached_dev *, struct cache_set *, uint8_t *);
 void bch_cached_dev_detach(struct cached_dev *);
 void bch_cached_dev_run(struct cached_dev *);
 void bcache_device_stop(struct bcache_device *);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a2ad37a8afc0..312895788036 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -957,7 +957,8 @@ void bch_cached_dev_detach(struct cached_dev *dc)
cached_dev_put(dc);
 }
 
-int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
+int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
+ uint8_t *set_uuid)
 {
uint32_t rtime = cpu_to_le32(get_seconds());
struct uuid_entry *u;
@@ -965,7 +966,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
 
bdevname(dc->bdev, buf);
 
-   if (memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16))
+   if ((set_uuid && memcmp(set_uuid, c->sb.set_uuid, 16)) ||
+   (!set_uuid && memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16)))
return -ENOENT;
 
if (dc->disk.c) {
@@ -1194,7 +1196,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 	list_add(&dc->list, &uncached_devices);
 	list_for_each_entry(c, &bch_cache_sets, list)
-		bch_cached_dev_attach(dc, c);
+		bch_cached_dev_attach(dc, c, NULL);
 
 	if (BDEV_STATE(&dc->sb) == BDEV_STATE_NONE ||
 	    BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
@@ -1716,7 +1718,7 @@ static void run_cache_set(struct cache_set *c)
bcache_write_super(c);
 
 	list_for_each_entry_safe(dc, t, &uncached_devices, list)
-   bch_cached_dev_attach(dc, c);
+   bch_cached_dev_attach(dc, c, NULL);
 
flash_devs_run(c);
 
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index a7552f509f50..78cd7bd50fdd 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -272,12 +272,14 @@ STORE(__cached_dev)
}
 
 	if (attr == &sysfs_attach) {
-   if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16)
+   uint8_t set_uuid[16];
+
+   if (bch_parse_uuid(buf, set_uuid) < 16)
return -EINVAL;
 
v = -ENOENT;
 		list_for_each_entry(c, &bch_cache_sets, list) {
-   v = bch_cached_dev_attach(dc, c);
+   v = bch_cached_dev_attach(dc, c, set_uuid);
if (!v)
return size;
}
-- 
2.14.1



[PATCH 4/8] bcache: set error_limit correctly

2018-02-07 Thread Michael Lyle
From: Coly Li <col...@suse.de>

Struct cache uses io_errors for two purposes,
- Error decay: when cache set error_decay is set, io_errors is used to
  generate a small piece of delay when I/O error happens.
- I/O errors counter: in order to generate big enough value for error
  decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a
  IO_ERROR_SHIFT).

In function bch_count_io_errors(), if I/O errors counter reaches cache set
error limit, bch_cache_set_error() will be called to retire the whole cache
set. But current code is problematic when checking the error limit, see the
following code piece from bch_count_io_errors(),

 90 if (error) {
 91 char buf[BDEVNAME_SIZE];
 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
 93 &ca->io_errors);
 94 errors >>= IO_ERROR_SHIFT;
 95
 96 if (errors < ca->set->error_limit)
 97 pr_err("%s: IO error on %s, recovering",
 98bdevname(ca->bdev, buf), m);
 99 else
100 bch_cache_set_error(ca->set,
101 "%s: too many IO errors %s",
102 bdevname(ca->bdev, buf), m);
103 }

At line 94, errors is right shifted by IO_ERROR_SHIFT bits, so it is the
real error counter compared at line 96. But ca->set->error_limit is
initialized with an amplified value in bch_cache_set_alloc(),
1545 c->error_limit  = 8 << IO_ERROR_SHIFT;

It means that by default, in bch_count_io_errors(), bch_cache_set_error()
won't be called to retire the problematic cache device before 8<<20 errors
have happened. If the average request size is 64KB, it means bcache won't
handle the failed device until 512GB of data has been requested. This is
too large to be an I/O threshold. So I believe the correct error limit
should be much smaller.

This patch sets the default cache set error limit to 8; then in
bch_count_io_errors(), when the error counter reaches 8 (if it is the
default value), bch_cache_set_error() will be called to retire the whole
cache set. This patch also removes the bit shifting when storing or showing
the io_error_limit value via the sysfs interface.

Nowadays most SSDs handle internal flash failures automatically by LBA
address re-indirect mapping. If an I/O error can be observed by upper layer
code, it is a notable error, because the SSD can not re-indirect
map the problematic LBA address to an available flash block. This situation
indicates the whole SSD will fail very soon. Therefore setting 8 as
the default io error limit value makes sense; it is enough for most
cache devices.
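As a userspace illustration of the arithmetic described above (this is a sketch, not the kernel function): the counter is kept amplified by IO_ERROR_SHIFT, and with the corrected plain limit of 8 the cache set retires after the 8th observed error instead of after 8 << 20 of them.

```c
#include <assert.h>

#define IO_ERROR_SHIFT		20
#define DEFAULT_IO_ERROR_LIMIT	8

/* Sketch of the counting in bch_count_io_errors(): the stored counter is
 * amplified by IO_ERROR_SHIFT (which gives the error-decay math fractional
 * precision), so it must be shifted back down before being compared with
 * the un-amplified limit. Returns 1 when the limit is reached. */
static int count_io_error(unsigned *io_errors)
{
	unsigned errors;

	*io_errors += 1u << IO_ERROR_SHIFT;	/* record one I/O error */
	errors = *io_errors >> IO_ERROR_SHIFT;	/* real error count */
	return errors >= DEFAULT_IO_ERROR_LIMIT;
}
```

The old bug was comparing this de-amplified count against an amplified limit, so the retire path was effectively unreachable.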

Changelog:
v2: add reviewed-by from Hannes.
v1: initial version for review.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/bcache.h | 1 +
 drivers/md/bcache/super.c  | 2 +-
 drivers/md/bcache/sysfs.c  | 4 ++--
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index a857eb3c10de..b8c2e1bef1f1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -666,6 +666,7 @@ struct cache_set {
ON_ERROR_UNREGISTER,
ON_ERROR_PANIC,
}   on_error;
+#define DEFAULT_IO_ERROR_LIMIT 8
 	unsigned		error_limit;
 	unsigned		error_decay;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 133b81225ea9..e29150ec17e7 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1553,7 +1553,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 
c->congested_read_threshold_us  = 2000;
 	c->congested_write_threshold_us = 20000;
-   c->error_limit  = 8 << IO_ERROR_SHIFT;
+   c->error_limit  = DEFAULT_IO_ERROR_LIMIT;
 
return c;
 err:
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 46e7a6b3e7c7..c524305cc9a7 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -568,7 +568,7 @@ SHOW(__bch_cache_set)
 
/* See count_io_errors for why 88 */
sysfs_print(io_error_halflife,  c->error_decay * 88);
-   sysfs_print(io_error_limit, c->error_limit >> IO_ERROR_SHIFT);
+   sysfs_print(io_error_limit, c->error_limit);
 
sysfs_hprint(congested,
 ((uint64_t) bch_get_congested(c)) << 9);
@@ -668,7 +668,7 @@ STORE(__bch_cache_set)
}
 
 	if (attr == &sysfs_io_error_limit)
-   c->error_limit = strtoul_or_return(buf) << IO_ERROR_SHIFT;
+   c->error_limit = strtoul_or_return(buf);
 
/* See count_io_errors() for why 88 */
 	if (attr == &sysfs_io_error_halflife)
-- 
2.14.1



[PATCH 3/8] bcache: properly set task state in bch_writeback_thread()

2018-02-07 Thread Michael Lyle
From: Coly Li <col...@suse.de>

Kernel thread routine bch_writeback_thread() has the following code block,

447 down_write(&dc->writeback_lock);
448~450 if (check conditions) {
451 up_write(&dc->writeback_lock);
452 set_current_state(TASK_INTERRUPTIBLE);
453
454 if (kthread_should_stop())
455 return 0;
456
457 schedule();
458 continue;
459 }

If condition check is true, its task state is set to TASK_INTERRUPTIBLE
and call schedule() to wait for others to wake up it.

There are 2 issues in current code,
1, Task state is set to TASK_INTERRUPTIBLE after the condition checks; if
   another process changes the condition and calls wake_up_process(dc->
   writeback_thread), then at line 452 the task state is set back to
   TASK_INTERRUPTIBLE, and the writeback kernel thread will lose a chance
   to be woken up.
2, At line 454, if kthread_should_stop() is true, the writeback kernel
   thread will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE
   and call do_exit(). It is not good to enter do_exit() with task state
   TASK_INTERRUPTIBLE; in the following code path might_sleep() is called
   and a warning message is reported by __might_sleep(): "WARNING: do not
   call blocking ops when !TASK_RUNNING; state=1 set at []".

For the first issue, the task state should be set before the condition
checks. Indeed, because dc->writeback_lock is required when modifying all
the conditions, calling set_current_state() inside the code block where
dc->writeback_lock is held is safe. But this is quite implicit, so I still
move set_current_state() before all the condition checks.

For the second issue, frankly speaking it does not hurt when a kernel
thread exits with TASK_INTERRUPTIBLE state, but this warning message scares
users, making them feel there might be something risky with bcache that
could hurt their data. Setting the task state to TASK_RUNNING before
returning fixes this problem.

In alloc.c:allocator_wait(), there is also a similar issue, and is also
fixed in this patch.

Changelog:
v3: merge two similar fixes into one patch
v2: fix the race issue in v1 patch.
v1: initial buggy fix.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.de>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Cc: Michael Lyle <ml...@lyle.org>
Cc: Junhui Tang <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/alloc.c | 4 +++-
 drivers/md/bcache/writeback.c | 7 +++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 6cc6c0f9c3a9..458e1d38577d 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -287,8 +287,10 @@ do {							\
break;  \
\
mutex_unlock(&(ca)->set->bucket_lock);  \
-   if (kthread_should_stop())  \
+   if (kthread_should_stop()) {\
+   set_current_state(TASK_RUNNING);\
return 0;   \
+   }   \
\
schedule(); \
mutex_lock(&(ca)->set->bucket_lock);\
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 51306a19ab03..58218f7e77c3 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -564,18 +564,21 @@ static int bch_writeback_thread(void *arg)
 
while (!kthread_should_stop()) {
 		down_write(&dc->writeback_lock);
+		set_current_state(TASK_INTERRUPTIBLE);
 		if (!atomic_read(&dc->has_dirty) ||
 		    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
 		     !dc->writeback_running)) {
 			up_write(&dc->writeback_lock);
-			set_current_state(TASK_INTERRUPTIBLE);
 
-   if (kthread_should_stop())
+   if (kthread_should_stop()) {
+   set_current_state(TASK_RUNNING);
return 0;
+   }
 
schedule();
continue;
}
+   set_current_state(TASK_RUNNING);
 
searched_full_index = refill_dirty(dc);
 
-- 
2.14.1



[PATCH 1/8] bcache: add journal statistic

2018-02-07 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

Sometimes the journal takes up a lot of CPU; we need statistics
to know what the journal is doing. So this patch provides
some journal statistics:
1) reclaim: how many times the journal tries to reclaim resources,
   usually because the journal buckets and/or the pins are exhausted.
2) flush_write: how many times the journal tries to flush a btree node
   to the cache device, usually because the journal buckets are exhausted.
3) retry_flush_write: how many times the journal retries flushing
   the next btree node, usually because the previous btree node has been
   flushed by another thread.
We show these statistics through the sysfs interface. With these statistics
we can fully see the status of the journal module when CPU usage is too
high.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bcache.h  |  4 ++++
 drivers/md/bcache/journal.c |  5 +++++
 drivers/md/bcache/sysfs.c   | 15 +++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5e2d4e80198e..b98d7705acb6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -658,6 +658,10 @@ struct cache_set {
atomic_long_t   writeback_keys_done;
atomic_long_t   writeback_keys_failed;
 
+   atomic_long_t   reclaim;
+   atomic_long_t   flush_write;
+   atomic_long_t   retry_flush_write;
+
enum{
ON_ERROR_UNREGISTER,
ON_ERROR_PANIC,
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index a87165c1d8e5..f5296007a9d5 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -377,6 +377,8 @@ static void btree_flush_write(struct cache_set *c)
 */
struct btree *b, *best;
unsigned i;
+
+	atomic_long_inc(&c->flush_write);
 retry:
best = NULL;
 
@@ -397,6 +399,7 @@ static void btree_flush_write(struct cache_set *c)
if (!btree_current_write(b)->journal) {
 			mutex_unlock(&b->write_lock);
/* We raced */
+			atomic_long_inc(&c->retry_flush_write);
goto retry;
}
 
@@ -476,6 +479,8 @@ static void journal_reclaim(struct cache_set *c)
unsigned iter, n = 0;
atomic_t p;
 
+	atomic_long_inc(&c->reclaim);
+
 	while (!atomic_read(&fifo_front(&c->journal.pin)))
 		fifo_pop(&c->journal.pin, p);
 
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index b4184092c727..46e7a6b3e7c7 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -65,6 +65,9 @@ read_attribute(bset_tree_stats);
 
 read_attribute(state);
 read_attribute(cache_read_races);
+read_attribute(reclaim);
+read_attribute(flush_write);
+read_attribute(retry_flush_write);
 read_attribute(writeback_keys_done);
 read_attribute(writeback_keys_failed);
 read_attribute(io_errors);
@@ -545,6 +548,15 @@ SHOW(__bch_cache_set)
 	sysfs_print(cache_read_races,
 		    atomic_long_read(&c->cache_read_races));
 
+	sysfs_print(reclaim,
+		    atomic_long_read(&c->reclaim));
+
+	sysfs_print(flush_write,
+		    atomic_long_read(&c->flush_write));
+
+	sysfs_print(retry_flush_write,
+		    atomic_long_read(&c->retry_flush_write));
+
 	sysfs_print(writeback_keys_done,
 		    atomic_long_read(&c->writeback_keys_done));
 	sysfs_print(writeback_keys_failed,
@@ -731,6 +743,9 @@ static struct attribute *bch_cache_set_internal_files[] = {
 
 	&sysfs_bset_tree_stats,
 	&sysfs_cache_read_races,
+	&sysfs_reclaim,
+	&sysfs_flush_write,
+	&sysfs_retry_flush_write,
 	&sysfs_writeback_keys_done,
 	&sysfs_writeback_keys_failed,
 
-- 
2.14.1



[PATCH 6/8] bcache: set writeback_rate_update_seconds in range [1, 60] seconds

2018-02-07 Thread Michael Lyle
From: Coly Li <col...@suse.de>

dc->writeback_rate_update_seconds can be set via sysfs and its value can
be set to [1, ULONG_MAX].  It does not make sense to set such a large
value; 60 seconds is long enough, considering that the default 5 seconds
has worked well for a long time.

Because dc->writeback_rate_update is a special delayed work, it re-arms
itself inside the delayed work routine update_writeback_rate(). When
stopping it by cancel_delayed_work_sync(), there should be a timeout to
wait and make sure the re-armed delayed work is stopped too. A small max
value of dc->writeback_rate_update_seconds is also helpful to decide a
reasonable small timeout.

This patch limits sysfs interface to set dc->writeback_rate_update_seconds
in range of [1, 60] seconds, and replaces the hand-coded number by macros.

Changelog:
v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
v1: initial version.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/sysfs.c | 4 +++-
 drivers/md/bcache/writeback.c | 2 +-
 drivers/md/bcache/writeback.h | 3 +++
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index c524305cc9a7..4a6a697e1680 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -218,7 +218,9 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_rate,
dc->writeback_rate.rate, 1, INT_MAX);
 
-   d_strtoul_nonzero(writeback_rate_update_seconds);
+   sysfs_strtoul_clamp(writeback_rate_update_seconds,
+   dc->writeback_rate_update_seconds,
+   1, WRITEBACK_RATE_UPDATE_SECS_MAX);
d_strtoul(writeback_rate_i_term_inverse);
d_strtoul_nonzero(writeback_rate_p_term_inverse);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 58218f7e77c3..f1d2fc15abcc 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -655,7 +655,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate.rate = 1024;
dc->writeback_rate_minimum  = 8;
 
-   dc->writeback_rate_update_seconds = 5;
+   dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
dc->writeback_rate_p_term_inverse = 40;
dc->writeback_rate_i_term_inverse = 1;
 
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 66f1c527fa24..587b25599856 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -8,6 +8,9 @@
 #define MAX_WRITEBACKS_IN_PASS  5
 #define MAX_WRITESIZE_IN_PASS   5000   /* *512b */
 
+#define WRITEBACK_RATE_UPDATE_SECS_MAX 60
+#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT 5
+
 /*
  * 14 (16384ths) is chosen here as something that each backing device
  * should be a reasonable fraction of the share, and not to blow up
-- 
2.14.1



[PATCH 2/8] bcache: fix high CPU occupancy during journal

2018-02-07 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

After running small-write I/O for a long time, we found that CPU occupancy
is very high and I/O performance has been reduced by about half:

[root@ceph151 internal]# top
top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us, 17.1 sy,  0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem

  PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND
 2023 root 20  0   0  0  0 S 55.1  0.0   0:04.42 kworker/11:191
14126 root 20  0   0  0  0 S 42.9  0.0   0:08.72 kworker/10:3
 9292 root 20  0   0  0  0 S 30.4  0.0   1:10.99 kworker/6:1
 8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
12287 root 20  0   0  0  0 S 26.7  0.0   0:28.13 kworker/7:85
31019 root 20  0   0  0  0 S 26.1  0.0   1:30.79 kworker/22:1
 1787 root 20  0   0  0  0 R 25.7  0.0   5:18.45 kworker/8:7
32169 root 20  0   0  0  0 S 14.5  0.0   1:01.92 kworker/23:1
21476 root 20  0   0  0  0 S 13.9  0.0   0:05.09 kworker/1:54
 2204 root 20  0   0  0  0 S 12.5  0.0   1:25.17 kworker/9:10
16994 root 20  0   0  0  0 S 12.2  0.0   0:06.27 kworker/5:106
15714 root 20  0   0  0  0 R 10.9  0.0   0:01.85 kworker/19:2
 9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
 9960 root 20  0   0  0  0 S 10.2  0.0   0:02.75 kworker/2:139
11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
 6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd

The kernel workers consumed a lot of CPU, and I found they were running
journal work; the journal is reclaiming resources and flushing btree nodes
with surprising frequency.

Through further analysis, we found that in btree_flush_write(), we try to
get the btree node with the smallest fifo index to flush by traversing all
the btree nodes in c->bucket_hash. After we get it, since no lock protects
it, this btree node may have been written to the cache device by other
work, and if this occurs, we retry the traversal of c->bucket_hash to get
another btree node. When the problem occurred, the retry count was very
high, and we consumed a lot of CPU looking for an appropriate btree node.

In this patch, we record the 128 btree nodes with the smallest fifo index
in a heap, and pop them one by one when we need to flush a btree node. This
greatly reduces the time the loop takes to find an appropriate btree node,
and also reduces the CPU occupancy.

[note by mpl: this triggers a checkpatch error because of adjacent,
pre-existing style violations]
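The selection structure can be illustrated with a small userspace sketch (the actual patch uses bcache's DECLARE_HEAP machinery and 128 entries; the names and the capacity of 4 here are illustrative): a bounded max-heap keeps only the K candidates with the smallest journal fifo index while scanning once, so flushing works through this small set instead of rescanning c->bucket_hash after every race.

```c
#include <assert.h>
#include <stddef.h>

#define FLUSH_HEAP_SIZE 4	/* illustrative; the patch keeps 128 */

/* Max-heap on fifo index: the root is the largest of the K smallest
 * indexes seen so far, so a smaller newcomer evicts it. */
struct flush_heap {
	size_t used;
	unsigned data[FLUSH_HEAP_SIZE];
};

static void heap_swap(unsigned *a, unsigned *b)
{
	unsigned t = *a; *a = *b; *b = t;
}

static void sift_down(struct flush_heap *h, size_t i)
{
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, m = i;

		if (l < h->used && h->data[l] > h->data[m]) m = l;
		if (r < h->used && h->data[r] > h->data[m]) m = r;
		if (m == i)
			return;
		heap_swap(&h->data[i], &h->data[m]);
		i = m;
	}
}

/* Offer a node's fifo index: keep it only if it is among the K smallest. */
static void heap_offer(struct flush_heap *h, unsigned idx)
{
	if (h->used < FLUSH_HEAP_SIZE) {
		size_t i = h->used++;

		h->data[i] = idx;
		while (i && h->data[(i - 1) / 2] < h->data[i]) {
			heap_swap(&h->data[i], &h->data[(i - 1) / 2]);
			i = (i - 1) / 2;
		}
	} else if (idx < h->data[0]) {
		h->data[0] = idx;	/* evict the current largest */
		sift_down(h, 0);
	}
}
```

One pass over the hash table fills the heap; each flush attempt then pops a candidate, and a raced node simply moves on to the next one.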

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bcache.h  |  2 ++
 drivers/md/bcache/journal.c | 47 ++++++++++++++++++++++++++++++++---------------
 drivers/md/bcache/util.h|  2 ++
 3 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index b98d7705acb6..a857eb3c10de 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -679,6 +679,8 @@ struct cache_set {
 
 #define BUCKET_HASH_BITS   12
struct hlist_head   bucket_hash[1 << BUCKET_HASH_BITS];
+
+   DECLARE_HEAP(struct btree *, flush_btree);
 };
 
 struct bbio {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index f5296007a9d5..1b736b860739 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -368,6 +368,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 }
 
 /* Journalling */
+#define journal_max_cmp(l, r) \
+	(fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) < \
+	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
+#define journal_min_cmp(l, r) \
+	(fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) > \
+	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
 
 static void btree_flush_write(struct cache_set *c)
 {
@@ -375,25 +381,35 @@ static void btree_flush_write(struct cache_set *c)
 * Try to find the btree node with that references the oldest journal
 * entry, best is our current candidate and is locked if non NULL:
 */
-   struct btree *b, *best;
-   unsigned i;
+   struct btree *b;
+   int i;
 
atomic_long_inc(>flush_write);
+
 retry:
-   best = NULL;
-
-   for_each_cached_btree(b, c, i)
-   if (btree_current_write(b)->journal) {
-   if (!best)
- 

Bcache patches for inclusion in 4.16

2018-02-07 Thread Michael Lyle
Hi Jens---

Here are a few more changes for 4.16 from Coly Li and Tang Junhui.

Probably the most "functional" is 2/8, which makes a change to improve a
performance bug under sustained small writes.  Still, it's a small, safe
change that seems good to take.  The rest are small bugfixes.

All have seen some degree of test (which is, of course, always ongoing).

Note that there's one checkpatch error in this, but it was already present
and is just triggered by altering an adjacent line.

 9 files changed, 111 insertions(+), 36 deletions(-)

Coly Li (3):
  bcache: properly set task state in bch_writeback_thread()
  bcache: set error_limit correctly
  bcache: set writeback_rate_update_seconds in range [1, 60] seconds

Tang Junhui (5):
  bcache: add journal statistic
  bcache: fix high CPU occupancy during journal
  bcache: fix for allocator and register thread race
  bcache: return attach error when no cache set exist
  bcache: fix for data collapse after re-attaching an attached device

Thanks, as always,

Mike


Re: [PATCH v5 07/10] bcache: fix inaccurate io state for detached bcache devices

2018-02-07 Thread Michael Lyle
detatched should be detached, otherwise lgtm.

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/05/2018 08:20 PM, Coly Li wrote:
> From: Tang Junhui <tang.jun...@zte.com.cn>
> 
> When we run IO on a detached device and run iostat to show the IO status,
> normally it will show as below (some fields omitted):
> Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdd... 15.89 0.531.820.202.23   1.81  52.30
> bcache0... 15.89   115.420.000.000.00   2.40  69.60
> but after IO has stopped, there are still very big avgqu-sz and %util
> values, as below:
> Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> bcache0   ...  0   5326.320.000.000.00   0.00 100.10
> 
> The reason for this issue is that, only generic_start_io_acct() called
> and no generic_end_io_acct() called for detached device in
> cached_dev_make_request(). See the code:
> //start generic_start_io_acct()
> generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
> if (cached_dev_get(dc)) {
>   //will callback generic_end_io_acct()
> }
> else {
>   //will not call generic_end_io_acct()
> }
> 
> This patch calls generic_end_io_acct() in the end of IO for detached
> devices, so we can show IO state correctly.
> 
> (Modified to use GFP_NOIO in kzalloc() by Coly Li)
> 
> Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
> Reviewed-by: Coly Li <col...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> ---
>  drivers/md/bcache/request.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 51 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 02296bda6384..e09c5ae745be 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -986,6 +986,55 @@ static void cached_dev_nodata(struct closure *cl)
>   continue_at(cl, cached_dev_bio_complete, NULL);
>  }
>  
> +struct detached_dev_io_private {
> + struct bcache_device*d;
> + unsigned long   start_time;
> + bio_end_io_t*bi_end_io;
> + void*bi_private;
> +};
> +
> +static void detatched_dev_end_io(struct bio *bio)
> +{
> + struct detached_dev_io_private *ddip;
> +
> + ddip = bio->bi_private;
> + bio->bi_end_io = ddip->bi_end_io;
> + bio->bi_private = ddip->bi_private;
> +
> + generic_end_io_acct(ddip->d->disk->queue,
> + bio_data_dir(bio),
> + &ddip->d->disk->part0, ddip->start_time);
> +
> + kfree(ddip);
> +
> + bio->bi_end_io(bio);
> +}
> +
> +static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
> +{
> + struct detached_dev_io_private *ddip;
> + struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +
> + /*
> +  * no need to call closure_get(&d->disk.cl),
> +  * because upper layer had already opened bcache device,
> +  * which would call closure_get(&d->disk.cl)
> +  */
> + ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
> + ddip->d = d;
> + ddip->start_time = jiffies;
> + ddip->bi_end_io = bio->bi_end_io;
> + ddip->bi_private = bio->bi_private;
> + bio->bi_end_io = detatched_dev_end_io;
> + bio->bi_private = ddip;
> +
> + if ((bio_op(bio) == REQ_OP_DISCARD) &&
> + !blk_queue_discard(bdev_get_queue(dc->bdev)))
> + bio->bi_end_io(bio);
> + else
> + generic_make_request(bio);
> +}
> +
>  /* Cached devices - read & write stuff */
>  
>  static blk_qc_t cached_dev_make_request(struct request_queue *q,
> @@ -1028,13 +1077,8 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
>   else
>   cached_dev_read(dc, s);
>   }
> - } else {
> - if ((bio_op(bio) == REQ_OP_DISCARD) &&
> - !blk_queue_discard(bdev_get_queue(dc->bdev)))
> - bio_endio(bio);
> - else
> - generic_make_request(bio);
> - }
> + } else
> + detached_dev_do_request(d, bio);
>  
>   return BLK_QC_T_NONE;
>  }
> 



Re: [PATCH v5 06/10] bcache: add stop_when_cache_set_failed option to backing device

2018-02-07 Thread Michael Lyle
I agree about the typos-- after they're fixed,

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/05/2018 08:20 PM, Coly Li wrote:
> When there are too many I/O errors on cache device, current bcache code
> will retire the whole cache set, and detach all bcache devices. But the
> detached bcache devices are not stopped, which is problematic when bcache
> is in writeback mode.
> 
> If the retired cache set has dirty data of backing devices, continuing to
> write to the bcache device will write to the backing device directly. If
> the LBA of a write request has a dirty version cached on the cache device,
> the next time the cache device is re-registered and the backing device is
> re-attached to it, the stale dirty data on the cache device will be
> written to the backing device, overwriting the latest directly written
> data. This situation causes quite serious data corruption.
> 
> But we cannot simply stop all attached bcache devices when the cache set is
> broken or disconnected. For example, consider using bcache to accelerate
> the performance of an email service. In such a workload, if the cache
> device is broken but no dirty data is lost, keeping the bcache device
> alive and permitting the email service to continue to access user data
> might be a better solution for the cache device failure.
> 
> Nix <n...@esperi.org.uk> points out the issue and provides the above example
> to explain why it might be necessary to not stop bcache device for broken
> cache device. Pavel Goran <via-bca...@pvgoran.name> provides a brilliant
> suggestion to provide "always" and "auto" options for the per-cached-device
> sysfs file stop_when_cache_set_failed. If the cache set is retiring and the
> backing device has no dirty data on the cache, it should be safe to keep
> the bcache device alive. In this case, if stop_when_cache_set_failed is set
> to "auto", the device failure handling code will not stop this bcache
> device and will permit the application to access the backing device with an
> unattached bcache device.
> 
> Changelog:
> v2: change option values of stop_when_cache_set_failed from 1/0 to
> "auto"/"always".
> v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1
> (always stop).
> 
> Signed-off-by: Coly Li <col...@suse.de>
> Cc: Nix <n...@esperi.org.uk>
> Cc: Pavel Goran <via-bca...@pvgoran.name>
> Cc: Michael Lyle <ml...@lyle.org>
> Cc: Junhui Tang <tang.jun...@zte.com.cn>
> Cc: Hannes Reinecke <h...@suse.com>
> ---
>  drivers/md/bcache/bcache.h |  9 +
>  drivers/md/bcache/super.c  | 82 
> --
>  drivers/md/bcache/sysfs.c  | 17 ++
>  3 files changed, 98 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 7917b3820dd5..263164490833 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -287,6 +287,12 @@ struct io {
>   sector_t		last;
>  };
>  
> +enum stop_on_faliure {
> + BCH_CACHED_DEV_STOP_ATUO = 0,
> + BCH_CACHED_DEV_STOP_ALWAYS,
> + BCH_CACHED_DEV_STOP_MODE_MAX,
> +};
> +
>  struct cached_dev {
>   struct list_head	list;
>   struct bcache_device	disk;
> @@ -379,6 +385,8 @@ struct cached_dev {
>   unsigned		writeback_rate_i_term_inverse;
>   unsigned		writeback_rate_p_term_inverse;
>   unsigned		writeback_rate_minimum;
> +
> + enum stop_on_faliure	stop_when_cache_set_failed;
>  };
>  
>  enum alloc_reserve {
> @@ -924,6 +932,7 @@ void bch_write_bdev_super(struct cached_dev *, struct 
> closure *);
>  
>  extern struct workqueue_struct *bcache_wq;
>  extern const char * const bch_cache_modes[];
> +extern const char * const bch_stop_on_failure_modes[];
>  extern struct mutex bch_register_lock;
>  extern struct list_head bch_cache_sets;
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index f8b0d1196c12..e335433bdfb7 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -47,6 +47,14 @@ const char * const bch_cache_modes[] = {
>   NULL
>  };
>  
> +/* Default is -1; we skip past it for stop_when_cache_set_failed */
> +const char * const bch_stop_on_failure_modes[] = {
> + "default",
> + "auto",
> + "always",
> + NULL
> +};
> +
>  static struct kobject *bcache_kobj;
>  struct mutex bch_register_lock;
>  LIST_HEAD(bch_cache_sets);
> @@ -1187,6 +1195,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
>   max(dc->disk.disk->queue->backing_dev_info->

Re: [PATCH v5 03/10] bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set

2018-02-07 Thread Michael Lyle
Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/05/2018 08:20 PM, Coly Li wrote:
> In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
> cached_dev_get() is called when creating dc->writeback_thread, and
> cached_dev_put() is called when exiting dc->writeback_thread. This
> modification works well unless people detach the bcache device manually by
> 'echo 1 > /sys/block/bcache/bcache/detach'
> Because this sysfs interface only calls bch_cached_dev_detach() which wakes
> up dc->writeback_thread but does not stop it. The reason is, before patch
> "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside
> bch_writeback_thread(), if cache is not dirty after writeback,
> cached_dev_put() will be called here. And in cached_dev_make_request() when
> a new write request makes cache from clean to dirty, cached_dev_get() will
> be called there. Since we don't operate dc->count in these locations,
> refcount dc->count cannot be dropped after cache becomes clean, and
> cached_dev_detach_finish() won't be called to detach bcache device.
> 
> This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
> set inside bch_writeback_thread(). If this bit is set and cache is clean
> (no existing writeback_keys), break the while-loop, call cached_dev_put()
> and quit the writeback thread.
> 
> Please note that if the cache is still dirty, even if BCACHE_DEV_DETACHING is
> set, the writeback thread should continue to perform writeback; this is the
> original design of manual detach.
> 
> It is safe to do the following check without locking, let me explain why,
> + if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
> + (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {
> 
> If the kernel thread does not sleep and continues to run because the conditions
> are not updated in time on the running CPU core, it just consumes more CPU
> cycles and does no harm. This should-sleep-but-run case is safe here. We just
> focus on the should-run-but-sleep condition, which means the writeback
> thread goes to sleep in mistake while it should continue to run.
> 1, First of all, no matter whether the writeback thread is hung or not, kthread_stop() from
>cached_dev_detach_finish() will wake it up and terminate it by making
>kthread_should_stop() return true. And in normal run time, bit on index
>BCACHE_DEV_DETACHING is always cleared, the condition
>   !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)
>is always true and can be ignored as constant value.
> 2, If one of the following conditions is true, the writeback thread should
>go to sleep,
>"!atomic_read(&dc->has_dirty)" or "!dc->writeback_running"
>each of them independently controls the writeback thread should sleep or
>not, let's analyse them one by one.
> 2.1 condition "!atomic_read(&dc->has_dirty)"
>If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will
>call bch_writeback_queue() immediately or call bch_writeback_add() which
>indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
>wake_up_process(dc->writeback_thread) is called. It sets writeback
>thread's task state to TASK_RUNNING and following an implicit memory
>barrier, then tries to wake up the writeback thread.
>In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
>doing the condition check. If other CPU core sets the TASK_RUNNING state
>after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
>will be scheduled to run very soon because its state is not
>TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
>writeback thread setting TASK_INTERRUPTIBLE, the implicit memory barrier
>of wake_up_process() will make sure modification of dc->has_dirty on
>other CPU core is updated and observed on the CPU core of writeback
>thread. Therefore the condition check will correctly be false, and
>continue writeback code without sleeping.
> 2.2 condition "!dc->writeback_running"
>dc->writeback_running can be changed via sysfs file, every time it is
>modified, a following bch_writeback_queue() is always called. So the
>change is always observed on the CPU core of writeback thread. If
>dc->writeback_running is changed from 0 to 1 on other CPU core, this
>condition check will observe the modification and allow writeback
>thread to continue to run without sleeping.
> Now we can see, even without a locking protection, multiple conditions
> check is safe here, no deadlock or process hang up will happen.
> 
> I compose a separate patch because that patch "bcache

Re: [PATCH v5 01/10] bcache: set writeback_rate_update_seconds in range [1, 60] seconds

2018-02-07 Thread Michael Lyle
LGTM-- in my staging branch.

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 02/05/2018 08:20 PM, Coly Li wrote:
> dc->writeback_rate_update_seconds can be set via sysfs and its value can
> be set to [1, ULONG_MAX].  It does not make sense to set such a large
> value; 60 seconds is long enough, considering that the default of 5
> seconds has worked well for a long time.
> 
> Because dc->writeback_rate_update is a special delayed work, it re-arms
> itself inside the delayed work routine update_writeback_rate(). When
> stopping it by cancel_delayed_work_sync(), there should be a timeout to
> wait and make sure the re-armed delayed work is stopped too. A small max
> value of dc->writeback_rate_update_seconds is also helpful to decide a
> reasonable small timeout.
> 
> This patch limits the sysfs interface to set dc->writeback_rate_update_seconds
> in the range of [1, 60] seconds, and replaces the hand-coded numbers with macros.
> 
> Changelog:
> v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
> v1: initial version.
> 
> Signed-off-by: Coly Li <col...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> Cc: Michael Lyle <ml...@lyle.org>
> ---
>  drivers/md/bcache/sysfs.c | 4 +++-
>  drivers/md/bcache/writeback.c | 2 +-
>  drivers/md/bcache/writeback.h | 3 +++
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index c524305cc9a7..4a6a697e1680 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -218,7 +218,9 @@ STORE(__cached_dev)
>   sysfs_strtoul_clamp(writeback_rate,
>   dc->writeback_rate.rate, 1, INT_MAX);
>  
> - d_strtoul_nonzero(writeback_rate_update_seconds);
> + sysfs_strtoul_clamp(writeback_rate_update_seconds,
> + dc->writeback_rate_update_seconds,
> + 1, WRITEBACK_RATE_UPDATE_SECS_MAX);
>   d_strtoul(writeback_rate_i_term_inverse);
>   d_strtoul_nonzero(writeback_rate_p_term_inverse);
>  
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 58218f7e77c3..f1d2fc15abcc 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -655,7 +655,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>   dc->writeback_rate.rate = 1024;
>   dc->writeback_rate_minimum  = 8;
>  
> - dc->writeback_rate_update_seconds = 5;
> + dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
>   dc->writeback_rate_p_term_inverse = 40;
>   dc->writeback_rate_i_term_inverse = 1;
>  
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index 66f1c527fa24..587b25599856 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -8,6 +8,9 @@
>  #define MAX_WRITEBACKS_IN_PASS  5
>  #define MAX_WRITESIZE_IN_PASS   5000 /* *512b */
>  
> +#define WRITEBACK_RATE_UPDATE_SECS_MAX   60
> +#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT   5
> +
>  /*
>   * 14 (16384ths) is chosen here as something that each backing device
>   * should be a reasonable fraction of the share, and not to blow up
> 



Re: [PATCH] [PATCH v2] bcache: fix for allocator and register thread race

2018-02-01 Thread Michael Lyle
Hi Tang Junhui--

On 01/25/2018 04:30 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui <tang.jun...@zte.com.cn>
[snip]
> The stack shows the register thread and allocator thread
> were getting stuck when registering cache device.

This looks good-- applying to my 4.16-related test branch.

Reviewed-by: Michael Lyle <ml...@lyle.org>


Re: [PATCH v4 00/13] bcache: device failure handling improvement

2018-02-01 Thread Michael Lyle
On 01/27/2018 05:56 PM, Coly Li wrote:
> Hi maintainers and folks,

Hi Coly---

> This patch set tries to improve bcache device failure handling, includes
> cache device and backing device failures.

I've skimmed this whole patchset and overall it looks good.  I have
taken some pieces (2,6) for possible-4.16.  I understand you'll be
sending out a v5 soon-- I'll perform a more detailed review then.

Thanks,

Mike


Re: [PATCH v4 06/13] bcache: set error_limit correctly

2018-02-01 Thread Michael Lyle
On 01/27/2018 06:23 AM, Coly Li wrote:
> Struct cache uses io_errors for two purposes,
> - Error decay: when cache set error_decay is set, io_errors is used to
>   generate a small piece of delay when I/O error happens.
> - I/O errors counter: in order to generate big enough value for error
>   decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a
>   IO_ERROR_SHIFT).

This LGTM

Reviewed-by: Michael Lyle <ml...@lyle.org>


Re: [PATCH v4 02/13] bcache: properly set task state in bch_writeback_thread()

2018-02-01 Thread Michael Lyle
On 01/27/2018 06:23 AM, Coly Li wrote:
> Kernel thread routine bch_writeback_thread() has the following code block,
> 
> 447 down_write(>writeback_lock);
> 448~450 if (check conditions) {
> 451 up_write(>writeback_lock);
> 452 set_current_state(TASK_INTERRUPTIBLE);
> 453
> 454 if (kthread_should_stop())
> 455 return 0;
> 456
> 457 schedule();
> 458 continue;
> 459 }
> 
> If condition check is true, its task state is set to TASK_INTERRUPTIBLE
> and call schedule() to wait for others to wake up it.
> 
> There are 2 issues in current code,
> 1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if
>another process changes the condition and calls wake_up_process(dc->
>writeback_thread), then at line 452 the task state is set back to
>TASK_INTERRUPTIBLE and the writeback kernel thread will lose a chance to be
>woken up.
> 2, At line 454 if kthread_should_stop() is true, writeback kernel thread
>will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
>call do_exit(). It is not good to enter do_exit() with task state
>TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
>warning message is reported by __might_sleep(): "WARNING: do not call
>blocking ops when !TASK_RUNNING; state=1 set at []".
> 
> For the first issue, task state should be set before condition checks.
> Indeed, because dc->writeback_lock is required when modifying all the
> conditions, calling set_current_state() inside the code block where dc->
> writeback_lock is held is safe. But this is quite implicit, so I still move
> set_current_state() before all the condition checks.
> 
> For the second issue, frankly speaking, it does not hurt when the kernel thread
> exits with TASK_INTERRUPTIBLE state, but this warning message scares users,
> makes them feel there might be something risky with bcache and hurt their
> data.  Setting task state to TASK_RUNNING before returning fixes this
> problem.
> 
> In alloc.c:allocator_wait(), there is also a similar issue, and is also
> fixed in this patch.
> 
> Changelog:
> v3: merge two similar fixes into one patch
> v2: fix the race issue in v1 patch.
> v1: initial buggy fix.
> 
> Signed-off-by: Coly Li <col...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.de>
> Cc: Michael Lyle <ml...@lyle.org>
> Cc: Junhui Tang <tang.jun...@zte.com.cn>

This one LGTM-- applying to my staging branch

Reviewed-by: Michael Lyle <ml...@lyle.org>


Re: [PATCH v4 01/13] bcache: set writeback_rate_update_seconds in range [1, 60] seconds

2018-02-01 Thread Michael Lyle
On 01/27/2018 06:23 AM, Coly Li wrote:
> dc->writeback_rate_update_seconds can be set via sysfs and its value can
> be set to [1, ULONG_MAX].  It does not make sense to set such a large
> value; 60 seconds is long enough, considering that the default of 5
> seconds has worked well for a long time.
> 
> Because dc->writeback_rate_update is a special delayed work, it re-arms
> itself inside the delayed work routine update_writeback_rate(). When
> stopping it by cancel_delayed_work_sync(), there should be a timeout to
> wait and make sure the re-armed delayed work is stopped too. A small max
> value of dc->writeback_rate_update_seconds is also helpful to decide a
> reasonable small timeout.
> 
> This patch limits the sysfs interface to set dc->writeback_rate_update_seconds
> in the range of [1, 60] seconds, and replaces the hand-coded numbers with macros.
> 
> Signed-off-by: Coly Li 
> Reviewed-by: Hannes Reinecke 

[snip]

> ---
>  drivers/md/bcache/sysfs.c | 3 +++
>  drivers/md/bcache/writeback.c | 2 +-
>  drivers/md/bcache/writeback.h | 3 +++
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index b4184092c727..a74a752c9e0f 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -215,6 +215,9 @@ STORE(__cached_dev)
>   sysfs_strtoul_clamp(writeback_rate,
>   dc->writeback_rate.rate, 1, INT_MAX);
>  
> + sysfs_strtoul_clamp(writeback_rate_update_seconds,
> + dc->writeback_rate_update_seconds,
> + 1, WRITEBACK_RATE_UPDATE_SECS_MAX);
>   d_strtoul_nonzero(writeback_rate_update_seconds);

Doesn't the above line need to be removed for this to work?

Mike


Re: [PATCH 2/2] bcache: fix high CPU occupancy during journal

2018-01-31 Thread Michael Lyle
Unfortunately, this doesn't build because of nonexistent call heap_empty
(I assume some changes to util.h got left out).  I really need clean
patches that build and are formatted properly.

Mike

On 01/26/2018 12:24 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> After long time small writing I/O running, we found the occupancy of CPU
> is very high and I/O performance has been reduced by about half:
> 
> [root@ceph151 internal]# top
> top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
> Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
> %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
> KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
> KiB Swap: 65667068 total, 65667068 free,0 used. 25136812 avail Mem
> 
>   PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND
>  2023 root 20  0   0  0  0 S 55.1  0.0   0:04.42 kworker/11:191
> 14126 root 20  0   0  0  0 S 42.9  0.0   0:08.72 kworker/10:3
>  9292 root 20  0   0  0  0 S 30.4  0.0   1:10.99 kworker/6:1   
>  8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
> 12287 root 20  0   0  0  0 S 26.7  0.0   0:28.13 kworker/7:85
> 31019 root 20  0   0  0  0 S 26.1  0.0   1:30.79 kworker/22:1
>  1787 root 20  0   0  0  0 R 25.7  0.0   5:18.45 kworker/8:7
> 32169 root 20  0   0  0  0 S 14.5  0.0   1:01.92 kworker/23:1
> 21476 root 20  0   0  0  0 S 13.9  0.0   0:05.09 kworker/1:54
>  2204 root 20  0   0  0  0 S 12.5  0.0   1:25.17 kworker/9:10
> 16994 root 20  0   0  0  0 S 12.2  0.0   0:06.27 kworker/5:106
> 15714 root 20  0   0  0  0 R 10.9  0.0   0:01.85 kworker/19:2
>  9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
> 11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
>  9960 root 20  0   0  0  0 S 10.2  0.0   0:02.75 kworker/2:139
> 11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
>  6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
> 
> The kernel workers consumed a lot of CPU, and I found they are running
> journal work: the journal is reclaiming resources and flushing btree
> nodes with surprising frequency.
> 
> Through further analysis, we found that in btree_flush_write(), we try to
> get the btree node with the smallest fifo index to flush by traversing all
> the btree nodes in c->bucket_hash. After we get it, since no lock protects
> it, this btree node may have been written to the cache device by other
> workers; if this occurs, we retry the traversal of c->bucket_hash to get
> another btree node. When the problem occurred, the retry count was very
> high, and we consumed a lot of CPU looking for an appropriate btree node.
> 
> In this patch, we record the 128 btree nodes with the smallest fifo index
> in a heap, and pop them one by one when we need to flush a btree node. It
> greatly reduces the time for the loop to find an appropriate btree node,
> and also reduces the CPU occupancy.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/bcache.h  |  2 ++
>  drivers/md/bcache/journal.c | 47 ++---
>  2 files changed, 34 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 0432e28..b343ba4 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -669,6 +669,8 @@ struct cache_set {
>  
>  #define BUCKET_HASH_BITS 12
>   struct hlist_head   bucket_hash[1 << BUCKET_HASH_BITS];
> +
> + DECLARE_HEAP(struct btree *, flush_btree);
>  };
>  
>  struct bbio {
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 47fd0b8..f42d3ea 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -363,6 +363,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>  }
>  
>  /* Journalling */
> +#define journal_max_cmp(l, r) \
> + (fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) < \
> +  fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
> +#define journal_min_cmp(l, r) \
> + (fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) > \
> +  fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
>  
>  static void btree_flush_write(struct cache_set *c)
>  {
> @@ -370,25 +376,35 @@ static void btree_flush_write(struct cache_set *c)
>* Try to find the btree node with that references the oldest journal
>* entry, best is our current candidate and is locked if non NULL:
>*/
> - struct btree *b, *best;
> - unsigned i;
> + struct btree *b;
> + int i;
>  
>   atomic_long_inc(&c->flush_write);
> +
>  retry:
> - best = NULL;
> -
> - for_each_cached_btree(b, c, i)
> -  

Re: [PATCH 2/2] bcache: fix high CPU occupancy during journal

2018-01-31 Thread Michael Lyle
LGTM on first read--  I'll read it again and test in test branch.

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 01/26/2018 12:24 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui <tang.jun...@zte.com.cn>
> 
> After long time small writing I/O running, we found the occupancy of CPU
> is very high and I/O performance has been reduced by about half:
> 
> [root@ceph151 internal]# top
> top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
> Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
> %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
> KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
> KiB Swap: 65667068 total, 65667068 free,0 used. 25136812 avail Mem
> 
>   PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND
>  2023 root 20  0   0  0  0 S 55.1  0.0   0:04.42 kworker/11:191
> 14126 root 20  0   0  0  0 S 42.9  0.0   0:08.72 kworker/10:3
>  9292 root 20  0   0  0  0 S 30.4  0.0   1:10.99 kworker/6:1   
>  8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
> 12287 root 20  0   0  0  0 S 26.7  0.0   0:28.13 kworker/7:85
> 31019 root 20  0   0  0  0 S 26.1  0.0   1:30.79 kworker/22:1
>  1787 root 20  0   0  0  0 R 25.7  0.0   5:18.45 kworker/8:7
> 32169 root 20  0   0  0  0 S 14.5  0.0   1:01.92 kworker/23:1
> 21476 root 20  0   0  0  0 S 13.9  0.0   0:05.09 kworker/1:54
>  2204 root 20  0   0  0  0 S 12.5  0.0   1:25.17 kworker/9:10
> 16994 root 20  0   0  0  0 S 12.2  0.0   0:06.27 kworker/5:106
> 15714 root 20  0   0  0  0 R 10.9  0.0   0:01.85 kworker/19:2
>  9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
> 11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
>  9960 root 20  0   0  0  0 S 10.2  0.0   0:02.75 kworker/2:139
> 11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
>  6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
> 
> The kernel workers consumed a lot of CPU, and I found they are running
> journal work: the journal is reclaiming resources and flushing btree
> nodes with surprising frequency.
> 
> Through further analysis, we found that in btree_flush_write(), we try to
> get the btree node with the smallest fifo index to flush by traversing all
> the btree nodes in c->bucket_hash. After we get it, since no lock protects
> it, this btree node may have been written to the cache device by other
> workers; if this occurs, we retry the traversal of c->bucket_hash to get
> another btree node. When the problem occurred, the retry count was very
> high, and we consumed a lot of CPU looking for an appropriate btree node.
> 
> In this patch, we record the 128 btree nodes with the smallest fifo index
> in a heap, and pop them one by one when we need to flush a btree node. It
> greatly reduces the time for the loop to find an appropriate btree node,
> and also reduces the CPU occupancy.
> 
> Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
> ---
>  drivers/md/bcache/bcache.h  |  2 ++
>  drivers/md/bcache/journal.c | 47 ++---
>  2 files changed, 34 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 0432e28..b343ba4 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -669,6 +669,8 @@ struct cache_set {
>  
>  #define BUCKET_HASH_BITS 12
>   struct hlist_head   bucket_hash[1 << BUCKET_HASH_BITS];
> +
> + DECLARE_HEAP(struct btree *, flush_btree);
>  };
>  
>  struct bbio {
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 47fd0b8..f42d3ea 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -363,6 +363,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>  }
>  
>  /* Journalling */
> +#define journal_max_cmp(l, r) \
> + (fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) < \
> +  fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
> +#define journal_min_cmp(l, r) \
> + (fifo_idx(&(c)->journal.pin, btree_current_write(l)->journal) > \
> +  fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
>  
>  static void btree_flush_write(struct cache_set *c)
>  {
> @@ -370,25 +376,35 @@ static void btree_flush_write(struct cache_set *c)
>* Try to find the btree node with that references the oldest journal
>* entry, best is our current candidate and i

Re: [PATCH 1/2] bcache: add journal statistic

2018-01-31 Thread Michael Lyle
LGTM except for formatting / an extra newline (will fix) -- in my test
branch for possible 4.16

Reviewed-by: Michael Lyle <ml...@lyle.org>

On 01/26/2018 12:23 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui <tang.jun...@zte.com.cn>
> 
> Sometimes the journal takes up a lot of CPU, and we need statistics
> to know what the journal is doing. So this patch provides
> some journal statistics:
> 1) reclaim: how many times the journal tries to reclaim resources,
>usually because the journal buckets and/or the pins are exhausted.
> 2) flush_write: how many times the journal tries to flush a btree node
>to the cache device, usually because the journal buckets are exhausted.
> 3) retry_flush_write: how many times the journal retries flushing
>the next btree node, usually because the previous btree node has been
>flushed by another thread.
> We show these statistics via the sysfs interface. Through them
> we can see the status of the journal module when CPU usage is too
> high.
> 
> Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
> ---
>  drivers/md/bcache/bcache.h  |  5 +
>  drivers/md/bcache/journal.c |  5 +
>  drivers/md/bcache/sysfs.c   | 15 +++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index abd31e8..0432e28 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -647,6 +647,11 @@ struct cache_set {
>   atomic_long_t   writeback_keys_done;
>   atomic_long_t   writeback_keys_failed;
>  
> +
> + atomic_long_t   reclaim;
> + atomic_long_t   flush_write;
> + atomic_long_t   retry_flush_write;
> +
>   enum{
>   ON_ERROR_UNREGISTER,
>   ON_ERROR_PANIC,
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 02a98dd..47fd0b8 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -372,6 +372,8 @@ static void btree_flush_write(struct cache_set *c)
>*/
>   struct btree *b, *best;
>   unsigned i;
> +
> + atomic_long_inc(&c->flush_write);
>  retry:
>   best = NULL;
>  
> @@ -392,6 +394,7 @@ static void btree_flush_write(struct cache_set *c)
>   if (!btree_current_write(b)->journal) {
>   mutex_unlock(>write_lock);
>   /* We raced */
> + atomic_long_inc(&c->retry_flush_write);
>   goto retry;
>   }
>  
> @@ -471,6 +474,8 @@ static void journal_reclaim(struct cache_set *c)
>   unsigned iter, n = 0;
>   atomic_t p;
>  
> + atomic_long_inc(&c->reclaim);
> +
>   while (!atomic_read(&fifo_front(&c->journal.pin)))
>   fifo_pop(&c->journal.pin, p);
>  
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 234b2f5..0afbf1a 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -65,6 +65,9 @@
>  
>  read_attribute(state);
>  read_attribute(cache_read_races);
> +read_attribute(reclaim);
> +read_attribute(flush_write);
> +read_attribute(retry_flush_write);
>  read_attribute(writeback_keys_done);
>  read_attribute(writeback_keys_failed);
>  read_attribute(io_errors);
> @@ -543,6 +546,15 @@ static unsigned bch_average_key_size(struct cache_set *c)
>   sysfs_print(cache_read_races,
>   atomic_long_read(&c->cache_read_races));
>  
> + sysfs_print(reclaim,
> + atomic_long_read(&c->reclaim));
> +
> + sysfs_print(flush_write,
> + atomic_long_read(&c->flush_write));
> +
> + sysfs_print(retry_flush_write,
> + atomic_long_read(&c->retry_flush_write));
> +
>   sysfs_print(writeback_keys_done,
>   atomic_long_read(&c->writeback_keys_done));
>   sysfs_print(writeback_keys_failed,
> @@ -729,6 +741,9 @@ static void bch_cache_set_internal_release(struct kobject *k)
>  
>   &sysfs_bset_tree_stats,
>   &sysfs_cache_read_races,
> + &sysfs_reclaim,
> + &sysfs_flush_write,
> + &sysfs_retry_flush_write,
>   &sysfs_writeback_keys_done,
>   &sysfs_writeback_keys_failed,
>  
> 



Re: how to understand bcache

2018-01-25 Thread Michael Lyle
On Wed, Jan 24, 2018 at 1:34 AM,   wrote:
> you also can ask our dear maintainer Mike, who knows all about bcache,
> but sometimes he is very busy. Normally you would hear nothing
> from him for a week or two, sometimes maybe three, but never five,
> since the merge window comes.

;)  I didn't think I was this bad.  I'll try to do better.

Mike


Re: [PATCH v3 00/13] bcache: device failure handling improvement

2018-01-25 Thread Michael Lyle
Hey Coly,

On Thu, Jan 25, 2018 at 8:56 PM, Coly Li  wrote:
> In order to make the failure handling simple and fast, I will not
> distinct each bcache device whether it has dirty data on cache set. Once
> the cache set is dirty, all attached bcache devices will be stopped.

Surely it shouldn't stop any writethrough/writearound ones though, right?

> It seems when this option is enabled by users, it should work for
> writeback and writethrough, and no side effect to writearound and non.

Writethrough should be the same as writearound, shouldn't it?

Thanks,

Mike


[PATCH] bcache: closures: move control bits one bit right

2018-01-09 Thread Michael Lyle
Otherwise, architectures that do negated adds of atomics (e.g. s390)
to do atomic_sub fail in closure_set_stopped.

Signed-off-by: Michael Lyle <ml...@lyle.org>
Cc: Kent Overstreet <kent.overstr...@gmail.com>
Reported-by: kbuild test robot <l...@intel.com>
---
 drivers/md/bcache/closure.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 392a87cf1b92..3b9dfc9962ad 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -127,10 +127,10 @@ enum closure_state {
 * annotate where references are being transferred.
 */
 
-   CLOSURE_BITS_START  = (1U << 27),
-   CLOSURE_DESTRUCTOR  = (1U << 27),
-   CLOSURE_WAITING = (1U << 29),
-   CLOSURE_RUNNING = (1U << 31),
+   CLOSURE_BITS_START  = (1U << 26),
+   CLOSURE_DESTRUCTOR  = (1U << 26),
+   CLOSURE_WAITING = (1U << 28),
+   CLOSURE_RUNNING = (1U << 30),
 };
 
 #define CLOSURE_GUARD_MASK \
-- 
2.14.1



[416 PATCH 12/13] bcache: fix misleading error message in bch_count_io_errors()

2018-01-08 Thread Michael Lyle
From: Coly Li <col...@suse.de>

Bcache only does recoverable I/O for read operations by calling
cached_dev_read_error(). For write operations there is no I/O recovery for
failed requests.

But in bch_count_io_errors() no matter read or write I/Os, before errors
counter reaches the io error limit, pr_err() always prints "IO error on %,
recovering". For write requests this information is misleading, because
there is no I/O recovery at all.

This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
prints "recovering" by pr_err() when the bio direction is READ.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/bcache.h|  2 +-
 drivers/md/bcache/io.c| 13 +
 drivers/md/bcache/super.c |  4 +++-
 drivers/md/bcache/writeback.c |  4 +++-
 4 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 9117da5f494b..5e2d4e80198e 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -862,7 +862,7 @@ static inline void wake_up_allocators(struct cache_set *c)
 
 /* Forward declarations */
 
-void bch_count_io_errors(struct cache *, blk_status_t, const char *);
+void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
 void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
  blk_status_t, const char *);
 void bch_bbio_endio(struct cache_set *, struct bio *, blk_status_t,
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fac97ec2d0e2..a783c5a41ff1 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -51,7 +51,10 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
 
 /* IO errors */
 
-void bch_count_io_errors(struct cache *ca, blk_status_t error, const char *m)
+void bch_count_io_errors(struct cache *ca,
+blk_status_t error,
+int is_read,
+const char *m)
 {
/*
 * The halflife of an error is:
@@ -94,8 +97,9 @@ void bch_count_io_errors(struct cache *ca, blk_status_t error, const char *m)
errors >>= IO_ERROR_SHIFT;
 
if (errors < ca->set->error_limit)
-   pr_err("%s: IO error on %s, recovering",
-  bdevname(ca->bdev, buf), m);
+   pr_err("%s: IO error on %s%s",
+  bdevname(ca->bdev, buf), m,
+  is_read ? ", recovering." : ".");
else
bch_cache_set_error(ca->set,
"%s: too many IO errors %s",
@@ -108,6 +112,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
 {
struct bbio *b = container_of(bio, struct bbio, bio);
struct cache *ca = PTR_CACHE(c, &b->key, 0);
+   int is_read = (bio_data_dir(bio) == READ ? 1 : 0);
 
unsigned threshold = op_is_write(bio_op(bio))
? c->congested_write_threshold_us
@@ -129,7 +134,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
atomic_inc(&ca->congested);
}
 
-   bch_count_io_errors(ca, error, m);
+   bch_count_io_errors(ca, error, is_read, m);
 }
 
 void bch_bbio_endio(struct cache_set *c, struct bio *bio,
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index d13e4ccb30a0..133b81225ea9 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -274,7 +274,9 @@ static void write_super_endio(struct bio *bio)
 {
struct cache *ca = bio->bi_private;
 
-   bch_count_io_errors(ca, bio->bi_status, "writing superblock");
+   /* is_read = 0 */
+   bch_count_io_errors(ca, bio->bi_status, 0,
+   "writing superblock");
closure_put(&ca->set->sb_write);
 }
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f82ffb2e9b9b..31b0a292a619 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -244,8 +244,10 @@ static void read_dirty_endio(struct bio *bio)
struct keybuf_key *w = bio->bi_private;
struct dirty_io *io = w->private;
 
+   /* is_read = 1 */
bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0),
-   bio->bi_status, "reading dirty data from cache");
+   bio->bi_status, 1,
+   "reading dirty data from cache");
 
dirty_endio(bio);
 }
-- 
2.14.1



[416 PATCH 13/13] bcache: fix writeback target calc on large devices

2018-01-08 Thread Michael Lyle
Bcache needs to scale the dirty data in the cache over the multiple
backing disks in order to calculate writeback rates for each.
The previous code did this by multiplying the target number of dirty
sectors by the backing device size, and expected it to fit into a
uint64_t; this blows up on relatively small backing devices.

The new approach figures out the bdev's share in 16384ths of the overall
cached data.  This is chosen to cope well when bdevs drastically vary in
size and to ensure that bcache can cross the petabyte boundary for each
backing device.

This has been improved based on Tang Junhui's feedback to ensure that
every device gets a share of dirty data, no matter how small it is
compared to the total backing pool.

The existing mechanism is very limited; this is purely a bug fix to
remove limits on volume size.  However, there still needs to be a change
to make this "fair" over many volumes where some are idle.

Reported-by: Jack Douglas <j...@douglastechnology.co.uk>
Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/writeback.c | 31 +++
 drivers/md/bcache/writeback.h |  7 +++
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 31b0a292a619..51306a19ab03 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -18,17 +18,39 @@
 #include <trace/events/bcache.h>
 
 /* Rate limiting */
-
-static void __update_writeback_rate(struct cached_dev *dc)
+static uint64_t __calc_target_rate(struct cached_dev *dc)
 {
struct cache_set *c = dc->disk.c;
+
+   /*
+* This is the size of the cache, minus the amount used for
+* flash-only devices
+*/
uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
bcache_flash_devs_sectors_dirty(c);
+
+   /*
+* Unfortunately there is no control of global dirty data.  If the
+* user states that they want 10% dirty data in the cache, and has,
+* e.g., 5 backing volumes of equal size, we try and ensure each
+* backing volume uses about 2% of the cache for dirty data.
+*/
+   uint32_t bdev_share =
+   div64_u64(bdev_sectors(dc->bdev) << WRITEBACK_SHARE_SHIFT,
+   c->cached_dev_sectors);
+
uint64_t cache_dirty_target =
div_u64(cache_sectors * dc->writeback_percent, 100);
-   int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
-  c->cached_dev_sectors);
 
+   /* Ensure each backing dev gets at least one dirty share */
+   if (bdev_share < 1)
+   bdev_share = 1;
+
+   return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT;
+}
+
+static void __update_writeback_rate(struct cached_dev *dc)
+{
/*
 * PI controller:
 * Figures out the amount that should be written per second.
@@ -49,6 +71,7 @@ static void __update_writeback_rate(struct cached_dev *dc)
 * This acts as a slow, long-term average that is not subject to
 * variations in usage like the p term.
 */
+   int64_t target = __calc_target_rate(dc);
	int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
int64_t error = dirty - target;
int64_t proportional_scaled =
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index f102b1f9bc51..66f1c527fa24 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -8,6 +8,13 @@
 #define MAX_WRITEBACKS_IN_PASS  5
 #define MAX_WRITESIZE_IN_PASS   5000   /* *512b */
 
+/*
+ * 14 (16384ths) is chosen here as something that each backing device
+ * should be a reasonable fraction of the share, and not to blow up
+ * until individual backing devices are a petabyte.
+ */
+#define WRITEBACK_SHARE_SHIFT   14
+
 static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
 {
uint64_t i, ret = 0;
-- 
2.14.1



[416 PATCH 10/13] bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()

2018-01-08 Thread Michael Lyle
From: Zhai Zhaoxuan <kxuan...@gmail.com>

The functions cached_dev_make_request() and flash_dev_make_request() call
generic_start_io_acct() with (struct bcache_device)->disk when they start a
closure. Then the function bio_complete() calls generic_end_io_acct() with
(struct search)->orig_bio->bi_disk when the closure is done.
Since the `bi_disk` is not the bcache device, the generic_end_io_acct() is
called with a wrong device queue.

It causes the "inflight" counter (in struct hd_struct) to keep increasing
without decreasing.

This patch fixes the problem by calling generic_end_io_acct() with
(struct bcache_device)->disk.

Signed-off-by: Zhai Zhaoxuan <kxuan...@gmail.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Coly Li <col...@suse.de>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/request.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ddd941056f3c..1a46b41dac70 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -633,8 +633,8 @@ static void request_endio(struct bio *bio)
 static void bio_complete(struct search *s)
 {
if (s->orig_bio) {
-   struct request_queue *q = s->orig_bio->bi_disk->queue;
-   generic_end_io_acct(q, bio_data_dir(s->orig_bio),
+   generic_end_io_acct(s->d->disk->queue,
+   bio_data_dir(s->orig_bio),
			    &s->d->disk->part0, s->start_time);
 
trace_bcache_request_end(s->d, s->orig_bio);
-- 
2.14.1



[416 PATCH 02/13] bcache: stop writeback thread after detaching

2018-01-08 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

Currently, when a cached device is detached from the cache, the writeback
thread is not stopped and the writeback_rate_update work is not canceled.
For example,
after the following command:
echo 1 >/sys/block/sdb/bcache/detach
you can still see the writeback thread. Then you attach the device to the
cache again, bcache will create another writeback thread, for example,
after below command:
echo  ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
then you will see 2 writeback threads.
This patch stops the writeback thread and cancels the writeback_rate_update
work when the cached device is detached from the cache.

Compare with patch v1, this v2 patch moves code down into the register
lock for safety in case of any future changes as Coly and Mike suggested.

[edit by mlyle: commit log spelling/formatting]

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Signed-off-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/super.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8399fe0651f2..553e841e897d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -906,6 +906,12 @@ static void cached_dev_detach_finish(struct work_struct *w)
 
	mutex_lock(&bch_register_lock);
 
+	cancel_delayed_work_sync(&dc->writeback_rate_update);
+   if (!IS_ERR_OR_NULL(dc->writeback_thread)) {
+   kthread_stop(dc->writeback_thread);
+   dc->writeback_thread = NULL;
+   }
+
	memset(&dc->sb.set_uuid, 0, 16);
	SET_BDEV_STATE(&dc->sb, BDEV_STATE_NONE);
 
-- 
2.14.1



[416 PATCH 11/13] bcache: reduce cache_set devices iteration by devices_max_used

2018-01-08 Thread Michael Lyle
From: Coly Li <col...@suse.de>

Member devices of struct cache_set is used to reference all attached
bcache devices to this cache set. If it is treated as array of pointers,
size of devices[] is indicated by member nr_uuids of struct cache_set.

nr_uuids is calculated in drivers/md/bcache/super.c:bch_cache_set_alloc(),
bucket_bytes(c) / sizeof(struct uuid_entry)
Bucket size is determined by user space tool "make-bcache", by default it
is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default
nr_uuids value is 4096 from the above calculation.

Every time the bcache code iterates the bcache devices of a cache set, all
4096 pointers are checked even if only 1 bcache device is attached to the
cache set; that's a waste of time and unnecessary.

This patch adds a member devices_max_used to struct cache_set. Its value
is 1 + the maximum used index of devices[] in a cache set. When iterating
all valid bcache devices of a cache set, use c->devices_max_used in
for-loop may reduce a lot of useless checking.

Personally, my motivation for this patch is not performance; I use it
in bcache debugging, where it helps me narrow down the scope when checking
valid bcache devices of a cache set.

Signed-off-by: Coly Li <col...@suse.de>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/bcache.h| 1 +
 drivers/md/bcache/btree.c | 2 +-
 drivers/md/bcache/super.c | 9 ++---
 drivers/md/bcache/writeback.h | 2 +-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5f7b0b2513cc..9117da5f494b 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -497,6 +497,7 @@ struct cache_set {
int caches_loaded;
 
struct bcache_device**devices;
+   unsigneddevices_max_used;
struct list_headcached_devs;
uint64_tcached_dev_sectors;
struct closure  caching;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 9e30713dbdb8..bf3a48aa9a9a 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1679,7 +1679,7 @@ static void bch_btree_gc_finish(struct cache_set *c)
 
/* don't reclaim buckets to which writeback keys point */
rcu_read_lock();
-   for (i = 0; i < c->nr_uuids; i++) {
+   for (i = 0; i < c->devices_max_used; i++) {
struct bcache_device *d = c->devices[i];
struct cached_dev *dc;
struct keybuf_key *w, *n;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 553e841e897d..d13e4ccb30a0 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -721,6 +721,9 @@ static void bcache_device_attach(struct bcache_device *d, struct cache_set *c,
d->c = c;
c->devices[id] = d;
 
+   if (id >= c->devices_max_used)
+   c->devices_max_used = id + 1;
+
	closure_get(&c->caching);
 }
 
@@ -1267,7 +1270,7 @@ static int flash_devs_run(struct cache_set *c)
struct uuid_entry *u;
 
for (u = c->uuids;
-u < c->uuids + c->nr_uuids && !ret;
+u < c->uuids + c->devices_max_used && !ret;
 u++)
if (UUID_FLASH_ONLY(u))
ret = flash_dev_run(c, u);
@@ -1433,7 +1436,7 @@ static void __cache_set_unregister(struct closure *cl)
 
	mutex_lock(&bch_register_lock);
 
-   for (i = 0; i < c->nr_uuids; i++)
+   for (i = 0; i < c->devices_max_used; i++)
if (c->devices[i]) {
if (!UUID_FLASH_ONLY(>uuids[i]) &&
			    test_bit(CACHE_SET_UNREGISTERING, &c->flags)) {
@@ -1496,7 +1499,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
c->bucket_bits  = ilog2(sb->bucket_size);
c->block_bits   = ilog2(sb->block_size);
c->nr_uuids = bucket_bytes(c) / sizeof(struct uuid_entry);
-
+   c->devices_max_used = 0;
c->btree_pages  = bucket_pages(c);
if (c->btree_pages > BTREE_MAX_PAGES)
c->btree_pages = max_t(int, c->btree_pages / 4,
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 6d26927267f8..f102b1f9bc51 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -24,7 +24,7 @@ static inline uint64_t bcache_flash_devs_sectors_dirty(struct cache_set *c)
 
	mutex_lock(&bch_register_lock);
 
-   for (i = 0; i < c->nr_uuids; i++) {
+   for (i = 0; i < c->devices_max_used; i++) {
struct bcache_device *d = c->devices[i];
 
		if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
-- 
2.14.1



[416 PATCH 09/13] bcache: mark closure_sync() __sched

2018-01-08 Thread Michael Lyle
From: Kent Overstreet <kent.overstr...@gmail.com>

[edit by mlyle: include sched/debug.h to get __sched]

Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/closure.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index ca7ace6962a4..7f12920c14f7 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -8,6 +8,7 @@
 #include <linux/debugfs.h>
 #include <linux/module.h>
 #include <linux/seq_file.h>
+#include <linux/sched/debug.h>
 
 #include "closure.h"
 
@@ -107,7 +108,7 @@ static void closure_sync_fn(struct closure *cl)
wake_up_process(cl->s->task);
 }
 
-void __closure_sync(struct closure *cl)
+void __sched __closure_sync(struct closure *cl)
 {
struct closure_syncer s = { .task = current };
 
-- 
2.14.1



[416 PATCH 08/13] bcache: Fix, improve efficiency of closure_sync()

2018-01-08 Thread Michael Lyle
From: Kent Overstreet <kent.overstr...@gmail.com>

Eliminates cases where sync can race and fail to complete / get stuck.
Removes many status flags and simplifies entering-and-exiting closure
sleeping behaviors.

[mlyle: fixed conflicts due to changed return behavior in mainline.
extended commit comment, and squashed down two commits that were mostly
contradictory to get to this state.  Changed __set_current_state to
set_current_state per Jens review comment]

Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/closure.c | 46 +-
 drivers/md/bcache/closure.h | 60 +
 2 files changed, 46 insertions(+), 60 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 1841d0359bac..ca7ace6962a4 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -18,10 +18,6 @@ static inline void closure_put_after_sub(struct closure *cl, int flags)
BUG_ON(flags & CLOSURE_GUARD_MASK);
BUG_ON(!r && (flags & ~CLOSURE_DESTRUCTOR));
 
-   /* Must deliver precisely one wakeup */
-   if (r == 1 && (flags & CLOSURE_SLEEPING))
-   wake_up_process(cl->task);
-
if (!r) {
if (cl->fn && !(flags & CLOSURE_DESTRUCTOR)) {
			atomic_set(&cl->remaining,
@@ -100,28 +96,34 @@ bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl)
 }
 EXPORT_SYMBOL(closure_wait);
 
-/**
- * closure_sync - sleep until a closure has nothing left to wait on
- *
- * Sleeps until the refcount hits 1 - the thread that's running the closure owns
- * the last refcount.
- */
-void closure_sync(struct closure *cl)
+struct closure_syncer {
+   struct task_struct  *task;
+   int done;
+};
+
+static void closure_sync_fn(struct closure *cl)
 {
-   while (1) {
-   __closure_start_sleep(cl);
-   closure_set_ret_ip(cl);
+   cl->s->done = 1;
+   wake_up_process(cl->s->task);
+}
 
-		if ((atomic_read(&cl->remaining) &
-CLOSURE_REMAINING_MASK) == 1)
-   break;
+void __closure_sync(struct closure *cl)
+{
+   struct closure_syncer s = { .task = current };
 
+	cl->s = &s;
+   continue_at(cl, closure_sync_fn, NULL);
+
+   while (1) {
+   set_current_state(TASK_UNINTERRUPTIBLE);
+   if (s.done)
+   break;
schedule();
}
 
-   __closure_end_sleep(cl);
+   __set_current_state(TASK_RUNNING);
 }
-EXPORT_SYMBOL(closure_sync);
+EXPORT_SYMBOL(__closure_sync);
 
 #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
 
@@ -168,12 +170,10 @@ static int debug_seq_show(struct seq_file *f, void *data)
   cl, (void *) cl->ip, cl->fn, cl->parent,
   r & CLOSURE_REMAINING_MASK);
 
-   seq_printf(f, "%s%s%s%s\n",
+   seq_printf(f, "%s%s\n",
   test_bit(WORK_STRUCT_PENDING_BIT,
			   work_data_bits(&cl->work)) ? "Q" : "",
-  r & CLOSURE_RUNNING  ? "R" : "",
-  r & CLOSURE_STACK? "S" : "",
-  r & CLOSURE_SLEEPING ? "Sl" : "");
+  r & CLOSURE_RUNNING  ? "R" : "");
 
if (r & CLOSURE_WAITING)
seq_printf(f, " W %pF\n",
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index ccfbea6f9f6b..392a87cf1b92 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -103,6 +103,7 @@
  */
 
 struct closure;
+struct closure_syncer;
 typedef void (closure_fn) (struct closure *);
 
 struct closure_waitlist {
@@ -115,10 +116,6 @@ enum closure_state {
 * the thread that owns the closure, and cleared by the thread that's
 * waking up the closure.
 *
-* CLOSURE_SLEEPING: Must be set before a thread uses a closure to sleep
-* - indicates that cl->task is valid and closure_put() may wake it up.
-* Only set or cleared by the thread that owns the closure.
-*
 * The rest are for debugging and don't affect behaviour:
 *
 * CLOSURE_RUNNING: Set when a closure is running (i.e. by
@@ -128,22 +125,16 @@ enum closure_state {
 * continue_at() and closure_return() clear it for you, if you're doing
 * something unusual you can use closure_set_dead() which also helps
 * annotate where references are being transferred.
-*
-* CLOSURE_STA

[416 PATCH 05/13] bcache: fix wrong return value in bch_debug_init()

2018-01-08 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

In bch_debug_init(), ret is always 0 and the return value is useless.
Change it to return 0 on success after calling debugfs_create_dir(),
else return a non-zero value.

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/bcache.h|   6 ---
 drivers/md/bcache/debug.c |   5 +-
 drivers/md/bcache/writeback.c | 120 +-
 drivers/md/bcache/writeback.h |   3 ++
 4 files changed, 87 insertions(+), 47 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e017e1..1784e50eb857 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -323,12 +323,6 @@ struct cached_dev {
struct bch_ratelimitwriteback_rate;
struct delayed_work writeback_rate_update;
 
-   /*
-* Internal to the writeback code, so read_dirty() can keep track of
-* where it's at.
-*/
-   sector_tlast_read;
-
/* Limit number of writeback bios in flight */
struct semaphorein_flight;
struct task_struct  *writeback_thread;
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 879ab21074c6..af89408befe8 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -251,8 +251,7 @@ void bch_debug_exit(void)
 
 int __init bch_debug_init(struct kobject *kobj)
 {
-   int ret = 0;
-
debug = debugfs_create_dir("bcache", NULL);
-   return ret;
+
+   return IS_ERR_OR_NULL(debug);
 }
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 1ac2af6128b1..479095987f22 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -237,7 +237,9 @@ static void read_dirty_submit(struct closure *cl)
 static void read_dirty(struct cached_dev *dc)
 {
unsigned delay = 0;
-   struct keybuf_key *w;
+   struct keybuf_key *next, *keys[MAX_WRITEBACKS_IN_PASS], *w;
+   size_t size;
+   int nk, i;
struct dirty_io *io;
struct closure cl;
 
@@ -248,45 +250,87 @@ static void read_dirty(struct cached_dev *dc)
 * mempools.
 */
 
-   while (!kthread_should_stop()) {
-
-		w = bch_keybuf_next(&dc->writeback_keys);
-   if (!w)
-   break;
-
-		BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
-
-   if (KEY_START(>key) != dc->last_read ||
-   jiffies_to_msecs(delay) > 50)
-   while (!kthread_should_stop() && delay)
-   delay = schedule_timeout_interruptible(delay);
-
-   dc->last_read   = KEY_OFFSET(>key);
-
-   io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec)
-* DIV_ROUND_UP(KEY_SIZE(>key), PAGE_SECTORS),
-GFP_KERNEL);
-   if (!io)
-   goto err;
-
-   w->private  = io;
-   io->dc  = dc;
-
-   dirty_init(w);
-		bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
-		io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
-		bio_set_dev(&io->bio, PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
-   io->bio.bi_end_io   = read_dirty_endio;
-
-		if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
-   goto err_free;
-
-		trace_bcache_writeback(&w->key);
+	next = bch_keybuf_next(&dc->writeback_keys);
+
+   while (!kthread_should_stop() && next) {
+   size = 0;
+   nk = 0;
+
+   do {
+			BUG_ON(ptr_stale(dc->disk.c, &next->key, 0));
+
+   /*
+* Don't combine too many operations, even if they
+* are all small.
+*/
+   if (nk >= MAX_WRITEBACKS_IN_PASS)
+   break;
+
+   /*
+* If the current operation is very large, don't
+* further combine operations.
+*/
+   if (size >= MAX_WRITESIZE_IN_PASS)
+   break;
+
+   /*
+* Operations are only eligible to be combined
+* if they are contiguous.
+*
+* TODO: add a heuristic willing to fire a
+* certain amount of non-contiguous IO per pass,
+* so that we can benefit from backing device
+* command queueing.
+*/
+			if ((nk != 0) && bkey_cmp(&keys[nk-1]->key,
+

[416 PATCH 04/13] bcache: segregate flash only volume write streams

2018-01-08 Thread Michael Lyle
From: Tang Junhui <tang.jun...@zte.com.cn>

In a scenario where there are some flash-only volumes and some cached
devices, when many tasks request these devices in writeback mode, the
write IOs may fall into the same bucket as below:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would look
like the bucket below:
| free | flash data | free | free | flash data |

So there is a lot of free space in this bucket, but since data of the
flash-only volumes still exists, the bucket cannot be reclaimed, which
wastes bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.

Compared to the v1 patch, this patch does not add an additional open
bucket list; it tries its best to segregate flash-only volume write
streams from cached devices. Sectors of flash-only volumes may still be
mixed with dirty sectors of a cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui <tang.jun...@zte.com.cn>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Signed-off-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/alloc.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a0cc1bc6d884..6cc6c0f9c3a9 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -525,15 +525,21 @@ struct open_bucket {
 
 /*
  * We keep multiple buckets open for writes, and try to segregate different
- * write streams for better cache utilization: first we look for a bucket where
- * the last write to it was sequential with the current write, and failing that
- * we look for a bucket that was last used by the same task.
+ * write streams for better cache utilization: first we try to segregate flash
+ * only volume write streams from cached devices, secondly we look for a bucket
+ * where the last write to it was sequential with the current write, and
+ * failing that we look for a bucket that was last used by the same task.
  *
  * The ideas is if you've got multiple tasks pulling data into the cache at the
  * same time, you'll get better cache utilization if you try to segregate their
  * data and preserve locality.
  *
- * For example, say you've starting Firefox at the same time you're copying a
+ * For example, dirty sectors of flash only volume is not reclaimable, if their
+ * dirty sectors mixed with dirty sectors of cached device, such buckets will
+ * be marked as dirty and won't be reclaimed, though the dirty data of cached
+ * device have been written back to backend device.
+ *
+ * And say you've starting Firefox at the same time you're copying a
  * bunch of files. Firefox will likely end up being fairly hot and stay in the
  * cache awhile, but the data you copied might not be; if you wrote all that
  * data to the same buckets it'd get invalidated at the same time.
@@ -550,7 +556,10 @@ static struct open_bucket *pick_data_bucket(struct cache_set *c,
struct open_bucket *ret, *ret_task = NULL;
 
	list_for_each_entry_reverse(ret, &c->data_buckets, list)
-		if (!bkey_cmp(&ret->key, search))
+		if (UUID_FLASH_ONLY(&c->uuids[KEY_INODE(&ret->key)]) !=
+		    UUID_FLASH_ONLY(&c->uuids[KEY_INODE(search)]))
+			continue;
+		else if (!bkey_cmp(&ret->key, search))
goto found;
else if (ret->last_write_point == write_point)
ret_task = ret;
-- 
2.14.1



[416 PATCH 07/13] bcache: allow quick writeback when backing idle

2018-01-08 Thread Michael Lyle
If the control system would wait for at least half a second, and there's
been no reqs hitting the backing disk for awhile: use an alternate mode
where we have at most one contiguous set of writebacks in flight at a
time. (But don't otherwise delay).  If front-end IO appears, it will
still be quick, as it will only have to contend with one real operation
in flight.  But otherwise, we'll be sending data to the backing disk as
quickly as it can accept it (with one op at a time).

Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
Acked-by: Coly Li <col...@suse.de>
---
 drivers/md/bcache/bcache.h|  7 +++
 drivers/md/bcache/request.c   |  1 +
 drivers/md/bcache/writeback.c | 21 +
 3 files changed, 29 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 3be0fcc19b1f..5f7b0b2513cc 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -320,6 +320,13 @@ struct cached_dev {
 */
atomic_thas_dirty;
 
+   /*
+* Set to zero by things that touch the backing volume-- except
+* writeback.  Incremented by writeback.  Used to determine when to
+* accelerate idle writeback.
+*/
+   atomic_tbacking_idle;
+
struct bch_ratelimitwriteback_rate;
struct delayed_work writeback_rate_update;
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 52b4ce24f9e2..ddd941056f3c 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -996,6 +996,7 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
struct cached_dev *dc = container_of(d, struct cached_dev, disk);
int rw = bio_data_dir(bio);
 
+	atomic_set(&dc->backing_idle, 0);
	generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
 
bio_set_dev(bio, dc->bdev);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 6e1d2fde43df..f82ffb2e9b9b 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -356,6 +356,27 @@ static void read_dirty(struct cached_dev *dc)
 
delay = writeback_delay(dc, size);
 
+   /* If the control system would wait for at least half a
+* second, and there's been no reqs hitting the backing disk
+* for awhile: use an alternate mode where we have at most
+* one contiguous set of writebacks in flight at a time.  If
+* someone wants to do IO it will be quick, as it will only
+* have to contend with one operation in flight, and we'll
+* be round-tripping data to the backing disk as quickly as
+* it can accept it.
+*/
+   if (delay >= HZ / 2) {
+   /* 3 means at least 1.5 seconds, up to 7.5 if we
+* have slowed way down.
+*/
+			if (atomic_inc_return(&dc->backing_idle) >= 3) {
+   /* Wait for current I/Os to finish */
+				closure_sync(&cl);
+   /* And immediately launch a new set. */
+   delay = 0;
+   }
+   }
+
while (!kthread_should_stop() && delay) {
schedule_timeout_interruptible(delay);
delay = writeback_delay(dc, 0);
-- 
2.14.1



[416 PATCH 06/13] bcache: writeback: properly order backing device IO

2018-01-08 Thread Michael Lyle
Writeback keys are presently iterated and dispatched for writeback in
order of the logical block address on the backing device.  Multiple may
be, in parallel, read from the cache device and then written back
(especially when there are contiguous I/O).

However-- there was no guarantee with the existing code that the writes
would be issued in LBA order, as the reads from the cache device are
often re-ordered.  In turn, when writing back quickly, the backing disk
often has to seek backwards-- this slows writeback and increases
utilization.

This patch introduces an ordering mechanism that guarantees that the
original order of issue is maintained for the write portion of the I/O.
Performance for writeback is significantly improved when there are
multiple contiguous keys or high writeback rates.

Signed-off-by: Michael Lyle <ml...@lyle.org>
Reviewed-by: Tang Junhui <tang.jun...@zte.com.cn>
Tested-by: Tang Junhui <tang.jun...@zte.com.cn>
---
 drivers/md/bcache/bcache.h|  8 
 drivers/md/bcache/writeback.c | 29 +
 2 files changed, 37 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1784e50eb857..3be0fcc19b1f 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -330,6 +330,14 @@ struct cached_dev {
 
struct keybuf   writeback_keys;
 
+   /*
+* Order the write-half of writeback operations strongly in dispatch
+* order.  (Maintain LBA order; don't allow reads completing out of
+* order to re-order the writes...)
+*/
+   struct closure_waitlist writeback_ordering_wait;
+   atomic_twriteback_sequence_next;
+
/* For tracking sequential IO */
 #define RECENT_IO_BITS 7
 #define RECENT_IO  (1 << RECENT_IO_BITS)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 479095987f22..6e1d2fde43df 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -116,6 +116,7 @@ static unsigned writeback_delay(struct cached_dev *dc, unsigned sectors)
 struct dirty_io {
struct closure  cl;
struct cached_dev   *dc;
+   uint16_tsequence;
struct bio  bio;
 };
 
@@ -194,6 +195,27 @@ static void write_dirty(struct closure *cl)
 {
struct dirty_io *io = container_of(cl, struct dirty_io, cl);
struct keybuf_key *w = io->bio.bi_private;
+   struct cached_dev *dc = io->dc;
+
+   uint16_t next_sequence;
+
+   if (atomic_read(>writeback_sequence_next) != io->sequence) {
+   /* Not our turn to write; wait for a write to complete */
+		closure_wait(&dc->writeback_ordering_wait, cl);
+
+   if (atomic_read(>writeback_sequence_next) == io->sequence) {
+   /*
+* Edge case-- it happened in indeterminate order
+* relative to when we were added to wait list..
+*/
+			closure_wake_up(&dc->writeback_ordering_wait);
+   }
+
+   continue_at(cl, write_dirty, io->dc->writeback_write_wq);
+   return;
+   }
+
+   next_sequence = io->sequence + 1;
 
/*
 * IO errors are signalled using the dirty bit on the key.
@@ -211,6 +233,9 @@ static void write_dirty(struct closure *cl)
		closure_bio_submit(&io->bio, cl);
}
 
+	atomic_set(&dc->writeback_sequence_next, next_sequence);
+	closure_wake_up(&dc->writeback_ordering_wait);
+
continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
 }
 
@@ -242,7 +267,10 @@ static void read_dirty(struct cached_dev *dc)
int nk, i;
struct dirty_io *io;
struct closure cl;
+   uint16_t sequence = 0;
 
+	BUG_ON(!llist_empty(&dc->writeback_ordering_wait.list));
+	atomic_set(&dc->writeback_sequence_next, sequence);
	closure_init_stack(&cl);
 
/*
@@ -303,6 +331,7 @@ static void read_dirty(struct cached_dev *dc)
 
w->private  = io;
io->dc  = dc;
+   io->sequence= sequence++;
 
dirty_init(w);
bio_set_op_attrs(>bio, REQ_OP_READ, 0);
-- 
2.14.1



[416 PATCH 01/13] bcache: ret IOERR when read meets metadata error

2018-01-08 Thread Michael Lyle
From: Rui Hua <huarui@gmail.com>

The read request might meet error when searching the btree, but the error
was not handled in cache_lookup(), and this kind of metadata failure will
not go into cached_dev_read_error(), finally, the upper layer will receive
bi_status=0.  In this patch we judge the metadata error by the return
value of bch_btree_map_keys(), there are two potential paths give rise to
the error:

1. Because the btree is not totally cached in memory, we may get an error
   when reading a btree node from the cache device (see bch_btree_node_get());
   the likely errno is -EIO or -ENOMEM

2. When read miss happens, bch_btree_insert_check_key() will be called to
   insert a "replace_key" to btree(see cached_dev_cache_miss(), just for
   doing preparatory work before insert the missed data to cache device),
   a failure can also happen in this situation, the likely errno is
   -ENOMEM

bch_btree_map_keys() will return MAP_DONE in the normal scenario, but we will
get either -EIO or -ENOMEM in the above two cases.  If this happens, we should
NOT recover data from the backing device (when the cache device is dirty)
because we don't know whether the bkeys the read request covered are all
clean.  And after that happens, s->iop.status is still its initial value (0) before
we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into
cached_dev_read_error(), and finally it can be passed to upper layer, or
recovered by reread from backing device.

[edit by mlyle: patch formatting, word-wrap, comment spelling,
commit log format]

Signed-off-by: Hua Rui <huarui@gmail.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
Signed-off-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/request.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c493fb947dc9..52b4ce24f9e2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -576,6 +576,7 @@ static void cache_lookup(struct closure *cl)
 {
struct search *s = container_of(cl, struct search, iop.cl);
	struct bio *bio = &s->bio.bio;
+   struct cached_dev *dc;
	int ret;
 
	bch_btree_op_init(&s->op, -1);
@@ -588,6 +589,27 @@ static void cache_lookup(struct closure *cl)
return;
}
 
+   /*
+    * We might meet an error when searching the btree; if that happens,
+    * we will get a negative ret.  In this scenario we should not recover
+    * data from the backing device (when the cache device is dirty),
+    * because we don't know whether the bkeys the read request covered
+    * are all clean.
+    *
+    * And after that happens, s->iop.status still holds its initial
+    * value before we submit s->bio.bio.
+    */
+   if (ret < 0) {
+   BUG_ON(ret == -EINTR);
+   if (s->d && s->d->c &&
+       !UUID_FLASH_ONLY(&s->d->c->uuids[s->d->id])) {
+           dc = container_of(s->d, struct cached_dev, disk);
+           if (dc && atomic_read(&dc->has_dirty))
+   s->recoverable = false;
+   }
+   if (!s->iop.status)
+   s->iop.status = BLK_STS_IOERR;
+   }
+
closure_return(cl);
 }
 
-- 
2.14.1



[416 PATCH 00/13] Bcache changes for 4.16

2018-01-08 Thread Michael Lyle
Jens,

Please pick up the following reviewed changes for 4.16.  There's some
small cleanliness changes, a few minor bug fixes (some in preparation
for larger work), and ongoing work on writeback performance:

 11 files changed, 303 insertions(+), 133 deletions(-)

Coly Li (2):
  bcache: reduce cache_set devices iteration by devices_max_used
  bcache: fix misleading error message in bch_count_io_errors()

Kent Overstreet (2):
  bcache: Fix, improve efficiency of closure_sync()
  bcache: mark closure_sync() __sched

Michael Lyle (3):
  bcache: writeback: properly order backing device IO
  bcache: allow quick writeback when backing idle
  bcache: fix writeback target calc on large devices

Rui Hua (1):
  bcache: ret IOERR when read meets metadata error

Tang Junhui (3):
  bcache: stop writeback thread after detaching
  bcache: segregate flash only volume write streams
  bcache: fix wrong return value in bch_debug_init()

Vasyl Gomonovych (1):
  bcache: Use PTR_ERR_OR_ZERO()

Zhai Zhaoxuan (1):
  bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()

Thanks,

Mike


[416 PATCH 03/13] bcache: Use PTR_ERR_OR_ZERO()

2018-01-08 Thread Michael Lyle
From: Vasyl Gomonovych <gomonov...@gmail.com>

Fix ptr_ret.cocci warnings:
drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Signed-off-by: Vasyl Gomonovych <gomonov...@gmail.com>
Reviewed-by: Michael Lyle <ml...@lyle.org>
---
 drivers/md/bcache/btree.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ebb1874218e7..9e30713dbdb8 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1804,10 +1804,7 @@ static int bch_gc_thread(void *arg)
 int bch_gc_thread_start(struct cache_set *c)
 {
c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
-   if (IS_ERR(c->gc_thread))
-   return PTR_ERR(c->gc_thread);
-
-   return 0;
+   return PTR_ERR_OR_ZERO(c->gc_thread);
 }
 
 /* Initial partial gc */
-- 
2.14.1


