[PATCH v2] bcache: set max writeback rate when I/O request is idle

2018-07-23 Thread Coly Li
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce a performance regression because multiple faster
writeback threads of the idle bcache devices will compete for the
btree-level locks with the bcache device that has I/O requests coming.

This patch fixes the above issue by only permitting fast writeback when
all bcache devices attached to the cache set are idle. And if one of the
bcache devices has a new I/O request coming, all writeback throughput is
minimized immediately and the PI controller __update_writeback_rate()
decides the upcoming writeback rate for each bcache device.

Also, when all bcache devices are idle, limiting the writeback rate to a
small number is a waste of throughput, especially when backing devices
are slower non-rotational devices (e.g. SATA SSD). This patch sets a max
writeback rate for each backing device if the whole cache set is idle. A
faster writeback rate in idle time means new I/Os may have more
available space for dirty data, and people may observe better write
performance then.

Please note that bcache may change its cache mode at run time, and this
patch still works if the cache mode is switched away from writeback mode
while there is still dirty data on the cache.
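The idle-detection mechanism described above can be sketched as a small user-space model (plain ints stand in for the kernel atomics; the function names mirror the patch, but the threshold and the rate ceiling below are illustrative assumptions, not the kernel's actual values):

```c
#include <assert.h>

/* Simplified model of the patch's idle tracking: every I/O request on any
 * bcache device resets the cache set's idle counter and drops writeback to
 * the minimum rate; the periodic rate-update worker increments the counter
 * and, once the whole set has been quiet long enough, switches every
 * device to the maximum writeback rate. */
#define MAX_WRITEBACK_RATE (1 << 20)	/* illustrative ceiling */

struct cache_set_model {
	int idle_counter;          /* incremented by the update worker   */
	int at_max_writeback_rate; /* 1 while writing back at full speed */
	int writeback_rate;        /* single shared rate, for simplicity */
};

/* Models cached_dev_make_request(): any front-end I/O ends the idle state. */
void on_io_request(struct cache_set_model *c)
{
	c->idle_counter = 0;
	if (c->at_max_writeback_rate) {
		c->at_max_writeback_rate = 0;
		c->writeback_rate = 1; /* minimum; the PI controller takes over */
	}
}

/* Models the periodic update_writeback_rate() worker. */
void on_rate_update(struct cache_set_model *c, int idle_threshold)
{
	if (++c->idle_counter >= idle_threshold) {
		c->at_max_writeback_rate = 1;
		c->writeback_rate = MAX_WRITEBACK_RATE;
	}
}
```

This also shows why the reset path must touch the whole cache set: a single busy device zeroes the shared counter, so every other device leaves the max-rate state at once.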

Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle")
Cc: sta...@vger.kernel.org #4.16+
Signed-off-by: Coly Li 
Tested-by: Kai Krakow 
Cc: Michael Lyle 
Cc: Stefan Priebe 
---
Changelog:
v2, Fix a deadlock reported by Stefan Priebe.
v1, Initial version.

 drivers/md/bcache/bcache.h|  11 ++--
 drivers/md/bcache/request.c   |  51 ++-
 drivers/md/bcache/super.c |   1 +
 drivers/md/bcache/sysfs.c |  14 +++--
 drivers/md/bcache/util.c  |   2 +-
 drivers/md/bcache/util.h  |   2 +-
 drivers/md/bcache/writeback.c | 115 ++
 7 files changed, 155 insertions(+), 41 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index d6bf294f3907..469ab1a955e0 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -328,13 +328,6 @@ struct cached_dev {
 */
atomic_thas_dirty;
 
-   /*
-* Set to zero by things that touch the backing volume-- except
-* writeback.  Incremented by writeback.  Used to determine when to
-* accelerate idle writeback.
-*/
-   atomic_tbacking_idle;
-
struct bch_ratelimitwriteback_rate;
struct delayed_work writeback_rate_update;
 
@@ -514,6 +507,8 @@ struct cache_set {
struct cache_accounting accounting;
 
unsigned long   flags;
+   atomic_tidle_counter;
+   atomic_tat_max_writeback_rate;
 
struct cache_sb sb;
 
@@ -523,6 +518,8 @@ struct cache_set {
 
struct bcache_device**devices;
unsigneddevices_max_used;
+   /* See set_at_max_writeback_rate() for how it is used */
+   unsignedprevious_dirty_dc_nr;
struct list_headcached_devs;
uint64_tcached_dev_sectors;
struct closure  caching;
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ae67f5fa8047..1af3d96abfa5 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
 
 /* Cached devices - read & write stuff */
 
+static void quit_max_writeback_rate(struct cache_set *c,
+   struct cached_dev *this_dc)
+{
+   int i;
+   struct bcache_device *d;
+   struct cached_dev *dc;
+
+   /*
+* If bch_register_lock is acquired by other attach/detach operations,
+* waiting here will increase I/O request latency for seconds or more.
+* To avoid such situation, only writeback rate of current cached device
+* is set to 1, and __update_writeback_rate() will decide writeback rate
+* of other cached devices (remember c->idle_counter is 0 now).
+*/
+   if (mutex_trylock(&bch_register_lock)) {
+   for (i = 0; i < c->devices_max_used; i++) {
+   if (!c->devices[i])
+   continue;
+
+   if (UUID_FLASH_ONLY(&c->uuids[i]))
+   continue;
+
+   d = c->devices[i];
+   dc = container_of(d, struct cached_dev, disk);
+   /*
+* set writeback rate to default minimum value,
+* then let update_writeback_rate() to decide the
+* upcoming rate.
+*/
+ 

Re: [PATCH 1/2] block: move dif_prepare/dif_complete functions to block layer

2018-07-23 Thread Martin K. Petersen


Christoph,

>> +void blk_integrity_dif_prepare(struct request *rq, u8 protection_type,
>> +   u32 ref_tag)
>> +{
>
> Maybe call this blk_t10_pi_prepare?

The rest of these functions have a blk_integrity_ prefix. So either
stick with that or put the functions in t10-pi.c and use a t10_pi_
prefix.

I'm a bit torn on placement since the integrity metadata could contain
other stuff than T10 PI. But the remapping is very specific to T10 PI.

>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 9421d9877730..4186bf027c59 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -1119,7 +1119,9 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd 
>> *SCpnt)
>>  SCpnt->cmnd[0] = WRITE_6;
>>  
>>  if (blk_integrity_rq(rq))
>> -sd_dif_prepare(SCpnt);
>> +blk_integrity_dif_prepare(SCpnt->request,
>> +  sdkp->protection_type,
>> +  scsi_prot_ref_tag(SCpnt));
>
> scsi_prot_ref_tag could be moved to the block layer as it only uses
> the sector in the request and the sector size, which we can get
> from the gendisk as well.  We then don't need to pass it to the function.

For Type 2, the PI can be at intervals different from the logical block
size (although we don't support that yet). We should use the
blk_integrity profile interval instead of assuming sector size.

And wrt. Keith's comment: The tuple_size should be the one from the
integrity profile as well, not sizeof(struct t10_pi_tuple).

-- 
Martin K. Petersen  Oracle Linux Engineering


[PATCH] block: Rename the null_blk_mod kernel module back into null_blk

2018-07-23 Thread Bart Van Assche
Commit ca4b2a011948 ("null_blk: add zone support") breaks several
blktests scripts because it renamed the null_blk kernel module into
null_blk_mod. Hence rename null_blk_mod back into null_blk.

Fixes: ca4b2a011948 ("null_blk: add zone support")
Signed-off-by: Bart Van Assche 
Cc: Matias Bjorling 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Damien Le Moal 
---
 drivers/block/Makefile| 6 +++---
 drivers/block/{null_blk.c => null_blk_main.c} | 0
 2 files changed, 3 insertions(+), 3 deletions(-)
 rename drivers/block/{null_blk.c => null_blk_main.c} (100%)

diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index a0d88aa0c05d..8566b188368b 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -38,9 +38,9 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)+= mtip32xx/
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_ZRAM) += zram/
 
-obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk_mod.o
-null_blk_mod-objs  := null_blk.o
-null_blk_mod-$(CONFIG_BLK_DEV_ZONED) += null_blk_zoned.o
+obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
+null_blk-objs  := null_blk_main.o
+null_blk-$(CONFIG_BLK_DEV_ZONED) += null_blk_zoned.o
 
 skd-y  := skd_main.o
 swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk_main.c
similarity index 100%
rename from drivers/block/null_blk.c
rename to drivers/block/null_blk_main.c
-- 
2.18.0



[PATCH] blk-mq: Avoid that a request queue stalls when restarting a shared hctx

2018-07-23 Thread Bart Van Assche
From: Roman Pen 

The patch below fixes queue stalling when a shared hctx is marked for restart
(BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
belongs to a particular queue, which in fact may not need to be restarted;
thus we return from blk_mq_sched_restart() and leave the shared hctx of
another queue never restarted.

The fix is to make the shared_hctx_restart counter belong not to the queue,
but to the tags, so that the counter reflects the real number of shared hctxs
that need to be restarted.
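The ownership change can be sketched with a toy user-space model (the types are stand-ins, not the real blk-mq structures):

```c
#include <assert.h>

/* Toy model of the fix: two request queues share one hardware context via
 * a shared tag set.  Keeping the "needs restart" counter in the shared
 * tags (rather than per queue) lets a completion on either queue observe
 * that some shared hctx still has to be restarted. */
struct tag_set {
	int shared_hctx_restart; /* counts marked hctxs across all queues */
};

struct hctx {
	int restart_marked;      /* stand-in for BLK_MQ_S_SCHED_RESTART */
	struct tag_set *tags;
};

/* Mark an hctx for restart, counting it where the tags are shared. */
void mark_restart(struct hctx *h)
{
	if (!h->restart_marked) {
		h->restart_marked = 1;
		h->tags->shared_hctx_restart++;
	}
}

/* On completion, any queue sharing the tags sees pending restarts. */
int needs_restart(const struct hctx *h)
{
	return h->tags->shared_hctx_restart > 0;
}
```

With the per-queue counter of the unpatched kernel, needs_restart() on queue B would read B's own zero counter and skip the restart, which is exactly step 6 of the sequence below.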

During tests 1 hctx (set->nr_hw_queues) was used and all stalled requests
were noticed in dd->fifo_list of mq-deadline scheduler.

A seemingly possible sequence of events:

1. Request A of queue A is inserted into dd->fifo_list of the scheduler.

2. Request B of queue A bypasses scheduler and goes directly to
   hctx->dispatch.

3. Request C of queue B is inserted.

4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is not
   empty (request B is in the list), hctx is only marked for the next restart
   and request A is left in the list (see the comment "So it's best to leave
   them there for as long as we can. Mark the hw queue as needing a restart in
   that case." in blk-mq-sched.c)

5. Eventually request B is completed/freed and blk_mq_sched_restart() is
   called, but by chance hctx from queue B is chosen for restart and request C
   gets a chance to be dispatched.

6. Eventually request C is completed/freed and blk_mq_sched_restart() is
   called, but shared_hctx_restart for queue B is zero and we return without
   attempt to restart hctx from queue A, thus request A is stuck forever.

But a stalling queue is not the only problem with blk_mq_sched_restart().
My tests show that the loops through all queues and hctxs can be very costly,
even with the shared_hctx_restart counter, which aims to fix the performance
issue. For my tests I created 128 devices with 64 hctxs each, which share the
same tag set.

The following is the fio and ftrace output for v4.14-rc4 kernel:

 READ: io=5630.3MB, aggrb=573208KB/s, minb=573208KB/s, maxb=573208KB/s, 
mint=10058msec, maxt=10058msec
WRITE: io=5650.9MB, aggrb=575312KB/s, minb=575312KB/s, maxb=575312KB/s, 
mint=10058msec, maxt=10058msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function              Hit       Time         Avg          s^2
  --------              ---       ----         ---          ---
  blk_mq_sched_restart  16347     9540759 us   583.639 us   8804801 us
  blk_mq_sched_restart  7884      6073471 us   770.354 us   8780054 us
  blk_mq_sched_restart  14176     7586794 us   535.185 us   2822731 us
  blk_mq_sched_restart  7843      6205435 us   791.206 us   12424960 us
  blk_mq_sched_restart  1490      4786107 us   3212.153 us  1949753 us
  blk_mq_sched_restart  7892      6039311 us   765.244 us   2994627 us
  blk_mq_sched_restart  15382     7511126 us   488.306 us   3090912 us
  [cut]

And here are results with two patches reverted:
   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

 READ: io=12884MB, aggrb=1284.3MB/s, minb=1284.3MB/s, maxb=1284.3MB/s, 
mint=10032msec, maxt=10032msec
WRITE: io=12987MB, aggrb=1294.6MB/s, minb=1294.6MB/s, maxb=1294.6MB/s, 
mint=10032msec, maxt=10032msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function              Hit       Time          Avg         s^2
  --------              ---       ----          ---         ---
  blk_mq_sched_restart  50699     8802.349 us   0.173 us    121.771 us
  blk_mq_sched_restart  50362     8740.470 us   0.173 us    161.494 us
  blk_mq_sched_restart  50402     9066.337 us   0.179 us    113.009 us
  blk_mq_sched_restart  50104     9366.197 us   0.186 us    188.645 us
  blk_mq_sched_restart  50375     9317.727 us   0.184 us    54.218 us
  blk_mq_sched_restart  50136     9311.657 us   0.185 us    446.790 us
  blk_mq_sched_restart  50103     9179.625 us   0.183 us    114.472 us
  [cut]

Timings and standard deviations are terrible, which leads to a significant
difference: 570MB/s vs 1280MB/s.

Signed-off-by: Roman Pen 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Johannes Thumshirn 
Cc: Jack Wang 
Cc: 
Signed-off-by: Bart Van Assche 
[ bvanassche: modified patch title, description and Cc-list ]
---
 block/blk-mq-sched.c   | 10 --
 block/blk-mq-tag.c |  1 +
 block/blk-mq-tag.h |  1 +
 block/blk-mq.c |  4 ++--
 include/linux/blkdev.h |  2 --
 5 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 56c493c6cd90..d863b1b32b07 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -60,10 +60,10 @@ static void blk_mq_

Re: [PATCH 2/2] blktests: add test for ANA state transition

2018-07-23 Thread Hannes Reinecke

On 07/21/2018 11:29 PM, Chaitanya Kulkarni wrote:

From: linux-block-ow...@vger.kernel.org  on behalf 
of Hannes Reinecke 
Sent: Tuesday, July 17, 2018 6:31 AM
To: Omar Sandoval
Cc: Christoph Hellwig; Sagi Grimberg; Keith Busch; 
linux-n...@lists.infradead.org; linux-block@vger.kernel.org; Hannes Reinecke; 
Hannes Reinecke
Subject: [PATCH 2/2] blktests: add test for ANA state transition
Signed-off-by: Hannes Reinecke 

---
  tests/nvme/014 | 158 +
  tests/nvme/014.out |  17 ++
  2 files changed, 175 insertions(+)
  create mode 100755 tests/nvme/014
  create mode 100644 tests/nvme/014.out

diff --git a/tests/nvme/014 b/tests/nvme/014
new file mode 100755
index 000..4b57229
--- /dev/null
+++ b/tests/nvme/014
@@ -0,0 +1,158 @@
+#!/bin/bash
+#
+# Regression test for ANA base support
+#
+# Copyright (C) 2018 Hannes Reinecke
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .


(Pasted link preview: Licenses - GNU Project - Free Software Foundation,
www.gnu.org: published software should be free software, normally released
under the GNU GPL, specifying version 3 or any later version.)

I don't mind, I just copied it over from testcase 10...


+
+. tests/nvme/rc
+
+DESCRIPTION="test ANA optimized/transitioning/inaccessible support"
+QUICK=1
+
+switch_nvmet_anagroup() {
+   local port1="$1"
+   local port2="$2"
+   local mode="$3"
+
+   echo "ANA state ${mode}"
+
+   if [ "${mode}" = "change" ] ; then
+   _set_nvmet_anagroup_state "${port1}" "1" "change"
+   _set_nvmet_anagroup_state "${port1}" "2" "change"
+   _set_nvmet_anagroup_state "${port2}" "1" "change"
+   _set_nvmet_anagroup_state "${port2}" "2" "change"
+   elif [ "${mode}" = "failover" ] ; then
+   _set_nvmet_anagroup_state "${port1}" "1" "inaccessible"
+   _set_nvmet_anagroup_state "${port1}" "2" "optimized"
+   _set_nvmet_anagroup_state "${port2}" "1" "optimized"
+   _set_nvmet_anagroup_state "${port2}" "2" "inaccessible"
+   else
+   _set_nvmet_anagroup_state "${port1}" "1" "optimized"
+   _set_nvmet_anagroup_state "${port1}" "2" "inaccessible"
+   _set_nvmet_anagroup_state "${port2}" "1" "inaccessible"
+   _set_nvmet_anagroup_state "${port2}" "2" "optimized"
+   fi
+}
+
+_display_ana_state() {
+   local grpid state
[CK] Newline here?


Okay


+   for nvme in /sys/class/nvme/* ; do
+   for c in ${nvme}/nvme* ; do
+   if [ ! -d ${c} ] ; then
+   echo "${nvme##*/}: ANA disabled"
+   continue
+   fi
+   grpid="$(cat "${c}/ana_grpid")"
+   state="$(cat "${c}/ana_state")"
+   echo "${c##*/}: grpid ${grpid} state ${state}"
+   done
+   done
+}

I think we need to move the above functions to ${BLKTESTS_HOME}/tests/nvme/rc.


Okay, can do.


+
+_switch_ana_states() {
+   local port1=$1
+   local port2=$2
+
[CK] Please remove the extra line.
+}
[CK] I was not able to find a caller for the above function, am I missing
something?
+

Yeah, that's an older function which got left over during refactoring.


+requires() {
+   _have_program nvme && _have_module nvme-loop && _have_module loop && \
+   _have_configfs && _have_fio
[CK] Missing nvmet module from the above list.
+}

Can we split the following test function into smaller routines? It will be
easier to review and maintain.

+
+test() {
+   echo "Running ${TEST_NAME}"
+
+   modprobe nvmet
+   modprobe nvme-loop
+
+   local port1
+   port1="$(_create_nvmet_port "loop")"
[CK] Can we initialize variables at the time of declaration, or after
declaring all the variables?


Sure. But we should do it consistently; the older tests also do not 
follow these rules...



+   ag1="$(_create_nvmet_anagroup "${port1}")"
[CK] Not sure if we need the ag1 variable.


Yeah, it's more for symmetry than anything else.


+
+   local port2
[CK] Can we please declare all the variables at the top? Please see
tests/nvme/006-013 to maintain

Re: [PATCH 1/2] block: move dif_prepare/dif_complete functions to block layer

2018-07-23 Thread Keith Busch
On Sun, Jul 22, 2018 at 12:49:57PM +0300, Max Gurtovoy wrote:
> +void blk_integrity_dif_prepare(struct request *rq, u8 protection_type,
> +u32 ref_tag)
> +{
> + const int tuple_sz = sizeof(struct t10_pi_tuple);
> + struct bio *bio;
> + struct t10_pi_tuple *pi;
> + u32 phys, virt;
> +
> + if (protection_type == T10_PI_TYPE3_PROTECTION)
> + return;
> +
> + phys = ref_tag;
> +
> + __rq_for_each_bio(bio, rq) {
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> + struct bio_vec iv;
> + struct bvec_iter iter;
> + unsigned int j;
> +
> + /* Already remapped? */
> + if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
> + break;
> +
> + virt = bip_get_seed(bip) & 0x;
> +
> + bip_for_each_vec(iv, bip, iter) {
> + pi = kmap_atomic(iv.bv_page) + iv.bv_offset;
> +
> + for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {

nvme's data integrity buffer can actually have more space between each
PI field, so we just need to account for that when iterating instead of
assuming each element is the size of a T10 PI tuple.
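The suggested iteration can be sketched in user-space C (the interval parameter stands in for the blk_integrity profile's interval; the real code would walk bio_integrity vectors and struct t10_pi_tuple, so everything here is a simplified stand-in):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the suggested fix: walk the integrity buffer in steps of the
 * profile's interval, which on NVMe may be larger than a T10 PI tuple,
 * and remap only the ref tag at the head of each interval.  The ref tag
 * is modeled as a plain 32-bit value at the start of each interval. */
void remap_ref_tags(uint8_t *buf, size_t len, size_t interval, uint32_t phys)
{
	size_t off;

	for (off = 0; off + sizeof(uint32_t) <= len; off += interval) {
		memcpy(buf + off, &phys, sizeof(uint32_t));
		phys++;
	}
}

/* Helper to read a ref tag back without unaligned access. */
uint32_t read_ref_tag(const uint8_t *buf, size_t off)
{
	uint32_t v;

	memcpy(&v, buf + off, sizeof(v));
	return v;
}
```

Stepping by `interval` rather than by `sizeof(struct t10_pi_tuple)` is the whole point: with extra space between PI fields, a fixed tuple-sized stride would remap bytes in the middle of an interval instead of the next tuple.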

Otherwise, great idea.


Re: [PATCH] bcache: set max writeback rate when I/O request is idle

2018-07-23 Thread Coly Li
On 2018/7/23 9:14 PM, Kai Krakow wrote:
> 2018-07-22 18:13 GMT+02:00 Coly Li :
>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>> allows the writeback rate to be faster if there is no I/O request on a
>> bcache device. It works well if there is only one bcache device attached
>> to the cache set. If there are many bcache devices attached to a cache
>> set, it may introduce performance regression because multiple faster
>> writeback threads of the idle bcache devices will compete the btree level
>> locks with the bcache device who have I/O requests coming.
>>
>> This patch fixes the above issue by only permitting fast writebac when
>> all bcache devices attached on the cache set are idle. And if one of the
>> bcache devices has new I/O request coming, minimized all writeback
>> throughput immediately and let PI controller __update_writeback_rate()
>> to decide the upcoming writeback rate for each bcache device.
>>
>> Also when all bcache devices are idle, limited wrieback rate to a small
>> number is wast of thoughput, especially when backing devices are slower
>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>> rate for each backing device if the whole cache set is idle. A faster
>> writeback rate in idle time means new I/Os may have more available space
>> for dirty data, and people may observe a better write performance then.
>>
>> Please note bcache may change its cache mode in run time, and this patch
>> still works if the cache mode is switched from writeback mode and there
>> is still dirty data on cache.
> 
> Running with this patch on 4.17, I see reduced IO lags in desktop
> usage during heavy writes. The reduction is minor but noticeable. The
> small short IO lags seem to be fixed with this but I still see the
> stalls with long lags in desktop usage. I don't address those to
> bcache, tho. This seems more like an issue resulting from using bees
> on btrfs.
> 
> 
>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing 
>> idle")
>> Cc: sta...@vger.kernel.org #4.16+
>> Signed-off-by: Coly Li 
>> Cc: Michael Lyle 
> 

Hi Kai,

> Tested-by: Kai Krakow 
> 

Thank you for the testing and information sharing, very helpful :-)

Coly Li

> 
>> ---
>>  drivers/md/bcache/bcache.h|  9 +---
>>  drivers/md/bcache/request.c   | 42 ++-
>>  drivers/md/bcache/sysfs.c | 14 +++--
>>  drivers/md/bcache/util.c  |  2 +-
>>  drivers/md/bcache/util.h  |  2 +-
>>  drivers/md/bcache/writeback.c | 98 +--
>>  6 files changed, 126 insertions(+), 41 deletions(-)
[snip]



Re: [PATCH] bcache: set max writeback rate when I/O request is idle

2018-07-23 Thread Kai Krakow
2018-07-22 18:13 GMT+02:00 Coly Li :
> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> allows the writeback rate to be faster if there is no I/O request on a
> bcache device. It works well if there is only one bcache device attached
> to the cache set. If there are many bcache devices attached to a cache
> set, it may introduce performance regression because multiple faster
> writeback threads of the idle bcache devices will compete the btree level
> locks with the bcache device who have I/O requests coming.
>
> This patch fixes the above issue by only permitting fast writebac when
> all bcache devices attached on the cache set are idle. And if one of the
> bcache devices has new I/O request coming, minimized all writeback
> throughput immediately and let PI controller __update_writeback_rate()
> to decide the upcoming writeback rate for each bcache device.
>
> Also when all bcache devices are idle, limited wrieback rate to a small
> number is wast of thoughput, especially when backing devices are slower
> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
> rate for each backing device if the whole cache set is idle. A faster
> writeback rate in idle time means new I/Os may have more available space
> for dirty data, and people may observe a better write performance then.
>
> Please note bcache may change its cache mode in run time, and this patch
> still works if the cache mode is switched from writeback mode and there
> is still dirty data on cache.

Running with this patch on 4.17, I see reduced IO lags in desktop
usage during heavy writes. The reduction is minor but noticeable. The
small short IO lags seem to be fixed with this but I still see the
stalls with long lags in desktop usage. I don't attribute those to
bcache, though. This seems more like an issue resulting from using bees
on btrfs.


> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> Cc: sta...@vger.kernel.org #4.16+
> Signed-off-by: Coly Li 
> Cc: Michael Lyle 

Tested-by: Kai Krakow 


> ---
>  drivers/md/bcache/bcache.h|  9 +---
>  drivers/md/bcache/request.c   | 42 ++-
>  drivers/md/bcache/sysfs.c | 14 +++--
>  drivers/md/bcache/util.c  |  2 +-
>  drivers/md/bcache/util.h  |  2 +-
>  drivers/md/bcache/writeback.c | 98 +--
>  6 files changed, 126 insertions(+), 41 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index d6bf294f3907..f7451e8be03c 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -328,13 +328,6 @@ struct cached_dev {
>  */
> atomic_thas_dirty;
>
> -   /*
> -* Set to zero by things that touch the backing volume-- except
> -* writeback.  Incremented by writeback.  Used to determine when to
> -* accelerate idle writeback.
> -*/
> -   atomic_tbacking_idle;
> -
> struct bch_ratelimitwriteback_rate;
> struct delayed_work writeback_rate_update;
>
> @@ -514,6 +507,8 @@ struct cache_set {
> struct cache_accounting accounting;
>
> unsigned long   flags;
> +   atomic_tidle_counter;
> +   atomic_tat_max_writeback_rate;
>
> struct cache_sb sb;
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index ae67f5fa8047..fe45f561a054 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1104,6 +1104,34 @@ static void detached_dev_do_request(struct 
> bcache_device *d, struct bio *bio)
>
>  /* Cached devices - read & write stuff */
>
> +static void quit_max_writeback_rate(struct cache_set *c)
> +{
> +   int i;
> +   struct bcache_device *d;
> +   struct cached_dev *dc;
> +
> +   mutex_lock(&bch_register_lock);
> +
> +   for (i = 0; i < c->devices_max_used; i++) {
> +   if (!c->devices[i])
> +   continue;
> +
> +   if (UUID_FLASH_ONLY(&c->uuids[i]))
> +   continue;
> +
> +   d = c->devices[i];
> +   dc = container_of(d, struct cached_dev, disk);
> +   /*
> +* set writeback rate to default minimum value,
> +* then let update_writeback_rate() to decide the
> +* upcoming rate.
> +*/
> +   atomic64_set(&dc->writeback_rate.rate, 1);
> +   }
> +
> +   mutex_unlock(&bch_register_lock);
> +}
> +
>  static blk_qc_t cached_dev_make_request(struct request_queue *q,
> struct bio *bio)
>  {
> @@ -1119,7 +1147,19 @@ static blk_qc_t cached_dev_make_request(struct 
> request_queue *q,
> return BLK_QC_T_NONE;
> }
>
> -   atomic_set(&dc->backing_idle, 0);
> +   if (d->c) {
> +   atomic_set(&d->c->idle_counter, 0);
> +   /*
> 

Re: [PATCH] bcache: set max writeback rate when I/O request is idle

2018-07-23 Thread Stefan Priebe - Profihost AG


Am 23.07.2018 um 11:05 schrieb Coly Li:
> On 2018/7/23 2:38 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Coly,
>>
>> thanks for this patch. I was right on the way to debug why i have lot's
>> of stalled I/O on latest SLES12-SP3 kernel. I'll add this patch and try
>> if this solves my issues.
>>
>> Each cache set has two backing devices.
> 
> Hi Stefan,
> 
> Thanks for the testing.

OK, it does not apply to SLES12-SP3 at all. I'll use the SLES15 kernel to test.

Stefan

> Coly Li
> 
> 
>>
>> Am 22.07.2018 um 18:13 schrieb Coly Li:
>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>> allows the writeback rate to be faster if there is no I/O request on a
>>> bcache device. It works well if there is only one bcache device attached
>>> to the cache set. If there are many bcache devices attached to a cache
>>> set, it may introduce performance regression because multiple faster
>>> writeback threads of the idle bcache devices will compete the btree level
>>> locks with the bcache device who have I/O requests coming.
>>>
>>> This patch fixes the above issue by only permitting fast writebac when
>>> all bcache devices attached on the cache set are idle. And if one of the
>>> bcache devices has new I/O request coming, minimized all writeback
>>> throughput immediately and let PI controller __update_writeback_rate()
>>> to decide the upcoming writeback rate for each bcache device.
>>>
>>> Also when all bcache devices are idle, limited wrieback rate to a small
>>> number is wast of thoughput, especially when backing devices are slower
>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>>> rate for each backing device if the whole cache set is idle. A faster
>>> writeback rate in idle time means new I/Os may have more available space
>>> for dirty data, and people may observe a better write performance then.
>>>
>>> Please note bcache may change its cache mode in run time, and this patch
>>> still works if the cache mode is switched from writeback mode and there
>>> is still dirty data on cache.
>>>
>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing 
>>> idle")
>>> Cc: sta...@vger.kernel.org #4.16+
>>> Signed-off-by: Coly Li 
>>> Cc: Michael Lyle 
>>> ---
>>>  drivers/md/bcache/bcache.h|  9 +---
>>>  drivers/md/bcache/request.c   | 42 ++-
>>>  drivers/md/bcache/sysfs.c | 14 +++--
>>>  drivers/md/bcache/util.c  |  2 +-
>>>  drivers/md/bcache/util.h  |  2 +-
>>>  drivers/md/bcache/writeback.c | 98 +--
>>>  6 files changed, 126 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>>> index d6bf294f3907..f7451e8be03c 100644
>>> --- a/drivers/md/bcache/bcache.h
>>> +++ b/drivers/md/bcache/bcache.h
>>> @@ -328,13 +328,6 @@ struct cached_dev {
>>>  */
>>> atomic_thas_dirty;
>>>  
>>> -   /*
>>> -* Set to zero by things that touch the backing volume-- except
>>> -* writeback.  Incremented by writeback.  Used to determine when to
>>> -* accelerate idle writeback.
>>> -*/
>>> -   atomic_tbacking_idle;
>>> -
>>> struct bch_ratelimitwriteback_rate;
>>> struct delayed_work writeback_rate_update;
>>>  
>>> @@ -514,6 +507,8 @@ struct cache_set {
>>> struct cache_accounting accounting;
>>>  
>>> unsigned long   flags;
>>> +   atomic_tidle_counter;
>>> +   atomic_tat_max_writeback_rate;
>>>  
>>> struct cache_sb sb;
>>>  
>>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>>> index ae67f5fa8047..fe45f561a054 100644
>>> --- a/drivers/md/bcache/request.c
>>> +++ b/drivers/md/bcache/request.c
>>> @@ -1104,6 +1104,34 @@ static void detached_dev_do_request(struct 
>>> bcache_device *d, struct bio *bio)
>>>  
>>>  /* Cached devices - read & write stuff */
>>>  
>>> +static void quit_max_writeback_rate(struct cache_set *c)
>>> +{
>>> +   int i;
>>> +   struct bcache_device *d;
>>> +   struct cached_dev *dc;
>>> +
>>> +   mutex_lock(&bch_register_lock);
>>> +
>>> +   for (i = 0; i < c->devices_max_used; i++) {
>>> +   if (!c->devices[i])
>>> +   continue;
>>> +
>>> +   if (UUID_FLASH_ONLY(&c->uuids[i]))
>>> +   continue;
>>> +
>>> +   d = c->devices[i];
>>> +   dc = container_of(d, struct cached_dev, disk);
>>> +   /*
>>> +* set writeback rate to default minimum value,
>>> +* then let update_writeback_rate() to decide the
>>> +* upcoming rate.
>>> +*/
>>> +   atomic64_set(&dc->writeback_rate.rate, 1);
>>> +   }
>>> +
>>> +   mutex_unlock(&bch_register_lock);
>>> +}
>>> +
>>>  static blk_qc_t cached_dev_make_request(struct request_queue *q,
>>> struct bio *bio)
>>>  {
>>> @@ -1119,7 +1147,19 @@ static blk_qc_t cached_dev_make_request(struct 
>>

Re: [PATCH] bcache: set max writeback rate when I/O request is idle

2018-07-23 Thread Coly Li
On 2018/7/23 2:38 PM, Stefan Priebe - Profihost AG wrote:
> Hi Coly,
> 
> thanks for this patch. I was right on the way to debug why i have lot's
> of stalled I/O on latest SLES12-SP3 kernel. I'll add this patch and try
> if this solves my issues.
> 
> Each cache set has two backing devices.

Hi Stefan,

Thanks for the testing.

Coly Li


> 
> Am 22.07.2018 um 18:13 schrieb Coly Li:
>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>> allows the writeback rate to be faster if there is no I/O request on a
>> bcache device. It works well if there is only one bcache device attached
>> to the cache set. If there are many bcache devices attached to a cache
>> set, it may introduce performance regression because multiple faster
>> writeback threads of the idle bcache devices will compete the btree level
>> locks with the bcache device who have I/O requests coming.
>>
>> This patch fixes the above issue by only permitting fast writebac when
>> all bcache devices attached on the cache set are idle. And if one of the
>> bcache devices has new I/O request coming, minimized all writeback
>> throughput immediately and let PI controller __update_writeback_rate()
>> to decide the upcoming writeback rate for each bcache device.
>>
>> Also when all bcache devices are idle, limited wrieback rate to a small
>> number is wast of thoughput, especially when backing devices are slower
>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>> rate for each backing device if the whole cache set is idle. A faster
>> writeback rate in idle time means new I/Os may have more available space
>> for dirty data, and people may observe a better write performance then.
>>
>> Please note bcache may change its cache mode in run time, and this patch
>> still works if the cache mode is switched from writeback mode and there
>> is still dirty data on cache.
>>
>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing 
>> idle")
>> Cc: sta...@vger.kernel.org #4.16+
>> Signed-off-by: Coly Li 
>> Cc: Michael Lyle 
>> ---
>>  drivers/md/bcache/bcache.h|  9 +---
>>  drivers/md/bcache/request.c   | 42 ++-
>>  drivers/md/bcache/sysfs.c | 14 +++--
>>  drivers/md/bcache/util.c  |  2 +-
>>  drivers/md/bcache/util.h  |  2 +-
>>  drivers/md/bcache/writeback.c | 98 +--
>>  6 files changed, 126 insertions(+), 41 deletions(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index d6bf294f3907..f7451e8be03c 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -328,13 +328,6 @@ struct cached_dev {
>>   */
>>  atomic_thas_dirty;
>>  
>> -/*
>> - * Set to zero by things that touch the backing volume-- except
>> - * writeback.  Incremented by writeback.  Used to determine when to
>> - * accelerate idle writeback.
>> - */
>> -atomic_tbacking_idle;
>> -
>>  struct bch_ratelimitwriteback_rate;
>>  struct delayed_work writeback_rate_update;
>>  
>> @@ -514,6 +507,8 @@ struct cache_set {
>>  struct cache_accounting accounting;
>>  
>>  unsigned long   flags;
>> +atomic_tidle_counter;
>> +atomic_tat_max_writeback_rate;
>>  
>>  struct cache_sb sb;
>>  
>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>> index ae67f5fa8047..fe45f561a054 100644
>> --- a/drivers/md/bcache/request.c
>> +++ b/drivers/md/bcache/request.c
>> @@ -1104,6 +1104,34 @@ static void detached_dev_do_request(struct 
>> bcache_device *d, struct bio *bio)
>>  
>>  /* Cached devices - read & write stuff */
>>  
>> +static void quit_max_writeback_rate(struct cache_set *c)
>> +{
>> +int i;
>> +struct bcache_device *d;
>> +struct cached_dev *dc;
>> +
>> +mutex_lock(&bch_register_lock);
>> +
>> +for (i = 0; i < c->devices_max_used; i++) {
>> +if (!c->devices[i])
>> +continue;
>> +
>> +if (UUID_FLASH_ONLY(&c->uuids[i]))
>> +continue;
>> +
>> +d = c->devices[i];
>> +dc = container_of(d, struct cached_dev, disk);
>> +/*
>> + * set writeback rate to default minimum value,
>> + * then let update_writeback_rate() to decide the
>> + * upcoming rate.
>> + */
>> +atomic64_set(&dc->writeback_rate.rate, 1);
>> +}
>> +
>> +mutex_unlock(&bch_register_lock);
>> +}
>> +
>>  static blk_qc_t cached_dev_make_request(struct request_queue *q,
>>  struct bio *bio)
>>  {
>> @@ -1119,7 +1147,19 @@ static blk_qc_t cached_dev_make_request(struct 
>> request_queue *q,
>>  return BLK_QC_T_NONE;
>>  }
>>  
>> -atomic_set(&dc->backing_idle, 0);
>> +if (d->c) {
>> +atomic_set(&d->c->idle_counter, 0);
>> +/*
>> +
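The idle-detection scheme the commit message describes (a cache-set-wide idle counter that each device's periodic rate-update worker bumps, and that any front-end I/O resets, dropping every device back to the minimum rate as in quit_max_writeback_rate()) can be sketched as a userspace model. All names, thresholds, and rate units below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>

#define MAX_RATE       (1 << 20)  /* "full speed" writeback, arbitrary unit */
#define MIN_RATE       1          /* default minimum, as in the patch */
#define IDLE_THRESHOLD 10         /* update periods with no I/O before going fast */

struct model_cache_set {
	int idle_counter;
	int at_max_writeback_rate;
	int rates[4];             /* one writeback rate per attached device */
	int ndev;
};

/* Called from the periodic rate-update worker of one device. */
static void update_writeback_rate(struct model_cache_set *c, int dev)
{
	if (++c->idle_counter >= IDLE_THRESHOLD) {
		/* whole cache set idle: every device writes back at full speed */
		c->at_max_writeback_rate = 1;
		for (int i = 0; i < c->ndev; i++)
			c->rates[i] = MAX_RATE;
	} else {
		/* stand-in for the PI controller's computed rate */
		c->rates[dev] = MIN_RATE + c->idle_counter;
	}
}

/* Called on every front-end request, mirroring quit_max_writeback_rate(). */
static void front_end_io(struct model_cache_set *c)
{
	c->idle_counter = 0;
	if (c->at_max_writeback_rate) {
		c->at_max_writeback_rate = 0;
		/* drop every device to the minimum and let the PI
		 * controller recompute from there */
		for (int i = 0; i < c->ndev; i++)
			c->rates[i] = MIN_RATE;
	}
}
```

The key design point the model captures: acceleration is a property of the whole cache set, not of one device, so a single busy device immediately slows down writeback on all of its idle siblings instead of letting them fight it for btree locks.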

Re: [PATCH 2/2] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-23 Thread Christoph Hellwig
On Sun, Jul 22, 2018 at 12:49:58PM +0300, Max Gurtovoy wrote:
> Also moved the logic of the remapping to the nvme core driver instead
> of implementing it in the nvme pci driver. This way all the other nvme
> transport drivers will benefit from it (in case they'll implement metadata
> support).
> 
> Suggested-by: Christoph Hellwig 
> Cc: Jens Axboe 
> Cc: Martin K. Petersen 
> Signed-off-by: Max Gurtovoy 
> ---
>  drivers/nvme/host/core.c | 23 +--
>  drivers/nvme/host/nvme.h |  9 +-
>  drivers/nvme/host/pci.c  | 75 
> +---
>  3 files changed, 23 insertions(+), 84 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 46df030b2c3f..0d94d3eb641c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -591,6 +591,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns 
> *ns,
>   nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);
>  
>   if (ns->ms) {
> + u32 ref_tag = nvme_block_nr(ns, blk_rq_pos(req));
>   /*

Please add an empty line here.

> +void nvme_cleanup_cmd(struct request *req)
> +{
> + if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
> + nvme_error_status(req) == BLK_STS_OK) {

This line can simply be nvme_req(req)->status == 0


Re: [PATCH 1/2] block: move dif_prepare/dif_complete functions to block layer

2018-07-23 Thread Christoph Hellwig
>  #include 
> +#include 

Sounds like the new functions should move to block/t10-pi.c given
that they are specific to the T10 defined formats.

> +/*
> + * The virtual start sector is the one that was originally submitted
> + * by the block layer.   Due to partitioning, MD/DM cloning, etc. the
> + * actual physical start sector is likely to be different.  Remap
> + * protection information to match the physical LBA.
> + *
> + * From a protocol perspective there's a slight difference between
> + * Type 1 and 2.  The latter uses 32-byte CDBs exclusively, and the
> + * reference tag is seeded in the command.  This gives us the potential to
> + * avoid virt->phys remapping during write.  However, at read time we
> + * don't know whether the virt sector is the same as when we wrote it
> + * (we could be reading from a real disk as opposed to an MD/DM device).  So
> + * we always remap Type 2, making it identical to Type 1.
> + *
> + * Type 3 does not have a reference tag so no remapping is required.
> + */

Maybe add proper kerneldoc comments given that these are exported
functions?

The 32-byte CDB comment doesn't really make sense here as it is SCSI
specific, so we'll need to drop it or find a good place for it in the
SCSI code.

> +void blk_integrity_dif_prepare(struct request *rq, u8 protection_type,
> +u32 ref_tag)
> +{

Maybe call this blk_t10_pi_prepare?

> + const int tuple_sz = sizeof(struct t10_pi_tuple);
> + struct bio *bio;
> + struct t10_pi_tuple *pi;
> + u32 phys, virt;
> +
> + if (protection_type == T10_PI_TYPE3_PROTECTION)
> + return;
> +
> + phys = ref_tag;

Seems like we could just use the ref_tag variable later instead of
duplicating it.

> +
> + __rq_for_each_bio(bio, rq) {
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> + struct bio_vec iv;
> + struct bvec_iter iter;
> + unsigned int j;
> +
> + /* Already remapped? */
> + if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
> + break;
> +
> + virt = bip_get_seed(bip) & 0xffffffff;

Looks like we could keep the virt variable inside the loop and assign
it where declared, e.g.:

u32 virt = bip_get_seed(bip) & 0xffffffff;

at the beginning of this block.

> + bip_for_each_vec(iv, bip, iter) {
> + pi = kmap_atomic(iv.bv_page) + iv.bv_offset;

The pi variable can have local scope here, too.

> + for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {
> +
> + if (be32_to_cpu(pi->ref_tag) == virt)
> + pi->ref_tag = cpu_to_be32(phys);
> +
> + virt++;
> + phys++;
> + }

No need for the empty lines inside this loop.
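The remapping arithmetic under review (walk the PI tuples, rewrite each ref tag that still carries the virtual sector number with the physical one, leave mismatching tags alone) can be demonstrated over a plain array in userspace. The struct layout is a simplified stand-in for struct t10_pi_tuple, and htonl/ntohl stand in for cpu_to_be32/be32_to_cpu:

```c
#include <stdint.h>
#include <arpa/inet.h>  /* htonl/ntohl as cpu_to_be32/be32_to_cpu stand-ins */

/* Simplified stand-in for struct t10_pi_tuple. */
struct pi_tuple {
	uint16_t guard_tag;
	uint16_t app_tag;
	uint32_t ref_tag;  /* big-endian on the wire */
};

/* Prepare-side remap: virtual LBA -> physical LBA, one tuple per
 * protection interval. */
static void remap_prepare(struct pi_tuple *pi, unsigned int intervals,
			  uint32_t virt, uint32_t phys)
{
	for (unsigned int i = 0; i < intervals; i++, pi++) {
		/* only remap tags that were actually seeded with the
		 * virtual LBA; anything else is left untouched */
		if (ntohl(pi->ref_tag) == virt)
			pi->ref_tag = htonl(phys);
		virt++;
		phys++;
	}
}
```

The conditional matters: a tag that does not match the expected virtual value (e.g. an escape pattern) must pass through unmodified, which is why the loop still advances virt and phys on every interval.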

> +/*
> + * Remap physical sector values in the reference tag to the virtual
> + * values expected by the block layer.
> + */
> +void blk_integrity_dif_complete(struct request *rq, u8 protection_type,
> + u32 ref_tag, unsigned int intervals)

And pretty much all the  comments apply to this function as well.

> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 9421d9877730..4186bf027c59 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1119,7 +1119,9 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd 
> *SCpnt)
>   SCpnt->cmnd[0] = WRITE_6;
>  
>   if (blk_integrity_rq(rq))
> - sd_dif_prepare(SCpnt);
> + blk_integrity_dif_prepare(SCpnt->request,
> +   sdkp->protection_type,
> +   scsi_prot_ref_tag(SCpnt));

scsi_prot_ref_tag could be moved to the block layer as it only uses
the sector in the request and the sector size, which we can get
from the gendisk as well.  We then don't need to pass it to the function.

We could also move the protection type to the gendisk, although I'm not
sure it's going to be worth it.
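The computation a block-layer scsi_prot_ref_tag replacement would perform, as described above, is just the 512-byte request position scaled to logical-block units and truncated to 32 bits. A minimal userspace sketch, with hypothetical names:

```c
#include <stdint.h>

/* Sketch of a block-layer ref-tag helper: convert a request position
 * (in 512-byte sectors, as blk_rq_pos() reports it) into a protection
 * reference tag, given the device's logical block size in bytes.
 * Names are illustrative, not an existing kernel API. */
static uint32_t ref_tag_from_pos(uint64_t pos_512, unsigned int lbs)
{
	unsigned int shift = 0;

	/* ilog2 of the logical block size (assumed power of two, >= 512) */
	while ((1u << shift) < lbs)
		shift++;

	/* scale 512-byte sectors to logical blocks, truncate to 32 bits */
	return (uint32_t)(pos_512 >> (shift - 9));
}
```

With a 4096-byte logical block size, eight 512-byte sectors make one logical block, so the ref tag advances once per 4 KiB of data.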