Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-26 Thread jianchao.wang



On 07/27/2018 06:24 AM, Bart Van Assche wrote:
> On Thu, 2018-07-26 at 10:45 +0800, jianchao.wang wrote:
>> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
>>> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>> return ret;
>>>  
>>> blk_pm_runtime_lock(q);
>>> +   blk_set_preempt_only(q);
>>
>> We only stop non-RQF_PM requests from entering when the state is
>> RPM_SUSPENDING or RPM_RESUMING. blk_pre_runtime_suspend() should only
>> _check_ whether runtime suspend is allowed, so we should not set
>> preempt_only here.
>>
>>> +   percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
>>>  
>>> spin_lock_irq(q->queue_lock);
>>> -   if (q->nr_pending) {
>>> +   if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>> ret = -EBUSY;
>>> pm_runtime_mark_last_busy(q->dev);
>>> } else {
>>> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>> }
>>> spin_unlock_irq(q->queue_lock);
>>>  
>>> +   percpu_ref_switch_to_percpu(&q->q_usage_counter);
>>> blk_pm_runtime_unlock(q);
> 
> Hello Jianchao,
> 
> There is only one caller of blk_post_runtime_suspend(), namely
> sdev_runtime_suspend(). That function calls pm->runtime_suspend() before
> it calls blk_post_runtime_suspend(). I think it would be wrong to set the
> PREEMPT_ONLY flag from inside blk_post_runtime_suspend() because that could
> cause pm->runtime_suspend() to be called while a request is in progress.
> 

Hi Bart

If q_usage_counter is not zero here, we will leave the request_queue in
preempt-only mode. The request_queue should be put into preempt-only mode
only when we have confirmed that we can set rpm_status to RPM_SUSPENDING
or RPM_RESUMING.

Maybe we could set PREEMPT_ONLY after we confirm we can set the rpm_status
to RPM_SUSPENDING. Something like:

percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
if (!percpu_ref_is_zero(&q->q_usage_counter)) {
	ret = -EBUSY;
	pm_runtime_mark_last_busy(q->dev);
} else {
	blk_set_preempt_only(q);
	if (!percpu_ref_is_zero(&q->q_usage_counter)) {
		ret = -EBUSY;
		pm_runtime_mark_last_busy(q->dev);
		blk_clear_preempt_only(q);
	} else {
		q->rpm_status = RPM_SUSPENDING;
	}
}


Thanks
Jianchao


Re: [PATCH 0/2] blktests: test ANA base support

2018-07-26 Thread Omar Sandoval
On Thu, Jul 26, 2018 at 02:31:32PM -0700, Omar Sandoval wrote:
> On Thu, Jul 26, 2018 at 09:35:53AM -0700, James Smart wrote:
> > make sure this fix has been picked up in your kernel:
> > http://git.infradead.org/nvme.git/commit/6cdefc6e2ad52170f89a8d0e8b1a1339f91834dc
> > 
> > deletes were broken otherwise and have the exact trace below.
> > 
> > -- james
> 
> Thanks, James, that indeed fixes the problem. I'll go ahead and merge
> these tests since there's a fix in flight.

By "these tests", I mean Chaitanya's tests, Hannes still has some
comments to resolve :)


Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-26 Thread Bart Van Assche
On Thu, 2018-07-26 at 10:45 +0800, jianchao.wang wrote:
> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
> > @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> > return ret;
> >  
> > blk_pm_runtime_lock(q);
> > +   blk_set_preempt_only(q);
> 
> We only stop non-RQF_PM requests from entering when the state is
> RPM_SUSPENDING or RPM_RESUMING. blk_pre_runtime_suspend() should only
> _check_ whether runtime suspend is allowed, so we should not set
> preempt_only here.
> 
> > +   percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
> >  
> > spin_lock_irq(q->queue_lock);
> > -   if (q->nr_pending) {
> > +   if (!percpu_ref_is_zero(&q->q_usage_counter)) {
> > ret = -EBUSY;
> > pm_runtime_mark_last_busy(q->dev);
> > } else {
> > @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> > }
> > spin_unlock_irq(q->queue_lock);
> >  
> > +   percpu_ref_switch_to_percpu(&q->q_usage_counter);
> > blk_pm_runtime_unlock(q);

Hello Jianchao,

There is only one caller of blk_post_runtime_suspend(), namely
sdev_runtime_suspend(). That function calls pm->runtime_suspend() before
it calls blk_post_runtime_suspend(). I think it would be wrong to set the
PREEMPT_ONLY flag from inside blk_post_runtime_suspend() because that could
cause pm->runtime_suspend() to be called while a request is in progress.

Bart.



Re: [PATCH v2 3/5] block: Serialize queue freezing and blk_pre_runtime_suspend()

2018-07-26 Thread Bart Van Assche
On Thu, 2018-07-26 at 10:45 +0800, jianchao.wang wrote:
> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
> > +
> > +void blk_pm_runtime_lock(struct request_queue *q)
> > +{
> > +   spin_lock(&q->rpm_lock);
> > +   wait_event_interruptible_locked(q->rpm_wq,
> > + q->rpm_owner == NULL || q->rpm_owner == current);
> > +   if (q->rpm_owner == NULL)
> > +   q->rpm_owner = current;
> > +   q->rpm_nesting_level++;
> > +   spin_unlock(&q->rpm_lock);
> > +}
> 
> The lock which wait_event_interruptible_locked() wants to hold is the wq.lock.

The above code is indeed wrong. This is what I think it should be changed
into:

+void blk_pm_runtime_lock(struct request_queue *q)
+{
+   might_sleep();
+
+   spin_lock(&q->rpm_lock);
+   wait_event_exclusive_cmd(q->rpm_wq,
+   q->rpm_owner == NULL || q->rpm_owner == current,
+   spin_unlock(&q->rpm_lock), spin_lock(&q->rpm_lock));
+   if (q->rpm_owner == NULL)
+   q->rpm_owner = current;
+   q->rpm_nesting_level++;
+   spin_unlock(&q->rpm_lock);
+}
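
The matching unlock would then presumably decrement the nesting level and
wake up the next exclusive waiter, along these lines (a sketch, not taken
verbatim from the patch):

+void blk_pm_runtime_unlock(struct request_queue *q)
+{
+   spin_lock(&q->rpm_lock);
+   WARN_ON_ONCE(q->rpm_owner != current);
+   if (--q->rpm_nesting_level == 0) {
+   q->rpm_owner = NULL;
+   wake_up(&q->rpm_wq);
+   }
+   spin_unlock(&q->rpm_lock);
+}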

Bart.



Re: [PATCH 0/2] blktests: test ANA base support

2018-07-26 Thread Omar Sandoval
On Thu, Jul 26, 2018 at 09:35:53AM -0700, James Smart wrote:
> make sure this fix has been picked up in your kernel:
> http://git.infradead.org/nvme.git/commit/6cdefc6e2ad52170f89a8d0e8b1a1339f91834dc
> 
> deletes were broken otherwise and have the exact trace below.
> 
> -- james

Thanks, James, that indeed fixes the problem. I'll go ahead and merge
these tests since there's a fix in flight.


[PATCH] readahead: stricter check for bdi io_pages

2018-07-26 Thread stockhausen
ondemand_readahead() checks bdi->io_pages to cap the maximum number of
pages to be processed. This works until the readit section: if we do an
async-only readahead (async size = sync size) and the target is at the
beginning of the window, we grow the window by another get_next_ra_size()
pages. Btrace for large reads shows that the kernel then always issues a
doubled-size read at the beginning of processing. Add an additional check
for io_pages in the lower part of the function.

The fix helps devices that hard-limit bio pages and rely on proper
handling of max_hw_read_sectors (e.g. older FusionIO cards). For
that reason it could qualify for stable.
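
For illustration (numbers assumed): with bdi->io_pages capping max_pages at
256 and a window of ra->size = ra->async_size = 256 pages,
get_next_ra_size() returns at most max_pages, so the old code grows
ra->size to 512, twice the device limit. With the added check, ra->size is
clamped to max_pages (256) and ra->async_size to max_pages >> 1 (128).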

Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Signed-off-by: Markus Stockhausen stockhau...@collogia.de

diff --git a/mm/readahead.c b/mm/readahead.c
index 539bbb6..d4ad4bb 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -380,6 +380,7 @@ static int try_context_readahead(struct address_space *mapping,
 {
struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
unsigned long max_pages = ra->ra_pages;
+   unsigned long add_pages;
pgoff_t prev_offset;
 
/*
@@ -469,10 +470,17 @@ static int try_context_readahead(struct address_space *mapping,
 * Will this read hit the readahead marker made by itself?
 * If so, trigger the readahead marker hit now, and merge
 * the resulted next readahead window into the current one.
+* Take care of maximum IO pages as above.
 */
if (offset == ra->start && ra->size == ra->async_size) {
-   ra->async_size = get_next_ra_size(ra, max_pages);
-   ra->size += ra->async_size;
+   add_pages = get_next_ra_size(ra, max_pages);
+   if (ra->size + add_pages <= max_pages) {
+   ra->async_size = add_pages;
+   ra->size += add_pages;
+   } else {
+   ra->size = max_pages;
+   ra->async_size = max_pages >> 1;
+   }
}
 
return ra_submit(ra, mapping, filp);




[PATCH] block: reset bi_iter.bi_done after splitting bio

2018-07-26 Thread Greg Edwards
After the bio has been updated to represent the remaining sectors, reset
bi_done so bio_rewind_iter() does not rewind further than it should.

This resolves a bio_integrity_process() failure on reads where the
original request was split.

Fixes: 63573e359d05 ("bio-integrity: Restore original iterator on verify stage")
Signed-off-by: Greg Edwards 
---
Simple test case:

  # modprobe scsi_debug.ko dif=1 dix=1 guard=0 dev_size_mb=1024 sector_size=4096
  # echo 512 > /sys/block//queue/max_sectors_kb
  # dd if=/dev/ of=/dev/null bs=1M iflag=direct count=1
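
The effect can also be modeled in user space (a simplified sketch; the
field names follow struct bvec_iter, everything else is illustrative):

#include <assert.h>
#include <stdio.h>

struct iter {
	unsigned int bi_size;	/* bytes still to be processed */
	unsigned int bi_done;	/* bytes already advanced over */
};

/* models bvec_iter_advance(), which bio_advance() ends up calling */
static void advance(struct iter *it, unsigned int bytes)
{
	it->bi_size -= bytes;
	it->bi_done += bytes;	/* what bio_rewind_iter() walks back over */
}

int main(void)
{
	struct iter parent = { .bi_size = 8 * 512, .bi_done = 0 };

	/* bio_split() advances the parent past the split-off sectors */
	advance(&parent, 4 * 512);

	/* without the fix, a later rewind would walk back 2048 bytes
	 * before the parent's real start */
	assert(parent.bi_done == 2048);
	parent.bi_done = 0;	/* the one-line fix from this patch */
	printf("bi_size=%u bi_done=%u\n", parent.bi_size, parent.bi_done);
	return 0;
}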

 block/bio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/bio.c b/block/bio.c
index 67eff5eddc49..f33a030b9dad 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1866,6 +1866,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
bio_integrity_trim(split);
 
bio_advance(bio, split->bi_iter.bi_size);
+   bio->bi_iter.bi_done = 0;
 
if (bio_flagged(bio, BIO_TRACE_COMPLETION))
bio_set_flag(split, BIO_TRACE_COMPLETION);
-- 
2.17.1



Re: [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO()

2018-07-26 Thread Jens Axboe
On 7/25/18 2:15 PM, Martin Wilck wrote:
> Hello Jens, Ming, Jan, and all others,
> 
> the following patches have been verified by a customer to fix a silent data
> corruption which he has been seeing since "72ecad2 block: support a full bio
> worth of IO for simplified bdev direct-io".
> 
> The patches are based on our observation that the corruption is only
> observed if the __blkdev_direct_IO_simple() code path is executed,
> and if that happens, "short writes" are observed in this code path,
> which causes a fallback to buffered IO, while the application continues
> submitting direct IO requests.
> 
> Following Ming's suggestion, I've changed the patch set such that
> bio_iov_iter_get_pages() now always returns as many pages as possible.
> This simplifies the patch set a lot. Except for
> __blkdev_direct_IO_simple(), all callers of bio_iov_iter_get_pages()
> call it in a loop, and expect to get just some pages. Therefore I
> have made bio_iov_iter_get_pages() return success if it can pin some
> pages, even if MM returns an error on the way. Error is returned only
> if no pages at all could be pinned. This also avoids the need for
> cleanup code in the helper - callers will submit the bio with the
> allocated pages, and clean up later as appropriate.

Thanks everyone involved in this, I've queued it up for 4.18.

-- 
Jens Axboe



Re: [GIT PULL] nvme fixes for 4.18

2018-07-26 Thread Jens Axboe
On 7/26/18 7:16 AM, Christoph Hellwig wrote:
> Two small fixes each for the FC code and the target.
> 
> 
> The following changes since commit 8f3ea35929a0806ad1397db99a89ffee0140822a:
> 
>   nbd: handle unexpected replies better (2018-07-16 10:14:40 -0600)
> 
> are available in the Git repository at:
> 
>   git://git.infradead.org/nvme.git nvme-4.18
> 
> for you to fetch changes up to 405a7519607e7a364114896264440c0f87b325c0:
> 
>   nvmet: only check for filebacking on -ENOTBLK (2018-07-25 13:14:04 +0200)
> 
> 
> Hannes Reinecke (2):
>   nvmet: fixup crash on NULL device path
>   nvmet: only check for filebacking on -ENOTBLK
> 
> James Smart (2):
>   nvmet-fc: fix target sgl list on large transfers
>   nvme: if_ready checks to fail io to deleting controller

Pulled, thanks.

-- 
Jens Axboe



Re: [PATCH 0/2] blktests: test ANA base support

2018-07-26 Thread James Smart

make sure this fix has been picked up in your kernel:
http://git.infradead.org/nvme.git/commit/6cdefc6e2ad52170f89a8d0e8b1a1339f91834dc

deletes were broken otherwise and have the exact trace below.

-- james


On 7/25/2018 7:01 PM, Chaitanya Kulkarni wrote:

Thanks for the report and correction. I was able to run the test and reproduce 
the problem, I also confirm that it is not that consistent.

As I suspected the problem is in the nvme disconnect in the following call 
chain:-


nvme_sysfs_delete
  nvme_delete_ctrl_sync
    nvme_delete_ctrl_work
      nvme_remove_namespaces
        nvme_ns_remove
          blk_cleanup_queue
            blk_freeze_queue
              blk_freeze_queue_start
                wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter)); <- Stuck here.
 
blk_cleanup_queue() is where it is blocking, and then the target ctrl
times out on keep-alive.


It will take some time for me to debug and find a fix; let's merge this
test once we fix this issue.



How should we go about the other tests? Should we merge them and keep this
one out in the nvmeof branch?
Regards,
Chaitanya



From: Omar Sandoval 
Sent: Wednesday, July 25, 2018 2:00 PM
To: Chaitanya Kulkarni
Cc: Hannes Reinecke; Omar Sandoval; Christoph Hellwig; Sagi Grimberg; Keith 
Busch; linux-n...@lists.infradead.org; linux-block@vger.kernel.org
Subject: Re: [PATCH 0/2] blktests: test ANA base support
   
  
On Wed, Jul 25, 2018 at 07:27:35PM +, Chaitanya Kulkarni wrote:

Thanks, Omar.

Tests nvme/014 and nvme/015 had a pretty bad typo that I didn't notice
last time:

dd=/dev/urandom of="/dev/${nvmedev}n1" count=128000 bs=4k

That should be

dd if=/dev/urandom of="/dev/${nvmedev}n1" count=128000 bs=4k status=none

When I fix that (and change the nvme flush call as mentioned before), I
sometimes get a hung task:

[  273.80] run blktests nvme/015 at 2018-07-25 13:44:11
[  273.861950] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  273.875014] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:c5e36fdf-8e4d-4c27-be56-da373db583b2.
[  273.877457] nvme nvme1: creating 4 I/O queues.
[  273.879141] nvme nvme1: new ctrl: "blktests-subsystem-1"
[  276.247708] nvme nvme1: using deprecated NVME_IOCTL_IO_CMD ioctl on the char device!
[  276.262835] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
[  289.755361] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[  289.760579] nvmet: ctrl 1 fatal error occurred!
[  491.095890] INFO: task kworker/u8:0:7 blocked for more than 120 seconds.
[  491.104407]   Not tainted 4.18.0-rc6 #18
[  491.108330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  491.116521] kworker/u8:0    D    0 7  2 0x8000
[  491.121754] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
[  491.129604] Call Trace:
[  491.131611]  ? __schedule+0x2a1/0x890
[  491.135112]  ? _raw_spin_unlock_irqrestore+0x20/0x40
[  491.140542]  schedule+0x32/0x90
[  491.142030]  blk_mq_freeze_queue_wait+0x41/0xa0
[  491.144186]  ? wait_woken+0x80/0x80
[  491.145726]  blk_cleanup_queue+0x75/0x160
[  491.150235]  nvme_ns_remove+0xf9/0x130 [nvme_core]
[  491.151910]  nvme_remove_namespaces+0x86/0xc0 [nvme_core]
[  491.153127]  nvme_delete_ctrl_work+0x4b/0x80 [nvme_core]
[  491.154727]  process_one_work+0x18c/0x360
[  491.155428]  worker_thread+0x1c6/0x380
[  491.156160]  ? process_one_work+0x360/0x360
[  491.157493]  kthread+0x112/0x130
[  491.159119]  ? kthread_flush_work_fn+0x10/0x10
[  491.160008]  ret_from_fork+0x35/0x40
[  491.160729] INFO: task nvme:1139 blocked for more than 120 seconds.
[  491.162416]   Not tainted 4.18.0-rc6 #18
[  491.164348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  491.166012] nvme    D    0  1139   1072 0x
[  491.167946] Call Trace:
[  491.168459]  ? __schedule+0x2a1/0x890
[  491.169312]  schedule+0x32/0x90
[  491.170180]  schedule_timeout+0x311/0x4a0
[  491.171921]  ? kernfs_fop_release+0xa0/0xa0
[  491.172884]  wait_for_common+0x1a0/0x1d0
[  491.173813]  ? wake_up_q+0x70/0x70
[  491.174410]  flush_work+0x10e/0x1c0
[  491.174991]  ? flush_workqueue_prep_pwqs+0x130/0x130
[  491.176113]  nvme_delete_ctrl_sync+0x41/0x50 [nvme_core]
[  491.177969]  nvme_sysfs_delete+0x28/0x30 [nvme_core]
[  491.178632]  kernfs_fop_write+0x116/0x190
[  491.179254]  __vfs_write+0x36/0x190
[  491.179812]  vfs_write+0xa9/0x190
[  491.180344]  ksys_write+0x4f/0xb0
[  491.181056]  do_syscall_64+0x5b/0x170
[  491.181583]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  491.182311] RIP: 0033:0x7fc04176b9d4
[  491.182863] Code: Bad RIP value.
[  491.183650] RSP: 002b:7ffc33bd15a8 EFLAGS: 0246 ORIG_RAX: 0001
[  491.184622] RAX: ffda RBX: 0004 RCX: 7fc04176b9d4
[  491.185606] RDX: 0001 RSI: 55884bd0810a RDI: 0004
[  491.186719] RBP: 

[PATCH 0/9] Pending bcache patches for review

2018-07-26 Thread Coly Li
Hi folks,

I ask for help to review these patches; some of them have been posted for
months. They have been OK in my testing for a while, but still it would be
great to have more reviewers before these patches go into v4.19.

Thanks in advance. 

Coly Li
---
Coly Li (9):
  bcache: do not check return value of debugfs_create_dir()
  bcache: display rate debug parameters to 0 when writeback is not
running
  bcache: avoid unnecessary cache prefetch in bch_btree_node_get()
  bcache: add a comment in register_bdev()
  bcache: fix mistaken code comments in struct cache
  bcache: fix mistaken comments in bch_keylist_realloc()
  bcache: add code comments for bset.c
  bcache: initiate bcache_debug to NULL
  bcache: set max writeback rate when I/O request is idle

 drivers/md/bcache/bcache.h| 16 +++---
 drivers/md/bcache/bset.c  | 63 
 drivers/md/bcache/btree.c | 14 +++---
 drivers/md/bcache/closure.c   | 13 +++--
 drivers/md/bcache/closure.h   |  4 +-
 drivers/md/bcache/debug.c | 13 ++---
 drivers/md/bcache/request.c   | 56 -
 drivers/md/bcache/super.c |  9 +++-
 drivers/md/bcache/sysfs.c | 37 +-
 drivers/md/bcache/util.c  |  2 +-
 drivers/md/bcache/util.h  |  2 +-
 drivers/md/bcache/writeback.c | 91 +++
 12 files changed, 244 insertions(+), 76 deletions(-)

-- 
2.17.1



[PATCH 4/9] bcache: add a comment in register_bdev()

2018-07-26 Thread Coly Li
---
 drivers/md/bcache/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index c7ffa6ef3f82..f517d7d1fa10 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1291,6 +1291,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
pr_info("registered backing device %s", dc->backing_dev_name);
 
list_add(&dc->list, &uncached_devices);
+   /* attach to a matched cache set if it exists */
list_for_each_entry(c, &bch_cache_sets, list)
bch_cached_dev_attach(dc, c, NULL);
 
-- 
2.17.1



[PATCH 2/9] bcache: display rate debug parameters to 0 when writeback is not running

2018-07-26 Thread Coly Li
When writeback is not running, the writeback rate should be 0, any other
value is misleading. And the following dynamic writeback rate debug
parameters should be 0 too:
	rate, proportional, integral, change
otherwise they are misleading when writeback is not running.

Signed-off-by: Coly Li 
---
 drivers/md/bcache/sysfs.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 225b15aa0340..3e9d3459a224 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -149,6 +149,7 @@ SHOW(__bch_cached_dev)
struct cached_dev *dc = container_of(kobj, struct cached_dev,
 disk.kobj);
const char *states[] = { "no cache", "clean", "dirty", "inconsistent" };
+   int wb = dc->writeback_running;
 
 #define var(stat)  (dc->stat)
 
@@ -170,7 +171,7 @@ SHOW(__bch_cached_dev)
var_printf(writeback_running,   "%i");
var_print(writeback_delay);
var_print(writeback_percent);
-   sysfs_hprint(writeback_rate,dc->writeback_rate.rate << 9);
+   sysfs_hprint(writeback_rate,wb ? dc->writeback_rate.rate << 9 : 0);
sysfs_hprint(io_errors, atomic_read(&dc->io_errors));
sysfs_printf(io_error_limit,"%i", dc->error_limit);
sysfs_printf(io_disable,"%i", dc->io_disable);
@@ -188,15 +189,20 @@ SHOW(__bch_cached_dev)
char change[20];
s64 next_io;
 
-   bch_hprint(rate,dc->writeback_rate.rate << 9);
-   bch_hprint(dirty,   bcache_dev_sectors_dirty(&dc->disk) << 9);
-   bch_hprint(target,  dc->writeback_rate_target << 9);
-   bch_hprint(proportional,dc->writeback_rate_proportional << 9);
-   bch_hprint(integral,dc->writeback_rate_integral_scaled << 9);
-   bch_hprint(change,  dc->writeback_rate_change << 9);
-
-   next_io = div64_s64(dc->writeback_rate.next - local_clock(),
-   NSEC_PER_MSEC);
+   /*
+* Except for dirty and target, other values should
+* be 0 if writeback is not running.
+*/
+   bch_hprint(rate, wb ? dc->writeback_rate.rate << 9 : 0);
+   bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9);
+   bch_hprint(target, dc->writeback_rate_target << 9);
+   bch_hprint(proportional,
+  wb ? dc->writeback_rate_proportional << 9 : 0);
+   bch_hprint(integral,
+  wb ? dc->writeback_rate_integral_scaled << 9 : 0);
+   bch_hprint(change, wb ? dc->writeback_rate_change << 9 : 0);
+   next_io = wb ? div64_s64(dc->writeback_rate.next-local_clock(),
+NSEC_PER_MSEC) : 0;
 
return sprintf(buf,
   "rate:\t\t%s/sec\n"
-- 
2.17.1



[PATCH 3/9] bcache: avoid unnecessary cache prefetch in bch_btree_node_get()

2018-07-26 Thread Coly Li
In bch_btree_node_get() the read-in btree node will be partially
prefetched into the L1 cache for the following bset iteration (if there
is one). But if the btree node read failed, the prefetch operations will
waste L1 cache space. This patch checks whether the read operation
succeeded, and only does the cache prefetch when the read I/O succeeded.
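
As a user-space analogue of the reordering (illustrative only; the kernel
code prefetches bset data instead):

#include <stdio.h>

/* prefetching is a pure optimization, so issue it only once we know
 * the read succeeded */
static void iterate_sets(void *sets[], int nsets, int io_error)
{
	int i;

	if (io_error)
		return;		/* bail out before touching the L1 cache */

	for (i = 0; i < nsets; i++)
		__builtin_prefetch(sets[i]);

	/* ... iterate over the now-prefetched sets ... */
}

int main(void)
{
	int a = 1, b = 2;
	void *sets[2] = { &a, &b };

	iterate_sets(sets, 2, 0);
	puts("done");
	return 0;
}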

Signed-off-by: Coly Li 
---
 drivers/md/bcache/btree.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 475008fbbaab..c19f7716df88 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1011,6 +1011,13 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
BUG_ON(b->level != level);
}
 
+   if (btree_node_io_error(b)) {
+   rw_unlock(write, b);
+   return ERR_PTR(-EIO);
+   }
+
+   BUG_ON(!b->written);
+
b->parent = parent;
b->accessed = 1;
 
@@ -1022,13 +1029,6 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
for (; i <= b->keys.nsets; i++)
prefetch(b->keys.set[i].data);
 
-   if (btree_node_io_error(b)) {
-   rw_unlock(write, b);
-   return ERR_PTR(-EIO);
-   }
-
-   BUG_ON(!b->written);
-
return b;
 }
 
-- 
2.17.1



[PATCH 6/9] bcache: fix mistaken comments in bch_keylist_realloc()

2018-07-26 Thread Coly Li
---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 8eece9ef9f46..91206f329971 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -107,7 +107,7 @@ static int bch_keylist_realloc(struct keylist *l, unsigned u64s,
/*
 * The journalling code doesn't handle the case where the keys to insert
 * is bigger than an empty write: If we just return -ENOMEM here,
-* bio_insert() and bio_invalidate() will insert the keys created so far
+* bch_data_insert_keys() will insert the keys created so far
 * and finish the rest when the keylist is empty.
 */
if (newsize * sizeof(uint64_t) > block_bytes(c) - sizeof(struct jset))
-- 
2.17.1



[PATCH 8/9] bcache: initiate bcache_debug to NULL

2018-07-26 Thread Coly Li
The global variable bcache_debug is first initialized in bch_debug_init()
and destroyed in bch_debug_exit(). bch_debug_init() is called in
bcache_init() along with many other functions; if one of the preceding
calls fails, bcache_exit() will be called in the failure path.

The problem is, if bcache_init() fails before bch_debug_init() is called,
then in bcache_exit(), when bch_debug_exit() is called to destroy the
global variable bcache_debug, at that moment bcache_debug is undefined,
and the test "if (!IS_ERR_OR_NULL(bcache_debug))" might be buggy.

This patch initializes the global variable bcache_debug to NULL, to make
the failure code path predictable.

Signed-off-by: Coly Li 
---
 drivers/md/bcache/debug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 57f8f5aeee55..24b0eb65ddec 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -17,7 +17,7 @@
 #include 
 #include 
 
-struct dentry *bcache_debug;
+struct dentry *bcache_debug = NULL;
 
 #ifdef CONFIG_BCACHE_DEBUG
 
-- 
2.17.1



[PATCH 7/9] bcache: add code comments for bset.c

2018-07-26 Thread Coly Li
This patch tries to add code comments in bset.c, to make some tricky code
and design decisions more comprehensible. Most of the information in this
patch comes from a discussion between Kent and me, in which he offered
very informative details. If there is any mistake in the description of
the ideas behind the code, no doubt it comes from my misrepresentation.

Signed-off-by: Coly Li 
---
 drivers/md/bcache/bset.c | 63 
 1 file changed, 63 insertions(+)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index f3403b45bc28..596c93b44e9b 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -366,6 +366,10 @@ EXPORT_SYMBOL(bch_btree_keys_init);
 
 /* Binary tree stuff for auxiliary search trees */
 
+/*
+ * Return the array index next to j when doing an in-order traversal
+ * of a binary tree which is stored in a linear array
+ */
 static unsigned inorder_next(unsigned j, unsigned size)
 {
if (j * 2 + 1 < size) {
@@ -379,6 +383,10 @@ static unsigned inorder_next(unsigned j, unsigned size)
return j;
 }
 
+/*
+ * Return the array index previous to j when doing an in-order traversal
+ * of a binary tree which is stored in a linear array
+ */
 static unsigned inorder_prev(unsigned j, unsigned size)
 {
if (j * 2 < size) {
@@ -421,6 +429,10 @@ static unsigned __to_inorder(unsigned j, unsigned size, unsigned extra)
return j;
 }
 
+/*
+ * Return the cacheline index in bset_tree->data, where j is the index
+ * from the linear array which stores the auxiliary binary tree
+ */
 static unsigned to_inorder(unsigned j, struct bset_tree *t)
 {
return __to_inorder(j, t->size, t->extra);
@@ -441,6 +453,10 @@ static unsigned __inorder_to_tree(unsigned j, unsigned size, unsigned extra)
return j;
 }
 
+/*
+ * Return an index into the linear array which stores the auxiliary binary
+ * tree; j is the cacheline index of t->data.
+ */
 static unsigned inorder_to_tree(unsigned j, struct bset_tree *t)
 {
return __inorder_to_tree(j, t->size, t->extra);
@@ -546,6 +562,20 @@ static inline uint64_t shrd128(uint64_t high, uint64_t low, uint8_t shift)
return low;
 }
 
+/*
+ * Calculate the mantissa value for struct bkey_float.
+ * If the most significant bit of f->exponent is not set, then
+ *  - f->exponent >> 6 is 0
+ *  - p[0] points to bkey->low
+ *  - p[-1] borrows bits from KEY_INODE() of bkey->high
+ * If the most significant bit of f->exponent is set, then
+ *  - f->exponent >> 6 is 1
+ *  - p[0] points to bits from KEY_INODE() of bkey->high
+ *  - p[-1] points to other bits from KEY_INODE() of
+ *    bkey->high too.
+ * See make_bfloat() to check when the most significant bit of
+ * f->exponent is set or not.
+ */
 static inline unsigned bfloat_mantissa(const struct bkey *k,
   struct bkey_float *f)
 {
@@ -570,6 +600,16 @@ static void make_bfloat(struct bset_tree *t, unsigned j)
BUG_ON(m < l || m > r);
BUG_ON(bkey_next(p) != m);
 
+   /*
+* If l and r have different KEY_INODE values (different backing
+* devices), f->exponent records how many least significant bits
+* differ in the KEY_INODE values, and sets the most significant
+* bit to 1 (by adding 64).
+* If l and r have the same KEY_INODE value, f->exponent records
+* how many bits differ in the least significant bits of bkey->low.
+* See bfloat_mantissa() for how the most significant bit of
+* f->exponent is used to calculate the bfloat mantissa value.
+*/
if (KEY_INODE(l) != KEY_INODE(r))
f->exponent = fls64(KEY_INODE(r) ^ KEY_INODE(l)) + 64;
else
@@ -633,6 +673,15 @@ void bch_bset_init_next(struct btree_keys *b, struct bset *i, uint64_t magic)
 }
 EXPORT_SYMBOL(bch_bset_init_next);
 
+/*
+ * Build the auxiliary binary tree 'struct bset_tree *t'; this tree is used
+ * to accelerate bkey search in a btree node (pointed to by bset_tree->data
+ * in memory). After searching in the auxiliary tree by calling
+ * bset_search_tree(), a struct bset_search_iter is returned which indicates
+ * the range [l, r] in bset_tree->data where the bkey being searched for
+ * might be. Then a following linear comparison does the exact search; see
+ * __bch_bset_search() for how the auxiliary tree is used.
+ */
 void bch_bset_build_written_tree(struct btree_keys *b)
 {
struct bset_tree *t = bset_tree_last(b);
@@ -898,6 +947,17 @@ static struct bset_search_iter bset_search_tree(struct bset_tree *t,
unsigned inorder, j, n = 1;
 
do {
+   /*
+* A bit trick here.
+* If p < t->size, (int)(p - t->size) is a negative value and
+* the most significant bit is set; right shifting by 31 bits
+* gives all 1s. If p >= t->size, the most significant bit is
+* not set; right shifting by 31 bits gives 0.
+* So the following 2 lines are equivalent to
+*  if (p >= t->size)
+*  p = 0;
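
The branch-free clamp described above can be checked with a small
user-space program (an illustrative sketch, not part of the patch):

#include <assert.h>

/* branch-free form of: if (p >= size) p = 0;
 * (p - size) is negative exactly when p < size, so the arithmetic
 * right shift yields all 1s (keep p) or 0 (zero p) */
static unsigned int clamp_index(unsigned int p, unsigned int size)
{
	p &= (int)(p - size) >> 31;
	return p;
}

int main(void)
{
	assert(clamp_index(3, 8) == 3);	/* p < size: kept */
	assert(clamp_index(8, 8) == 0);	/* p >= size: zeroed */
	assert(clamp_index(9, 8) == 0);
	return 0;
}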

[PATCH 1/9] bcache: do not check return value of debugfs_create_dir()

2018-07-26 Thread Coly Li
Greg KH suggests that normal code should not care about debugfs. Therefore,
no matter whether the debugfs_create_dir() execution succeeds or fails, it
is unnecessary to check its return value.

There are two functions that call debugfs_create_dir() and check the
return value: bch_debug_init() and closure_debug_init(). This patch
changes these two functions from int to void type, and ignores the return
values of debugfs_create_dir().

This patch does not fix an exact bug, it just makes things work as they
should.

Signed-off-by: Coly Li 
Suggested-by: Greg Kroah-Hartman 
Cc: Kai Krakow 
Cc: Kent Overstreet 
---
 drivers/md/bcache/bcache.h  |  2 +-
 drivers/md/bcache/closure.c | 13 +
 drivers/md/bcache/closure.h |  4 ++--
 drivers/md/bcache/debug.c   | 11 ++-
 drivers/md/bcache/super.c   |  4 +++-
 5 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 872ef4d67711..a44bd427e5ba 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -1001,7 +1001,7 @@ void bch_open_buckets_free(struct cache_set *);
 int bch_cache_allocator_start(struct cache *ca);
 
 void bch_debug_exit(void);
-int bch_debug_init(struct kobject *);
+void bch_debug_init(struct kobject *);
 void bch_request_exit(void);
 int bch_request_init(void);
 
diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 0e14969182c6..618253683d40 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -199,11 +199,16 @@ static const struct file_operations debug_ops = {
.release= single_release
 };
 
-int __init closure_debug_init(void)
+void __init closure_debug_init(void)
 {
-   closure_debug = debugfs_create_file("closures",
-   0400, bcache_debug, NULL, &debug_ops);
-   return IS_ERR_OR_NULL(closure_debug);
+   if (!IS_ERR_OR_NULL(bcache_debug))
+   /*
+* it is unnecessary to check return value of
+* debugfs_create_file(), we should not care
+* about this.
+*/
+   closure_debug = debugfs_create_file(
+   "closures", 0400, bcache_debug, NULL, _ops);
 }
 #endif
 
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 71427eb5fdae..7c2c5bc7c88b 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -186,13 +186,13 @@ static inline void closure_sync(struct closure *cl)
 
 #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
 
-int closure_debug_init(void);
+void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline int closure_debug_init(void) { return 0; }
+static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index d030ce3025a6..57f8f5aeee55 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -248,11 +248,12 @@ void bch_debug_exit(void)
debugfs_remove_recursive(bcache_debug);
 }
 
-int __init bch_debug_init(struct kobject *kobj)
+void __init bch_debug_init(struct kobject *kobj)
 {
-   if (!IS_ENABLED(CONFIG_DEBUG_FS))
-   return 0;
-
+   /*
+* it is unnecessary to check the return value of
+* debugfs_create_dir(); we should not care
+* about this.
+*/
bcache_debug = debugfs_create_dir("bcache", NULL);
-   return IS_ERR_OR_NULL(bcache_debug);
 }
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e0a92104ca23..c7ffa6ef3f82 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2345,10 +2345,12 @@ static int __init bcache_init(void)
goto err;
 
if (bch_request_init() ||
-   bch_debug_init(bcache_kobj) || closure_debug_init() ||
sysfs_create_files(bcache_kobj, files))
goto err;
 
+   bch_debug_init(bcache_kobj);
+   closure_debug_init();
+
return 0;
 err:
bcache_exit();
-- 
2.17.1



[PATCH 9/9] bcache: set max writeback rate when I/O request is idle

2018-07-26 Thread Coly Li
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce a performance regression because multiple faster
writeback threads of the idle bcache devices will compete for the btree
level locks with the bcache device that has I/O requests coming.

This patch fixes the above issue by only permitting fast writeback when
all bcache devices attached to the cache set are idle. And if one of the
bcache devices has a new I/O request coming, minimize all writeback
throughput immediately and let the PI controller __update_writeback_rate()
decide the upcoming writeback rate for each bcache device.

Also when all bcache devices are idle, limiting the writeback rate to a
small number is a waste of throughput, especially when the backing devices
are slower non-rotational devices (e.g. SATA SSD). This patch sets a max
writeback rate for each backing device if the whole cache set is idle. A
faster writeback rate in idle time means new I/Os may have more available
space for dirty data, and people may observe a better write performance
then.

Please note bcache may change its cache mode at run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on the cache.
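
The request-path side of the scheme is roughly the following (a paraphrased
sketch distilled from this patch, not the literal hunk):

+	/* any incoming I/O marks the whole cache set as busy ... */
+	atomic_set(&d->c->idle_counter, 0);
+	/* ... and drops every device back to the minimum writeback rate,
+	 * letting the PI controller take over from there */
+	if (atomic_read(&d->c->at_max_writeback_rate) == 1) {
+		atomic_set(&d->c->at_max_writeback_rate, 0);
+		quit_max_writeback_rate(d->c, dc);
+	}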

Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
Cc: sta...@vger.kernel.org #4.16+
Signed-off-by: Coly Li 
Tested-by: Kai Krakow 
Tested-by: Stefan Priebe 
Cc: Michael Lyle 
---
 drivers/md/bcache/bcache.h| 10 ++--
 drivers/md/bcache/request.c   | 54 -
 drivers/md/bcache/super.c |  4 ++
 drivers/md/bcache/sysfs.c | 15 --
 drivers/md/bcache/util.c  |  2 +-
 drivers/md/bcache/util.h  |  2 +-
 drivers/md/bcache/writeback.c | 91 +++
 7 files changed, 134 insertions(+), 44 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5f7082aab1b0..97489573dedc 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -328,13 +328,6 @@ struct cached_dev {
 	 */
 	atomic_t		has_dirty;
 
-	/*
-	 * Set to zero by things that touch the backing volume-- except
-	 * writeback.  Incremented by writeback.  Used to determine when to
-	 * accelerate idle writeback.
-	 */
-	atomic_t		backing_idle;
-
 	struct bch_ratelimit	writeback_rate;
 	struct delayed_work	writeback_rate_update;
 
@@ -515,6 +508,8 @@ struct cache_set {
 	struct cache_accounting accounting;
 
 	unsigned long		flags;
+	atomic_t		idle_counter;
+	atomic_t		at_max_writeback_rate;
 
 	struct cache_sb		sb;
 
@@ -524,6 +519,7 @@ struct cache_set {
 
 	struct bcache_device	**devices;
 	unsigned		devices_max_used;
+	atomic_t		attached_dev_nr;
 	struct list_head	cached_devs;
 	uint64_t		cached_dev_sectors;
 	atomic_long_t		flash_dev_dirty_sectors;
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 91206f329971..86a977c2a176 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1105,6 +1105,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
generic_make_request(bio);
 }
 
+static void quit_max_writeback_rate(struct cache_set *c,
+   struct cached_dev *this_dc)
+{
+   int i;
+   struct bcache_device *d;
+   struct cached_dev *dc;
+
+   /*
+* The mutex bch_register_lock may compete with other parallel requesters,
+* or attach/detach operations on other backing devices. Waiting for
+* the mutex lock may increase I/O request latency for seconds or more.
+* To avoid such a situation, if mutex_trylock() failed, only the writeback
+* rate of the current cached device is set to 1, and __update_writeback_rate()
+* will decide the writeback rate of the other cached devices (remember now
+* c->idle_counter is 0 already).
+*/
+   if (mutex_trylock(&bch_register_lock)) {
+   for (i = 0; i < c->devices_max_used; i++) {
+   if (!c->devices[i])
+   continue;
+
+   if (UUID_FLASH_ONLY(&c->uuids[i]))
+   continue;
+
+   d = c->devices[i];
+   dc = container_of(d, struct cached_dev, disk);
+   /*
+* set writeback rate to default minimum value,
+* then let update_writeback_rate() to decide the
+* upcoming rate.
+*/
+

[PATCH 5/9] bcache: fix mistaken code comments in struct cache

2018-07-26 Thread Coly Li
---
 drivers/md/bcache/bcache.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index a44bd427e5ba..5f7082aab1b0 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -423,8 +423,8 @@ struct cache {
/*
 * When allocating new buckets, prio_write() gets first dibs - since we
 * may not be allocate at all without writing priorities and gens.
-* prio_buckets[] contains the last buckets we wrote priorities to (so
-* gc can mark them as metadata), prio_next[] contains the buckets
+* prio_last_buckets[] contains the last buckets we wrote priorities to (so
+* gc can mark them as metadata), prio_buckets[] contains the buckets
 * allocated for the next prio write.
 */
uint64_t*prio_buckets;
-- 
2.17.1



[GIT PULL] nvme fixes for 4.18

2018-07-26 Thread Christoph Hellwig
Two small fixes each for the FC code and the target.


The following changes since commit 8f3ea35929a0806ad1397db99a89ffee0140822a:

  nbd: handle unexpected replies better (2018-07-16 10:14:40 -0600)

are available in the Git repository at:

  git://git.infradead.org/nvme.git nvme-4.18

for you to fetch changes up to 405a7519607e7a364114896264440c0f87b325c0:

  nvmet: only check for filebacking on -ENOTBLK (2018-07-25 13:14:04 +0200)


Hannes Reinecke (2):
  nvmet: fixup crash on NULL device path
  nvmet: only check for filebacking on -ENOTBLK

James Smart (2):
  nvmet-fc: fix target sgl list on large transfers
  nvme: if_ready checks to fail io to deleting controller

 drivers/nvme/host/fabrics.c| 10 +++---
 drivers/nvme/host/fabrics.h|  3 ++-
 drivers/nvme/host/fc.c |  2 +-
 drivers/nvme/host/rdma.c   |  2 +-
 drivers/nvme/target/configfs.c |  9 +++--
 drivers/nvme/target/core.c |  2 +-
 drivers/nvme/target/fc.c   | 44 +-
 drivers/nvme/target/loop.c |  2 +-
 8 files changed, 55 insertions(+), 19 deletions(-)


Re: [PATCH] blktests: Add '--outdir' to store results in a different directory

2018-07-26 Thread Hannes Reinecke
On 07/25/2018 11:23 PM, Omar Sandoval wrote:
> On Tue, Jul 17, 2018 at 03:27:50PM +0200, Hannes Reinecke wrote:
>> Adding an option '--outdir' to store results in a different
>> directory so as not to clutter the git repository itself.
>>
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  check | 14 ++
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/check b/check
>> index a635531..42d07f8 100755
>> --- a/check
>> +++ b/check
>> @@ -334,7 +334,7 @@ _call_test() {
>>  fi
>>  
>>  trap _cleanup EXIT
>> -if ! TMPDIR="$(mktemp --tmpdir -p "$PWD/results" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>> +if ! TMPDIR="$(mktemp --tmpdir -p "$RESULTS_DIR" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>>  return
>>  fi
>>  
>> @@ -415,7 +415,7 @@ _run_test() {
>>  return 0
>>  fi
>>  
>> -RESULTS_DIR="results/nodev"
>> +RESULTS_DIR="${OUT_DIR}/results/nodev"
>>  _call_test test
>>  else
>>  if [[ ${#TEST_DEVS[@]} -eq 0 ]]; then
>> @@ -434,7 +434,7 @@ _run_test() {
>>  _output_notrun "$TEST_NAME => $(basename "$TEST_DEV")"
>>  continue
>>  fi
>> -RESULTS_DIR="results/$(basename "$TEST_DEV")"
>> +RESULTS_DIR="${OUT_DIR}/results/$(basename "$TEST_DEV")"
>>  if ! _call_test test_device; then
>>  ret=1
>>  fi
>> @@ -567,6 +567,7 @@ Test runs:
>>   tests to run
>>  
>>  Miscellaneous:
>> +  -o, --outdir=OUTDIRwrite results into the specified directory
>>-h, --help display this help message and exit"
>>  
>>  case "$1" in
>> @@ -581,12 +582,13 @@ Miscellaneous:
>>  esac
>>  }
>>  
>> -if ! TEMP=$(getopt -o 'dq::x:h' --long 'quick::,exclude:,help' -n "$0" -- "$@"); then
>> +if ! TEMP=$(getopt -o 'dq::o:x:h' --long 'quick::,exclude:,outdir:,help' -n "$0" -- "$@"); then
>>  exit 1
>>  fi
>>  
>>  eval set -- "$TEMP"
>>  unset TEMP
>> +OUT_DIR="."
> 
> This doesn't allow setting it from the config file. How about this?
> 
> diff --git a/check b/check
> index 5f4461f..5e99415 100755
> --- a/check
> +++ b/check
> @@ -313,7 +313,7 @@ _call_test() {
>   local test_func="$1"
>   local seqres="${RESULTS_DIR}/${TEST_NAME}"
>   # shellcheck disable=SC2034
> - FULL="$PWD/${seqres}.full"
> + FULL="${seqres}.full"
>   declare -A TEST_DEV_QUEUE_SAVED
>  
>   _read_last_test_run
> @@ -334,7 +334,7 @@ _call_test() {
>   fi
>  
>   trap _cleanup EXIT
> - if ! TMPDIR="$(mktemp --tmpdir -p "$PWD/results" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
> + if ! TMPDIR="$(mktemp --tmpdir -p "$OUTPUT" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>   return
>   fi
>  
> @@ -415,7 +415,7 @@ _run_test() {
>   return 0
>   fi
>  
> - RESULTS_DIR="results/nodev"
> + RESULTS_DIR="$OUTPUT/nodev"
>   _call_test test
>   else
>   if [[ ${#TEST_DEVS[@]} -eq 0 ]]; then
> @@ -434,7 +434,7 @@ _run_test() {
>   _output_notrun "$TEST_NAME => $(basename "$TEST_DEV")"
>   continue
>   fi
> - RESULTS_DIR="results/$(basename "$TEST_DEV")"
> + RESULTS_DIR="$OUTPUT/$(basename "$TEST_DEV")"
>   if ! _call_test test_device; then
>   ret=1
>   fi
> @@ -559,6 +559,9 @@ Test runs:
>-d, --device-only   only run tests which use a test device from the
>TEST_DEVS config setting
>  
> +  -o, --output=DIRoutput results to the given directory (the default is
> +  ./results)
> +
>-q, --quick=SECONDS do a quick run (only run quick tests and limit 
> the
>runtime of longer tests to the given timeout,
>defaulting to 30 seconds)
> @@ -581,7 +584,7 @@ Miscellaneous:
>   esac
>  }
>  
> -if ! TEMP=$(getopt -o 'dq::x:h' --long 'quick::,exclude:,help' -n "$0" -- "$@"); then
> +if ! TEMP=$(getopt -o 'do:q::x:h' --long 'quick::,exclude:,output:,help' -n "$0" -- "$@"); then
>   exit 1
>  fi
>  
> @@ -596,6 +599,7 @@ fi
>  # Default configuration.
>  : "${DEVICE_ONLY:=0}"
>  : "${QUICK_RUN:=0}"
> +: "${OUTPUT:=results}"
>  if [[ -v EXCLUDE ]] && ! declare -p EXCLUDE | grep -q '^declare -a'; then
>   # If EXCLUDE was not defined as an array, convert it to one.
>   # shellcheck disable=SC2190,SC2206
> @@ -617,6 +621,10 @@ while true; do
>   DEVICE_ONLY=1
>   shift
>   ;;
> + '-o'|'--output')
> + OUTPUT="$2"
> + shift 2
> +   

Re: [PATCH v3] bcache: set max writeback rate when I/O request is idle

2018-07-26 Thread Coly Li
On 2018/7/26 7:49 PM, Stefan Priebe - Profihost AG wrote:
> On 26.07.2018 12:42, Coly Li wrote:
>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>> allows the writeback rate to be faster if there is no I/O request on a
>> bcache device. It works well if there is only one bcache device attached
>> to the cache set. If there are many bcache devices attached to a cache
>> set, it may introduce a performance regression because multiple faster
>> writeback threads of the idle bcache devices will compete for the btree
>> level locks with the bcache device that has I/O requests coming.
>>
>> This patch fixes the above issue by only permitting fast writeback when
>> all bcache devices attached to the cache set are idle. And if one of the
>> bcache devices has a new I/O request coming, minimize all writeback
>> throughput immediately and let the PI controller __update_writeback_rate()
>> decide the upcoming writeback rate for each bcache device.
>>
>> Also when all bcache devices are idle, limiting the writeback rate to a
>> small number is a waste of throughput, especially when the backing devices
>> are slower non-rotational devices (e.g. SATA SSD). This patch sets a max
>> writeback rate for each backing device if the whole cache set is idle. A
>> faster writeback rate in idle time means new I/Os may have more available
>> space for dirty data, and people may observe a better write performance
>> then.
>>
>> Please note bcache may change its cache mode at run time, and this patch
>> still works if the cache mode is switched from writeback mode and there
>> is still dirty data on the cache.
> 
> Tested-by: Stefan Priebe 
> 
> Working fine now.
> 

Great news, thanks !

Coly Li

>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing 
>> idle")
>> Cc: sta...@vger.kernel.org #4.16+
>> Signed-off-by: Coly Li 
>> Tested-by: Kai Krakow 
>> Cc: Michael Lyle 
>> Cc: Stefan Priebe 
>> ---
>> Channgelog:
>> v3, Do not acquire bch_register_lock in set_at_max_writeback_rate().
>> v2, Fix a deadlock reported by Stefan Priebe.
>> v1, Initial version.
>>
>>  drivers/md/bcache/bcache.h| 10 ++--
>>  drivers/md/bcache/request.c   | 54 -
>>  drivers/md/bcache/super.c |  4 ++
>>  drivers/md/bcache/sysfs.c | 14 --
>>  drivers/md/bcache/util.c  |  2 +-
>>  drivers/md/bcache/util.h  |  2 +-
>>  drivers/md/bcache/writeback.c | 91 +++
>>  7 files changed, 133 insertions(+), 44 deletions(-)
[snipped]



Re: [PATCH v3] bcache: set max writeback rate when I/O request is idle

2018-07-26 Thread Stefan Priebe - Profihost AG
On 26.07.2018 12:42, Coly Li wrote:
> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> allows the writeback rate to be faster if there is no I/O request on a
> bcache device. It works well if there is only one bcache device attached
> to the cache set. If there are many bcache devices attached to a cache
> set, it may introduce a performance regression because multiple faster
> writeback threads of the idle bcache devices will compete for the btree
> level locks with the bcache device that has I/O requests coming.
> 
> This patch fixes the above issue by only permitting fast writeback when
> all bcache devices attached to the cache set are idle. And if one of the
> bcache devices has a new I/O request coming, minimize all writeback
> throughput immediately and let the PI controller __update_writeback_rate()
> decide the upcoming writeback rate for each bcache device.
> 
> Also when all bcache devices are idle, limiting the writeback rate to a
> small number is a waste of throughput, especially when the backing devices
> are slower non-rotational devices (e.g. SATA SSD). This patch sets a max
> writeback rate for each backing device if the whole cache set is idle. A
> faster writeback rate in idle time means new I/Os may have more available
> space for dirty data, and people may observe a better write performance
> then.
> 
> Please note bcache may change its cache mode at run time, and this patch
> still works if the cache mode is switched from writeback mode and there
> is still dirty data on the cache.

Tested-by: Stefan Priebe 

Working fine now.

> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> Cc: sta...@vger.kernel.org #4.16+
> Signed-off-by: Coly Li 
> Tested-by: Kai Krakow 
> Cc: Michael Lyle 
> Cc: Stefan Priebe 
> ---
> Channgelog:
> v3, Do not acquire bch_register_lock in set_at_max_writeback_rate().
> v2, Fix a deadlock reported by Stefan Priebe.
> v1, Initial version.
> 
>  drivers/md/bcache/bcache.h| 10 ++--
>  drivers/md/bcache/request.c   | 54 -
>  drivers/md/bcache/super.c |  4 ++
>  drivers/md/bcache/sysfs.c | 14 --
>  drivers/md/bcache/util.c  |  2 +-
>  drivers/md/bcache/util.h  |  2 +-
>  drivers/md/bcache/writeback.c | 91 +++
>  7 files changed, 133 insertions(+), 44 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 872ef4d67711..13f908be42ba 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -328,13 +328,6 @@ struct cached_dev {
>  	 */
>  	atomic_t		has_dirty;
>  
> -	/*
> -	 * Set to zero by things that touch the backing volume-- except
> -	 * writeback.  Incremented by writeback.  Used to determine when to
> -	 * accelerate idle writeback.
> -	 */
> -	atomic_t		backing_idle;
> -
>  	struct bch_ratelimit	writeback_rate;
>  	struct delayed_work	writeback_rate_update;
>  
> @@ -515,6 +508,8 @@ struct cache_set {
>  	struct cache_accounting accounting;
>  
>  	unsigned long		flags;
> +	atomic_t		idle_counter;
> +	atomic_t		at_max_writeback_rate;
>  
>  	struct cache_sb		sb;
>  
> @@ -524,6 +519,7 @@ struct cache_set {
>  
>  	struct bcache_device	**devices;
>  	unsigned		devices_max_used;
> +	atomic_t		attached_dev_nr;
>  	struct list_head	cached_devs;
>  	uint64_t		cached_dev_sectors;
>  	atomic_long_t		flash_dev_dirty_sectors;
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 8eece9ef9f46..26f97acde403 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1105,6 +1105,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
>   generic_make_request(bio);
>  }
>  
> +static void quit_max_writeback_rate(struct cache_set *c,
> + struct cached_dev *this_dc)
> +{
> + int i;
> + struct bcache_device *d;
> + struct cached_dev *dc;
> +
> + /*
> +  * The mutex bch_register_lock may compete with other parallel requesters,
> +  * or attach/detach operations on other backing devices. Waiting for
> +  * the mutex lock may increase I/O request latency for seconds or more.
> +  * To avoid such a situation, if mutex_trylock() failed, only the writeback
> +  * rate of the current cached device is set to 1, and __update_writeback_rate()
> +  * will decide the writeback rate of the other cached devices (remember now
> +  * c->idle_counter is 0 already).
> +  */
> + if (mutex_trylock(&bch_register_lock)) {
> + for (i = 0; i < c->devices_max_used; i++) {
> + if (!c->devices[i])
> + continue;
> +
> + if (UUID_FLASH_ONLY(&c->uuids[i]))
> + 

Re: [PATCH for v4.18] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER

2018-07-26 Thread Johannes Thumshirn
I'm OK with the rename in general but not sure about doing it this
late in the cycle.

Anyways,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH v3] bcache: set max writeback rate when I/O request is idle

2018-07-26 Thread Coly Li
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce a performance regression because multiple faster
writeback threads of the idle bcache devices will compete for the btree
level locks with the bcache device that has I/O requests coming.

This patch fixes the above issue by only permitting fast writeback when
all bcache devices attached to the cache set are idle. And if one of the
bcache devices has a new I/O request coming, minimize all writeback
throughput immediately and let the PI controller __update_writeback_rate()
decide the upcoming writeback rate for each bcache device.

Also when all bcache devices are idle, limiting the writeback rate to a
small number is a waste of throughput, especially when the backing devices
are slower non-rotational devices (e.g. SATA SSD). This patch sets a max
writeback rate for each backing device if the whole cache set is idle. A
faster writeback rate in idle time means new I/Os may have more available
space for dirty data, and people may observe a better write performance
then.

Please note bcache may change its cache mode at run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on the cache.

Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
Cc: sta...@vger.kernel.org #4.16+
Signed-off-by: Coly Li 
Tested-by: Kai Krakow 
Cc: Michael Lyle 
Cc: Stefan Priebe 
---
Channgelog:
v3, Do not acquire bch_register_lock in set_at_max_writeback_rate().
v2, Fix a deadlock reported by Stefan Priebe.
v1, Initial version.

 drivers/md/bcache/bcache.h| 10 ++--
 drivers/md/bcache/request.c   | 54 -
 drivers/md/bcache/super.c |  4 ++
 drivers/md/bcache/sysfs.c | 14 --
 drivers/md/bcache/util.c  |  2 +-
 drivers/md/bcache/util.h  |  2 +-
 drivers/md/bcache/writeback.c | 91 +++
 7 files changed, 133 insertions(+), 44 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 872ef4d67711..13f908be42ba 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -328,13 +328,6 @@ struct cached_dev {
 	 */
 	atomic_t		has_dirty;
 
-	/*
-	 * Set to zero by things that touch the backing volume-- except
-	 * writeback.  Incremented by writeback.  Used to determine when to
-	 * accelerate idle writeback.
-	 */
-	atomic_t		backing_idle;
-
 	struct bch_ratelimit	writeback_rate;
 	struct delayed_work	writeback_rate_update;
 
@@ -515,6 +508,8 @@ struct cache_set {
 	struct cache_accounting accounting;
 
 	unsigned long		flags;
+	atomic_t		idle_counter;
+	atomic_t		at_max_writeback_rate;
 
 	struct cache_sb		sb;
 
@@ -524,6 +519,7 @@ struct cache_set {
 
 	struct bcache_device	**devices;
 	unsigned		devices_max_used;
+	atomic_t		attached_dev_nr;
 	struct list_head	cached_devs;
 	uint64_t		cached_dev_sectors;
 	atomic_long_t		flash_dev_dirty_sectors;
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 8eece9ef9f46..26f97acde403 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1105,6 +1105,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
generic_make_request(bio);
 }
 
+static void quit_max_writeback_rate(struct cache_set *c,
+   struct cached_dev *this_dc)
+{
+   int i;
+   struct bcache_device *d;
+   struct cached_dev *dc;
+
+   /*
+* The mutex bch_register_lock may compete with other parallel requesters,
+* or attach/detach operations on other backing devices. Waiting for
+* the mutex lock may increase I/O request latency for seconds or more.
+* To avoid such a situation, if mutex_trylock() failed, only the writeback
+* rate of the current cached device is set to 1, and __update_writeback_rate()
+* will decide the writeback rate of the other cached devices (remember now
+* c->idle_counter is 0 already).
+*/
+   if (mutex_trylock(&bch_register_lock)) {
+   for (i = 0; i < c->devices_max_used; i++) {
+   if (!c->devices[i])
+   continue;
+
+   if (UUID_FLASH_ONLY(&c->uuids[i]))
+   continue;
+
+   d = c->devices[i];
+   dc = container_of(d, struct cached_dev, disk);
+   /*
+* set writeback rate to default minimum value,
+  

Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

2018-07-26 Thread Christoph Hellwig
Both the changelog and the comments should use up all horizontal space
(73 and 80 chars respectively), but that can be fixed up when applied.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case

2018-07-26 Thread Christoph Hellwig
On Wed, Jul 25, 2018 at 11:15:08PM +0200, Martin Wilck wrote:
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
>  simplified bdev direct-io")
> Reviewed-by: Ming Lei 
> Signed-off-by: Martin Wilck 

I'm pretty sure I already ACKed this last time, but here we go
again:

Reviewed-by: Christoph Hellwig 


Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-26 Thread jianchao.wang



On 07/26/2018 03:52 PM, jianchao.wang wrote:
> In addition, .runtime_suspend is invoked under spinlock and irq-disabled.
> So sleep is forbidden here.
> Please refer to rpm_suspend
> 
> * This function must be called under dev->power.lock with interrupts disabled
> 
__rpm_callback will unlock the spinlock before invoking the callback.
Sorry for the noise.
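
For reference, the relevant pattern in drivers/base/power/runtime.c is
roughly the following (a sketch from memory, not checked against a
specific kernel version):

static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
{
	int retval;

	/* drop power.lock (and possibly re-enable irqs) around the call */
	if (dev->power.irq_safe)
		spin_unlock(&dev->power.lock);
	else
		spin_unlock_irq(&dev->power.lock);

	retval = cb(dev);

	if (dev->power.irq_safe)
		spin_lock(&dev->power.lock);
	else
		spin_lock_irq(&dev->power.lock);

	return retval;
}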

Thanks
Jianchao



Re: [PATCH v2 4/5] block, scsi: Rework runtime power management

2018-07-26 Thread jianchao.wang



On 07/26/2018 10:45 AM, jianchao.wang wrote:
> Hi Bart
> 
> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
>> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>  return ret;
>>  
>>  blk_pm_runtime_lock(q);
>> +blk_set_preempt_only(q);
> 
> We only stop non-RQF_PM requests from entering when the state is
> RPM_SUSPENDING or RPM_RESUMING. blk_pre_runtime_suspend() should only
> _check_ whether runtime suspend is allowed, so we should not set
> preempt_only here.
> 
> 
>> +percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);

In addition, .runtime_suspend is invoked under spinlock and irq-disabled.
So sleep is forbidden here.
Please refer to rpm_suspend

* This function must be called under dev->power.lock with interrupts disabled

Thanks
Jianchao
>>  
>>  spin_lock_irq(q->queue_lock);
>> -if (q->nr_pending) {
>> +if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>  ret = -EBUSY;
>>  pm_runtime_mark_last_busy(q->dev);
>>  } else {
>> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>  }
>>  spin_unlock_irq(q->queue_lock);
>>  
>> +percpu_ref_switch_to_percpu(&q->q_usage_counter);
>>  blk_pm_runtime_unlock(q);
> 


Re: [PATCH v2] bcache: set max writeback rate when I/O request is idle

2018-07-26 Thread Coly Li
On 2018/7/26 1:44 PM, Stefan Priebe - Profihost AG wrote:
>> On 25.07.2018 08:16, Stefan Priebe - Profihost AG wrote:
>>
>>> On 24.07.2018 18:36, Coly Li wrote:
 On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote:
> On 24.07.2018 10:28, Coly Li wrote:
>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote:
>>> On 24.07.2018 06:03, Coly Li wrote:
>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>> allows the writeback rate to be faster if there is no I/O request on a
>>> bcache device. It works well if there is only one bcache device attached
>>> to the cache set. If there are many bcache devices attached to a cache
>>> set, it may introduce a performance regression because multiple faster
>>> writeback threads of the idle bcache devices will compete for the btree
>>> level locks with the bcache device that has I/O requests coming.
>>>
>>> This patch fixes the above issue by only permitting fast writeback when
>>> all bcache devices attached to the cache set are idle. And if one of the
>>> bcache devices has a new I/O request coming, minimize all writeback
>>> throughput immediately and let the PI controller __update_writeback_rate()
>>> decide the upcoming writeback rate for each bcache device.
>>>
>>> Also when all bcache devices are idle, limiting the writeback rate to a
>>> small number is a waste of throughput, especially when the backing devices
>>> are slower non-rotational devices (e.g. SATA SSD). This patch sets a max
>>> writeback rate for each backing device if the whole cache set is idle. A
>>> faster writeback rate in idle time means new I/Os may have more available
>>> space for dirty data, and people may observe a better write performance
>>> then.
>>>
>>> Please note bcache may change its cache mode at run time, and this patch
>>> still works if the cache mode is switched from writeback mode and there
>>> is still dirty data on the cache.
>>>
>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing 
>>> idle")
>>> Cc: sta...@vger.kernel.org #4.16+
>>> Signed-off-by: Coly Li 
>>> Tested-by: Kai Krakow 
>>> Cc: Michael Lyle 
>>> Cc: Stefan Priebe 
>>
>> Hi Coly,
>>
>> I'm still experiencing the same bug as yesterday: every second reboot
>> I get a deadlock in cache_set_free.
>>
>> Sadly it's so late in the shutdown process that netconsole is already
>> unloaded, or at least I get no messages while it happens. The traces
>> look the same as yesterday.
>
> Hi Stefan,
>
> Do you use the v2 patch on latest SLE15 kernel source?

 Yes - I use the kernel code from here:
 https://github.com/openSUSE/kernel-source/commits/SLE15

> Let me try to
> reproduce it on my machine. I assume when you reboot, the cache is still
> dirty and not cleaned up, am I right?

 Yes with around 15GB of dirty data - ceph is running on top of it and
 generated many random writes.

> And which file system do you mount
> on top of the bcache device ?
 XFS
>>>
>>> Hi Stefan,
>>>
>>> Thanks for the information. I managed to reproduce a similar deadlock on
>>> my small box. Let me see how to fix it and post a new version ASAP :-)
>>
>> Great!
> 
> Any news on this already?

I am testing the new version, and it also requires other patches I
posted earlier for 4.19 (which get rid of bch_register_lock in the
writeback thread).

Hope I can make it today :-)

Coly Li



Re: [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case

2018-07-26 Thread Hannes Reinecke
On 07/25/2018 11:15 PM, Martin Wilck wrote:
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
>  simplified bdev direct-io")
> Reviewed-by: Ming Lei 
> Signed-off-by: Martin Wilck 
> ---
>  fs/block_dev.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)