Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
On 07/27/2018 06:24 AM, Bart Van Assche wrote:
> On Thu, 2018-07-26 at 10:45 +0800, jianchao.wang wrote:
>> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
>>> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>> 		return ret;
>>>
>>> 	blk_pm_runtime_lock(q);
>>> +	blk_set_preempt_only(q);
>>
>> We only stop non-RQF_PM requests entering when RPM_SUSPENDING and
>> RPM_RESUMING. blk_pre_runtime_suspend should only _check_ whether runtime
>> suspend is allowed, so we should not set preempt_only here.
>>
>>> +	percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
>>>
>>> 	spin_lock_irq(q->queue_lock);
>>> -	if (q->nr_pending) {
>>> +	if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>> 		ret = -EBUSY;
>>> 		pm_runtime_mark_last_busy(q->dev);
>>> 	} else {
>>> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>> 	}
>>> 	spin_unlock_irq(q->queue_lock);
>>>
>>> +	percpu_ref_switch_to_percpu(&q->q_usage_counter);
>>> 	blk_pm_runtime_unlock(q);
>
> Hello Jianchao,
>
> There is only one caller of blk_post_runtime_suspend(), namely
> sdev_runtime_suspend(). That function calls pm->runtime_suspend() before
> it calls blk_post_runtime_suspend(). I think it would be wrong to set the
> PREEMPT_ONLY flag from inside blk_post_runtime_suspend() because that
> could cause pm->runtime_suspend() to be called while a request is in
> progress.

Hi Bart,

If q_usage_counter is not zero here, we will leave the request_queue in
preempt-only mode. The request_queue should be set to preempt-only mode
only when we confirm we can set rpm_status to RPM_SUSPENDING or
RPM_RESUMING. Maybe we could set PREEMPT_ONLY after we confirm we can set
the rpm_status to RPM_SUSPENDING.
Something like:

	percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
	if (!percpu_ref_is_zero(&q->q_usage_counter)) {
		ret = -EBUSY;
		pm_runtime_mark_last_busy(q->dev);
	} else {
		blk_set_preempt_only(q);
		if (!percpu_ref_is_zero(&q->q_usage_counter)) {
			ret = -EBUSY;
			pm_runtime_mark_last_busy(q->dev);
			blk_clear_preempt_only(q);
		} else {
			q->rpm_status = RPM_SUSPENDING;
		}
	}

Thanks,
Jianchao
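[Editorial note] The check/set/re-check/rollback pattern sketched above can be modeled in userspace C. This is a hypothetical model, not kernel code: `queue_model`, `pre_runtime_suspend()` and its fields only mirror the kernel names (`q_usage_counter`, PREEMPT_ONLY, `rpm_status`) for illustration, and the atomic counter stands in for the percpu refcount.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-ins for the kernel's rpm_status values. */
enum rpm_state { RPM_ACTIVE, RPM_SUSPENDING };

struct queue_model {
	atomic_int usage;          /* stands in for q->q_usage_counter */
	int preempt_only;          /* stands in for the PREEMPT_ONLY flag */
	enum rpm_state rpm_status;
};

/*
 * Returns 0 on success, -1 (an EBUSY stand-in) if requests are in flight.
 * The point of the pattern: PREEMPT_ONLY is only left set when suspend
 * actually proceeds; on the re-check failure it is rolled back, so a busy
 * queue is never left in preempt-only mode.
 */
static int pre_runtime_suspend(struct queue_model *q)
{
	if (atomic_load(&q->usage) != 0)
		return -1;

	q->preempt_only = 1;
	/* re-check: a request may have entered between check and flag set */
	if (atomic_load(&q->usage) != 0) {
		q->preempt_only = 0;    /* roll back, don't degrade the queue */
		return -1;
	}

	q->rpm_status = RPM_SUSPENDING;
	return 0;
}

/* Scenario helpers used for checking the two outcomes. */
static int suspend_idle_queue_ok(void)
{
	struct queue_model q = { .usage = 0 };

	return pre_runtime_suspend(&q) == 0 &&
	       q.preempt_only == 1 && q.rpm_status == RPM_SUSPENDING;
}

static int suspend_busy_queue_rejected(void)
{
	struct queue_model q = { .usage = 1 };

	return pre_runtime_suspend(&q) == -1 && q.preempt_only == 0;
}
```

The model captures why the rollback branch matters: without it, the -EBUSY path would leave the flag set, which is exactly the problem discussed above.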
Re: [PATCH 0/2] blktests: test ANA base support
On Thu, Jul 26, 2018 at 02:31:32PM -0700, Omar Sandoval wrote: > On Thu, Jul 26, 2018 at 09:35:53AM -0700, James Smart wrote: > > make sure this fix has been picked up in your kernel: > > http://git.infradead.org/nvme.git/commit/6cdefc6e2ad52170f89a8d0e8b1a1339f91834dc > > > > deletes were broken otherwise and have the exact trace below. > > > > -- james > > Thanks, James, that indeed fixes the problem. I'll go ahead and merge > these tests since there's a fix in flight. By "these tests", I mean Chaitanya's tests, Hannes still has some comments to resolve :)
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
On Thu, 2018-07-26 at 10:45 +0800, jianchao.wang wrote:
> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
> > @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> > 		return ret;
> >
> > 	blk_pm_runtime_lock(q);
> > +	blk_set_preempt_only(q);
>
> We only stop non-RQF_PM requests entering when RPM_SUSPENDING and
> RPM_RESUMING. blk_pre_runtime_suspend should only _check_ whether runtime
> suspend is allowed, so we should not set preempt_only here.
>
> > +	percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
> >
> > 	spin_lock_irq(q->queue_lock);
> > -	if (q->nr_pending) {
> > +	if (!percpu_ref_is_zero(&q->q_usage_counter)) {
> > 		ret = -EBUSY;
> > 		pm_runtime_mark_last_busy(q->dev);
> > 	} else {
> > @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> > 	}
> > 	spin_unlock_irq(q->queue_lock);
> >
> > +	percpu_ref_switch_to_percpu(&q->q_usage_counter);
> > 	blk_pm_runtime_unlock(q);

Hello Jianchao,

There is only one caller of blk_post_runtime_suspend(), namely
sdev_runtime_suspend(). That function calls pm->runtime_suspend() before
it calls blk_post_runtime_suspend(). I think it would be wrong to set the
PREEMPT_ONLY flag from inside blk_post_runtime_suspend() because that
could cause pm->runtime_suspend() to be called while a request is in
progress.

Bart.
Re: [PATCH v2 3/5] block: Serialize queue freezing and blk_pre_runtime_suspend()
On Thu, 2018-07-26 at 10:45 +0800, jianchao.wang wrote:
> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
> > +
> > +void blk_pm_runtime_lock(struct request_queue *q)
> > +{
> > +	spin_lock(&q->rpm_lock);
> > +	wait_event_interruptible_locked(q->rpm_wq,
> > +			q->rpm_owner == NULL || q->rpm_owner == current);
> > +	if (q->rpm_owner == NULL)
> > +		q->rpm_owner = current;
> > +	q->rpm_nesting_level++;
> > +	spin_unlock(&q->rpm_lock);
> > +}
>
> The lock which wait_event_interruptible_locked wants to hold is the wq.lock.

The above code is indeed wrong. This is what I think it should be changed
into:

+void blk_pm_runtime_lock(struct request_queue *q)
+{
+	might_sleep();
+
+	spin_lock(&q->rpm_lock);
+	wait_event_exclusive_cmd(q->rpm_wq,
+		q->rpm_owner == NULL || q->rpm_owner == current,
+		spin_unlock(&q->rpm_lock), spin_lock(&q->rpm_lock));
+	if (q->rpm_owner == NULL)
+		q->rpm_owner = current;
+	q->rpm_nesting_level++;
+	spin_unlock(&q->rpm_lock);
+}

Bart.
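[Editorial note] The owner/nesting scheme in the corrected blk_pm_runtime_lock() above, a lock that the owning task may take recursively while other tasks wait, can be sketched in userspace with a pthread mutex and condition variable. The names (`pm_lock`, `owner`, `nesting_level`) mirror the kernel fields but this is only an illustrative model, not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>

struct pm_lock {
	pthread_mutex_t lock;       /* plays the role of q->rpm_lock */
	pthread_cond_t  wq;         /* plays the role of q->rpm_wq */
	pthread_t       owner;      /* plays the role of q->rpm_owner */
	int             has_owner;
	int             nesting_level;
};

static void pm_lock_acquire(struct pm_lock *l)
{
	pthread_mutex_lock(&l->lock);
	/* wait_event analogue: sleep until unowned or owned by ourselves */
	while (l->has_owner && !pthread_equal(l->owner, pthread_self()))
		pthread_cond_wait(&l->wq, &l->lock);
	l->owner = pthread_self();
	l->has_owner = 1;
	l->nesting_level++;
	pthread_mutex_unlock(&l->lock);
}

static void pm_lock_release(struct pm_lock *l)
{
	pthread_mutex_lock(&l->lock);
	if (--l->nesting_level == 0) {
		l->has_owner = 0;
		pthread_cond_signal(&l->wq);  /* wake one exclusive waiter */
	}
	pthread_mutex_unlock(&l->lock);
}

/* Single-threaded demo: the owner may nest; the lock frees on last release. */
static int nesting_demo(void)
{
	struct pm_lock l = { PTHREAD_MUTEX_INITIALIZER,
			     PTHREAD_COND_INITIALIZER };

	pm_lock_acquire(&l);
	pm_lock_acquire(&l);            /* same thread re-enters, no deadlock */
	int nested = (l.nesting_level == 2);
	pm_lock_release(&l);
	pm_lock_release(&l);
	return nested && l.nesting_level == 0 && !l.has_owner;
}
```

The cond_wait/signal pair here corresponds to the `wait_event_exclusive_cmd()` drop-lock/retake-lock dance in the corrected kernel version: the condition is always re-evaluated with the spinlock (here, the mutex) held.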
Re: [PATCH 0/2] blktests: test ANA base support
On Thu, Jul 26, 2018 at 09:35:53AM -0700, James Smart wrote: > make sure this fix has been picked up in your kernel: > http://git.infradead.org/nvme.git/commit/6cdefc6e2ad52170f89a8d0e8b1a1339f91834dc > > deletes were broken otherwise and have the exact trace below. > > -- james Thanks, James, that indeed fixes the problem. I'll go ahead and merge these tests since there's a fix in flight.
[PATCH] readahead: stricter check for bdi io_pages
ondemand_readahead() checks bdi->io_pages to cap the maximum number of
pages that need to be processed. This works until the "readit" section: if
we do an async-only readahead (async size == sync size) and the target is
at the beginning of the window, we expand the window by another
get_next_ra_size() pages. Btrace for large reads shows that the kernel
always issues a doubled-size read at the beginning of processing. Add an
additional check for io_pages in the lower part of the function.

The fix helps devices that hard-limit bio pages and rely on proper
handling of max_hw_read_sectors (e.g. older FusionIO cards). For that
reason it could qualify for stable.

Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Signed-off-by: Markus Stockhausen stockhau...@collogia.de

diff --git a/mm/readahead.c b/mm/readahead.c
index 539bbb6..d4ad4bb 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -380,6 +380,7 @@ static int try_context_readahead(struct address_space *mapping,
 {
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
 	unsigned long max_pages = ra->ra_pages;
+	unsigned long add_pages;
 	pgoff_t prev_offset;
 
 	/*
@@ -469,10 +470,17 @@ static int try_context_readahead(struct address_space *mapping,
 	 * Will this read hit the readahead marker made by itself?
 	 * If so, trigger the readahead marker hit now, and merge
 	 * the resulted next readahead window into the current one.
+	 * Take care of maximum IO pages as above.
 	 */
 	if (offset == ra->start && ra->size == ra->async_size) {
-		ra->async_size = get_next_ra_size(ra, max_pages);
-		ra->size += ra->async_size;
+		add_pages = get_next_ra_size(ra, max_pages);
+		if (ra->size + add_pages <= max_pages) {
+			ra->async_size = add_pages;
+			ra->size += add_pages;
+		} else {
+			ra->size = max_pages;
+			ra->async_size = max_pages >> 1;
+		}
 	}
 
 	return ra_submit(ra, mapping, filp);
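[Editorial note] The capping arithmetic in the readahead patch above can be checked with a small userspace model. The field names mirror struct file_ra_state, and `next_ra_size()` is a simplified stand-in for get_next_ra_size() (assumed here to roughly double the window, capped at max); both are illustrative, not the kernel code.

```c
#include <assert.h>

struct ra_model { unsigned long start, size, async_size; };

/* Simplified stand-in for get_next_ra_size(): double, capped at max. */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	unsigned long next = cur * 2;

	return next < max ? next : max;
}

/* The patched branch: grow the window, but never past max_pages. */
static void expand_window(struct ra_model *ra, unsigned long max_pages)
{
	unsigned long add_pages = next_ra_size(ra->size, max_pages);

	if (ra->size + add_pages <= max_pages) {
		ra->async_size = add_pages;
		ra->size += add_pages;
	} else {
		/* clamp: whole window = max, keep half of it async */
		ra->size = max_pages;
		ra->async_size = max_pages >> 1;
	}
}

/* Helpers exposing the result of one expansion step for checking. */
static unsigned long expanded_size(unsigned long size, unsigned long max)
{
	struct ra_model ra = { 0, size, size };

	expand_window(&ra, max);
	return ra.size;
}

static unsigned long expanded_async(unsigned long size, unsigned long max)
{
	struct ra_model ra = { 0, size, size };

	expand_window(&ra, max);
	return ra.async_size;
}
```

With max_pages = 256: a 64-page async-only window grows to 192 pages (128 async), while a 200-page window would overshoot and is clamped to 256 pages with 128 async — the case the unpatched code got wrong.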
[PATCH] block: reset bi_iter.bi_done after splitting bio
After the bio has been updated to represent the remaining sectors, reset
bi_done so bio_rewind_iter() does not rewind further than it should. This
resolves a bio_integrity_process() failure on reads where the original
request was split.

Fixes: 63573e359d05 ("bio-integrity: Restore original iterator on verify stage")
Signed-off-by: Greg Edwards
---
Simple test case:

  # modprobe scsi_debug.ko dif=1 dix=1 guard=0 dev_size_mb=1024 sector_size=4096
  # echo 512 > /sys/block//queue/max_sectors_kb
  # dd if=/dev/ of=/dev/null bs=1M iflag=direct count=1

 block/bio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/bio.c b/block/bio.c
index 67eff5eddc49..f33a030b9dad 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1866,6 +1866,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
 		bio_integrity_trim(split);
 
 	bio_advance(bio, split->bi_iter.bi_size);
+	bio->bi_iter.bi_done = 0;
 
 	if (bio_flagged(bio, BIO_TRACE_COMPLETION))
 		bio_set_flag(split, BIO_TRACE_COMPLETION);
-- 
2.17.1
Re: [PATCH v5 0/3] Fix silent data corruption in blkdev_direct_IO()
On 7/25/18 2:15 PM, Martin Wilck wrote: > Hello Jens, Ming, Jan, and all others, > > the following patches have been verified by a customer to fix a silent data > corruption which he has been seeing since "72ecad2 block: support a full bio > worth of IO for simplified bdev direct-io". > > The patches are based on our observation that the corruption is only > observed if the __blkdev_direct_IO_simple() code path is executed, > and if that happens, "short writes" are observed in this code path, > which causes a fallback to buffered IO, while the application continues > submitting direct IO requests. > > Following Ming's suggestion, I've changed the patch set such that > bio_iov_iter_get_pages() now always returns as many pages as possible. > This simplifies the patch set a lot. Except for > __blkdev_direct_IO_simple(), all callers of bio_iov_iter_get_pages() > call it in a loop, and expect to get just some pages. Therefore I > have made bio_iov_iter_get_pages() return success if it can pin some > pages, even if MM returns an error on the way. Error is returned only > if no pages at all could be pinned. This also avoids the need for > cleanup code in the helper - callers will submit the bio with the > allocated pages, and clean up later as appropriate. Thanks everyone involved in this, I've queued it up for 4.18. -- Jens Axboe
Re: [GIT PULL] nvme fixes for 4.18
On 7/26/18 7:16 AM, Christoph Hellwig wrote: > Two small fixes each for the FC code and the target. > > > The following changes since commit 8f3ea35929a0806ad1397db99a89ffee0140822a: > > nbd: handle unexpected replies better (2018-07-16 10:14:40 -0600) > > are available in the Git repository at: > > git://git.infradead.org/nvme.git nvme-4.18 > > for you to fetch changes up to 405a7519607e7a364114896264440c0f87b325c0: > > nvmet: only check for filebacking on -ENOTBLK (2018-07-25 13:14:04 +0200) > > > Hannes Reinecke (2): > nvmet: fixup crash on NULL device path > nvmet: only check for filebacking on -ENOTBLK > > James Smart (2): > nvmet-fc: fix target sgl list on large transfers > nvme: if_ready checks to fail io to deleting controller Pulled, thanks. -- Jens Axboe
Re: [PATCH 0/2] blktests: test ANA base support
make sure this fix has been picked up in your kernel:
http://git.infradead.org/nvme.git/commit/6cdefc6e2ad52170f89a8d0e8b1a1339f91834dc

deletes were broken otherwise and have the exact trace below.

-- james

On 7/25/2018 7:01 PM, Chaitanya Kulkarni wrote:
Thanks for the report and correction. I was able to run the test and
reproduce the problem, and I can confirm that it is not that consistent.
As I suspected, the problem is in the nvme disconnect, in the following
call chain:

nvme_sysfs_delete
 nvme_delete_ctrl_sync
  nvme_delete_ctrl_work
   nvme_remove_namespaces
    nvme_ns_remove
     blk_cleanup_queue
      blk_freeze_queue
       blk_freeze_queue_start
        wait_event(q->mq_freeze_wq,
                   percpu_ref_is_zero(&q->q_usage_counter)); <- Stuck here.

blk_cleanup_queue() is where it is blocking, and then the target ctrl is
timing out on keep-alive. It will take some time for me to debug and find
a fix; let's merge this test once we fix this issue. How should we go
about the other tests? Should we merge them and keep this one out in the
nvmeof branch?

Regards,
Chaitanya

From: Omar Sandoval
Sent: Wednesday, July 25, 2018 2:00 PM
To: Chaitanya Kulkarni
Cc: Hannes Reinecke; Omar Sandoval; Christoph Hellwig; Sagi Grimberg; Keith Busch; linux-n...@lists.infradead.org; linux-block@vger.kernel.org
Subject: Re: [PATCH 0/2] blktests: test ANA base support

On Wed, Jul 25, 2018 at 07:27:35PM +, Chaitanya Kulkarni wrote:
Thanks, Omar.
Tests nvme/014 and nvme/015 had a pretty bad typo that I didn't notice last time: dd=/dev/urandom of="/dev/${nvmedev}n1" count=128000 bs=4k That should be dd if=/dev/urandom of="/dev/${nvmedev}n1" count=128000 bs=4k status=none When I fix that (and change the nvme flush call as mentioned before), I sometimes get a hung task: [ 273.80] run blktests nvme/015 at 2018-07-25 13:44:11 [ 273.861950] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [ 273.875014] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:c5e36fdf-8e4d-4c27-be56-da373db583b2. [ 273.877457] nvme nvme1: creating 4 I/O queues. [ 273.879141] nvme nvme1: new ctrl: "blktests-subsystem-1" [ 276.247708] nvme nvme1: using deprecated NVME_IOCTL_IO_CMD ioctl on the char device! [ 276.262835] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1" [ 289.755361] nvmet: ctrl 1 keep-alive timer (15 seconds) expired! [ 289.760579] nvmet: ctrl 1 fatal error occurred! [ 491.095890] INFO: task kworker/u8:0:7 blocked for more than 120 seconds. [ 491.104407] Not tainted 4.18.0-rc6 #18 [ 491.108330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 491.116521] kworker/u8:0 D 0 7 2 0x8000 [ 491.121754] Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] [ 491.129604] Call Trace: [ 491.131611] ? __schedule+0x2a1/0x890 [ 491.135112] ? _raw_spin_unlock_irqrestore+0x20/0x40 [ 491.140542] schedule+0x32/0x90 [ 491.142030] blk_mq_freeze_queue_wait+0x41/0xa0 [ 491.144186] ? wait_woken+0x80/0x80 [ 491.145726] blk_cleanup_queue+0x75/0x160 [ 491.150235] nvme_ns_remove+0xf9/0x130 [nvme_core] [ 491.151910] nvme_remove_namespaces+0x86/0xc0 [nvme_core] [ 491.153127] nvme_delete_ctrl_work+0x4b/0x80 [nvme_core] [ 491.154727] process_one_work+0x18c/0x360 [ 491.155428] worker_thread+0x1c6/0x380 [ 491.156160] ? process_one_work+0x360/0x360 [ 491.157493] kthread+0x112/0x130 [ 491.159119] ? 
kthread_flush_work_fn+0x10/0x10 [ 491.160008] ret_from_fork+0x35/0x40 [ 491.160729] INFO: task nvme:1139 blocked for more than 120 seconds. [ 491.162416] Not tainted 4.18.0-rc6 #18 [ 491.164348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 491.166012] nvme D 0 1139 1072 0x [ 491.167946] Call Trace: [ 491.168459] ? __schedule+0x2a1/0x890 [ 491.169312] schedule+0x32/0x90 [ 491.170180] schedule_timeout+0x311/0x4a0 [ 491.171921] ? kernfs_fop_release+0xa0/0xa0 [ 491.172884] wait_for_common+0x1a0/0x1d0 [ 491.173813] ? wake_up_q+0x70/0x70 [ 491.174410] flush_work+0x10e/0x1c0 [ 491.174991] ? flush_workqueue_prep_pwqs+0x130/0x130 [ 491.176113] nvme_delete_ctrl_sync+0x41/0x50 [nvme_core] [ 491.177969] nvme_sysfs_delete+0x28/0x30 [nvme_core] [ 491.178632] kernfs_fop_write+0x116/0x190 [ 491.179254] __vfs_write+0x36/0x190 [ 491.179812] vfs_write+0xa9/0x190 [ 491.180344] ksys_write+0x4f/0xb0 [ 491.181056] do_syscall_64+0x5b/0x170 [ 491.181583] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 491.182311] RIP: 0033:0x7fc04176b9d4 [ 491.182863] Code: Bad RIP value. [ 491.183650] RSP: 002b:7ffc33bd15a8 EFLAGS: 0246 ORIG_RAX: 0001 [ 491.184622] RAX: ffda RBX: 0004 RCX: 7fc04176b9d4 [ 491.185606] RDX: 0001 RSI: 55884bd0810a RDI: 0004 [ 491.186719] RBP:
[PATCH 0/9] Pending bcache patches for review
Hi folks,

I ask for help to review these patches; some of them have been posted for
months. They have been OK in my testing for a while, but still it would be
great to have more reviewers before these patches go into v4.19. Thanks in
advance.

Coly Li
---
Coly Li (9):
  bcache: do not check return value of debugfs_create_dir()
  bcache: display rate debug parameters to 0 when writeback is not running
  bcache: avoid unnecessary cache prefetch in bch_btree_node_get()
  bcache: add a comment in register_bdev()
  bcache: fix mistaken code comments in struct cache
  bcache: fix mistaken comments in bch_keylist_realloc()
  bcache: add code comments for bset.c
  bcache: initiate bcache_debug to NULL
  bcache: set max writeback rate when I/O request is idle

 drivers/md/bcache/bcache.h    | 16 +++---
 drivers/md/bcache/bset.c      | 63 
 drivers/md/bcache/btree.c     | 14 +++---
 drivers/md/bcache/closure.c   | 13 +++--
 drivers/md/bcache/closure.h   |  4 +-
 drivers/md/bcache/debug.c     | 13 ++---
 drivers/md/bcache/request.c   | 56 -
 drivers/md/bcache/super.c     |  9 +++-
 drivers/md/bcache/sysfs.c     | 37 +-
 drivers/md/bcache/util.c      |  2 +-
 drivers/md/bcache/util.h      |  2 +-
 drivers/md/bcache/writeback.c | 91 +++
 12 files changed, 244 insertions(+), 76 deletions(-)
-- 
2.17.1
[PATCH 4/9] bcache: add a comment in register_bdev()
---
 drivers/md/bcache/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index c7ffa6ef3f82..f517d7d1fa10 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1291,6 +1291,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	pr_info("registered backing device %s", dc->backing_dev_name);
 
 	list_add(&dc->list, &uncached_devices);
+	/* attach to a matched cache set if it exists */
 	list_for_each_entry(c, &bch_cache_sets, list)
 		bch_cached_dev_attach(dc, c, NULL);
-- 
2.17.1
[PATCH 2/9] bcache: display rate debug parameters to 0 when writeback is not running
When writeback is not running, the writeback rate should be 0; any other
value is misleading. And the following dynamic writeback rate debug
parameters should be 0 too: rate, proportional, integral, change.
Otherwise they are misleading when writeback is not running.

Signed-off-by: Coly Li
---
 drivers/md/bcache/sysfs.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 225b15aa0340..3e9d3459a224 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -149,6 +149,7 @@ SHOW(__bch_cached_dev)
 	struct cached_dev *dc = container_of(kobj, struct cached_dev,
 					     disk.kobj);
 	const char *states[] = { "no cache", "clean", "dirty", "inconsistent" };
+	int wb = dc->writeback_running;
 
 #define var(stat)		(dc->stat)
 
@@ -170,7 +171,7 @@ SHOW(__bch_cached_dev)
 	var_printf(writeback_running,	"%i");
 	var_print(writeback_delay);
 	var_print(writeback_percent);
-	sysfs_hprint(writeback_rate,	dc->writeback_rate.rate << 9);
+	sysfs_hprint(writeback_rate,	wb ? dc->writeback_rate.rate << 9 : 0);
 	sysfs_hprint(io_errors,		atomic_read(&dc->io_errors));
 	sysfs_printf(io_error_limit,	"%i", dc->error_limit);
 	sysfs_printf(io_disable,	"%i", dc->io_disable);
@@ -188,15 +189,20 @@ SHOW(__bch_cached_dev)
 		char change[20];
 		s64 next_io;
 
-		bch_hprint(rate,	dc->writeback_rate.rate << 9);
-		bch_hprint(dirty,	bcache_dev_sectors_dirty(&dc->disk) << 9);
-		bch_hprint(target,	dc->writeback_rate_target << 9);
-		bch_hprint(proportional,dc->writeback_rate_proportional << 9);
-		bch_hprint(integral,	dc->writeback_rate_integral_scaled << 9);
-		bch_hprint(change,	dc->writeback_rate_change << 9);
-
-		next_io = div64_s64(dc->writeback_rate.next - local_clock(),
-				    NSEC_PER_MSEC);
+		/*
+		 * Except for dirty and target, other values should
+		 * be 0 if writeback is not running.
+		 */
+		bch_hprint(rate,
+			   wb ? dc->writeback_rate.rate << 9 : 0);
+		bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9);
+		bch_hprint(target, dc->writeback_rate_target << 9);
+		bch_hprint(proportional,
+			   wb ? dc->writeback_rate_proportional << 9 : 0);
+		bch_hprint(integral,
+			   wb ? dc->writeback_rate_integral_scaled << 9 : 0);
+		bch_hprint(change, wb ? dc->writeback_rate_change << 9 : 0);
+		next_io = wb ? div64_s64(dc->writeback_rate.next - local_clock(),
+					 NSEC_PER_MSEC) : 0;
 
 		return sprintf(buf,
 			       "rate:\t\t%s/sec\n"
-- 
2.17.1
[PATCH 3/9] bcache: avoid unnecessary cache prefetch in bch_btree_node_get()
In bch_btree_node_get() the read-in btree node will be partially
prefetched into the L1 cache for the following bset iteration (if there is
one). But if the btree node read failed, the prefetch operations will
waste L1 cache space. This patch checks the result of the read operation
and only does the cache prefetch when the read I/O succeeded.

Signed-off-by: Coly Li
---
 drivers/md/bcache/btree.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 475008fbbaab..c19f7716df88 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1011,6 +1011,13 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
 		BUG_ON(b->level != level);
 	}
 
+	if (btree_node_io_error(b)) {
+		rw_unlock(write, b);
+		return ERR_PTR(-EIO);
+	}
+
+	BUG_ON(!b->written);
+
 	b->parent = parent;
 	b->accessed = 1;
 
@@ -1022,13 +1029,6 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
 	for (; i <= b->keys.nsets; i++)
 		prefetch(b->keys.set[i].data);
 
-	if (btree_node_io_error(b)) {
-		rw_unlock(write, b);
-		return ERR_PTR(-EIO);
-	}
-
-	BUG_ON(!b->written);
-
 	return b;
 }
-- 
2.17.1
[PATCH 6/9] bcache: fix mistaken comments in bch_keylist_realloc()
---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 8eece9ef9f46..91206f329971 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -107,7 +107,7 @@ static int bch_keylist_realloc(struct keylist *l, unsigned u64s,
 	/*
 	 * The journalling code doesn't handle the case where the keys to insert
 	 * is bigger than an empty write: If we just return -ENOMEM here,
-	 * bio_insert() and bio_invalidate() will insert the keys created so far
+	 * bch_data_insert_keys() will insert the keys created so far
 	 * and finish the rest when the keylist is empty.
 	 */
 	if (newsize * sizeof(uint64_t) > block_bytes(c) - sizeof(struct jset))
-- 
2.17.1
[PATCH 8/9] bcache: initiate bcache_debug to NULL
Global variable bcache_debug is first initialized in bch_debug_init(), and
destroyed in bch_debug_exit(). bch_debug_init() is called in bcache_init()
along with many other functions; if one of the earlier calls fails,
bcache_exit() will be called in the failure path. The problem is, if
bcache_init() fails before bch_debug_init() is called, then when
bch_debug_exit() is called from bcache_exit() to destroy the global
variable bcache_debug, at that moment bcache_debug is undefined, and the
test "if (!IS_ERR_OR_NULL(bcache_debug))" might be buggy. This patch
initializes the global variable bcache_debug to NULL, to make the failure
code path predictable.

Signed-off-by: Coly Li
---
 drivers/md/bcache/debug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 57f8f5aeee55..24b0eb65ddec 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -17,7 +17,7 @@
 #include 
 #include 
 
-struct dentry *bcache_debug;
+struct dentry *bcache_debug = NULL;
 
 #ifdef CONFIG_BCACHE_DEBUG
-- 
2.17.1
[PATCH 7/9] bcache: add code comments for bset.c
This patch tries to add code comments in bset.c, to make some tricky code
and design more comprehensible. Most of the information in this patch
comes from a discussion between Kent and me; he offered very informative
details. If there is any mistake in the idea behind the code, no doubt
that is from my misrepresentation.

Signed-off-by: Coly Li
---
 drivers/md/bcache/bset.c | 63 
 1 file changed, 63 insertions(+)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index f3403b45bc28..596c93b44e9b 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -366,6 +366,10 @@ EXPORT_SYMBOL(bch_btree_keys_init);
 
 /* Binary tree stuff for auxiliary search trees */
 
+/*
+ * return array index next to j when doing an in-order traversal
+ * of a binary tree which is stored in a linear array
+ */
 static unsigned inorder_next(unsigned j, unsigned size)
 {
 	if (j * 2 + 1 < size) {
@@ -379,6 +383,10 @@ static unsigned inorder_next(unsigned j, unsigned size)
 	return j;
 }
 
+/*
+ * return array index previous to j when doing an in-order traversal
+ * of a binary tree which is stored in a linear array
+ */
 static unsigned inorder_prev(unsigned j, unsigned size)
 {
 	if (j * 2 < size) {
@@ -421,6 +429,10 @@ static unsigned __to_inorder(unsigned j, unsigned size, unsigned extra)
 	return j;
 }
 
+/*
+ * Return the cacheline index in bset_tree->data, where j is the index
+ * from a linear array which stores the auxiliary binary tree
+ */
 static unsigned to_inorder(unsigned j, struct bset_tree *t)
 {
 	return __to_inorder(j, t->size, t->extra);
@@ -441,6 +453,10 @@ static unsigned __inorder_to_tree(unsigned j, unsigned size, unsigned extra)
 	return j;
 }
 
+/*
+ * Return an index from a linear array which stores the auxiliary binary
+ * tree; j is the cacheline index of t->data.
+ */
 static unsigned inorder_to_tree(unsigned j, struct bset_tree *t)
 {
 	return __inorder_to_tree(j, t->size, t->extra);
@@ -546,6 +562,20 @@ static inline uint64_t shrd128(uint64_t high, uint64_t low, uint8_t shift)
 	return low;
 }
 
+/*
+ * Calculate the mantissa value for struct bkey_float.
+ * If the most significant bit of f->exponent is not set, then
+ *  - f->exponent >> 6 is 0
+ *  - p[0] points to bkey->low
+ *  - p[-1] borrows bits from KEY_INODE() of bkey->high
+ * If the most significant bit of f->exponent is set, then
+ *  - f->exponent >> 6 is 1
+ *  - p[0] points to bits from KEY_INODE() of bkey->high
+ *  - p[-1] points to other bits from KEY_INODE() of
+ *    bkey->high too.
+ * See make_bfloat() to check when the most significant bit of
+ * f->exponent is set or not.
+ */
 static inline unsigned bfloat_mantissa(const struct bkey *k,
 				       struct bkey_float *f)
 {
@@ -570,6 +600,16 @@ static void make_bfloat(struct bset_tree *t, unsigned j)
 	BUG_ON(m < l || m > r);
 	BUG_ON(bkey_next(p) != m);
 
+	/*
+	 * If l and r have different KEY_INODE values (different backing
+	 * devices), f->exponent records how many least significant bits
+	 * are different in the KEY_INODE values and sets the most
+	 * significant bits to 1 (by +64).
+	 * If l and r have the same KEY_INODE value, f->exponent records
+	 * how many bits differ in the least significant bits of bkey->low.
+	 * See bfloat_mantissa() for how the most significant bit of
+	 * f->exponent is used to calculate the bfloat mantissa value.
+	 */
 	if (KEY_INODE(l) != KEY_INODE(r))
 		f->exponent = fls64(KEY_INODE(r) ^ KEY_INODE(l)) + 64;
 	else
@@ -633,6 +673,15 @@ void bch_bset_init_next(struct btree_keys *b, struct bset *i, uint64_t magic)
 }
 EXPORT_SYMBOL(bch_bset_init_next);
 
+/*
+ * Build the auxiliary binary tree 'struct bset_tree *t'; this tree is used
+ * to accelerate bkey search in a btree node (pointed to by bset_tree->data
+ * in memory). After searching in the auxiliary tree by calling
+ * bset_search_tree(), a struct bset_search_iter is returned which indicates
+ * the range [l, r] from bset_tree->data where the searched bkey might be.
+ * Then a following linear comparison does the exact search; see
+ * __bch_bset_search() for how the auxiliary tree is used.
+ */
 void bch_bset_build_written_tree(struct btree_keys *b)
 {
 	struct bset_tree *t = bset_tree_last(b);
@@ -898,6 +947,17 @@ static struct bset_search_iter bset_search_tree(struct bset_tree *t,
 	unsigned inorder, j, n = 1;
 
 	do {
+		/*
+		 * A bit trick here.
+		 * If p < t->size, (int)(p - t->size) is a minus value and
+		 * the most significant bit is set, right shifting 31 bits
+		 * gets 1. If p >= t->size, the most significant bit is
+		 * not set, right shifting 31 bits gets 0.
+		 * So the following 2 lines are equal to
+		 *	if (p >= t->size)
+		 *		p =
[PATCH 1/9] bcache: do not check return value of debugfs_create_dir()
Greg KH suggests that normal code should not care about debugfs.
Therefore, no matter whether debugfs_create_dir() succeeds or fails, it is
unnecessary to check its return value. There are two functions that call
debugfs_create_dir() and check the return value: bch_debug_init() and
closure_debug_init(). This patch changes these two functions from int to
void type, and ignores the return value of debugfs_create_dir(). This
patch does not fix an exact bug; it just makes things work as they should.

Signed-off-by: Coly Li
Suggested-by: Greg Kroah-Hartman
Cc: Kai Krakow
Cc: Kent Overstreet
---
 drivers/md/bcache/bcache.h  |  2 +-
 drivers/md/bcache/closure.c | 13 +
 drivers/md/bcache/closure.h |  4 ++--
 drivers/md/bcache/debug.c   | 11 ++-
 drivers/md/bcache/super.c   |  4 +++-
 5 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 872ef4d67711..a44bd427e5ba 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -1001,7 +1001,7 @@ void bch_open_buckets_free(struct cache_set *);
 int bch_cache_allocator_start(struct cache *ca);
 
 void bch_debug_exit(void);
-int bch_debug_init(struct kobject *);
+void bch_debug_init(struct kobject *);
 void bch_request_exit(void);
 int bch_request_init(void);
 
diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 0e14969182c6..618253683d40 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -199,11 +199,16 @@ static const struct file_operations debug_ops = {
 	.release	= single_release
 };
 
-int __init closure_debug_init(void)
+void __init closure_debug_init(void)
 {
-	closure_debug = debugfs_create_file("closures",
-				0400, bcache_debug, NULL, &debug_ops);
-	return IS_ERR_OR_NULL(closure_debug);
+	if (!IS_ERR_OR_NULL(bcache_debug))
+		/*
+		 * it is unnecessary to check return value of
+		 * debugfs_create_file(), we should not care
+		 * about this.
+		 */
+		closure_debug = debugfs_create_file(
+			"closures", 0400, bcache_debug, NULL, &debug_ops);
 }
 
 #endif
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 71427eb5fdae..7c2c5bc7c88b 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -186,13 +186,13 @@ static inline void closure_sync(struct closure *cl)
 
 #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
 
-int closure_debug_init(void);
+void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline int closure_debug_init(void) { return 0; }
+static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index d030ce3025a6..57f8f5aeee55 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -248,11 +248,12 @@ void bch_debug_exit(void)
 		debugfs_remove_recursive(bcache_debug);
 }
 
-int __init bch_debug_init(struct kobject *kobj)
+void __init bch_debug_init(struct kobject *kobj)
 {
-	if (!IS_ENABLED(CONFIG_DEBUG_FS))
-		return 0;
-
+	/*
+	 * it is unnecessary to check return value of
+	 * debugfs_create_file(), we should not care
+	 * about this.
+	 */
 	bcache_debug = debugfs_create_dir("bcache", NULL);
-	return IS_ERR_OR_NULL(bcache_debug);
 }
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e0a92104ca23..c7ffa6ef3f82 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2345,10 +2345,12 @@ static int __init bcache_init(void)
 		goto err;
 
 	if (bch_request_init() ||
-	    bch_debug_init(bcache_kobj) || closure_debug_init() ||
 	    sysfs_create_files(bcache_kobj, files))
 		goto err;
 
+	bch_debug_init(bcache_kobj);
+	closure_debug_init();
+
 	return 0;
 err:
 	bcache_exit();
-- 
2.17.1
[PATCH 9/9] bcache: set max writeback rate when I/O request is idle
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce a performance regression, because multiple faster
writeback threads of the idle bcache devices will compete for the btree
level locks with the bcache device which has I/O requests coming.

This patch fixes the above issue by only permitting fast writeback when
all bcache devices attached to the cache set are idle. If one of the
bcache devices has a new I/O request coming, all writeback throughput is
minimized immediately, and the PI controller __update_writeback_rate()
decides the upcoming writeback rate for each bcache device.

Also, when all bcache devices are idle, limiting the writeback rate to a
small number wastes throughput, especially when the backing devices are
slower non-rotational devices (e.g. SATA SSD). This patch sets a max
writeback rate for each backing device if the whole cache set is idle.
A faster writeback rate in idle time means new I/Os may have more
available space for dirty data, and people may observe a better write
performance then.

Please note bcache may change its cache mode at run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on cache.
Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle")
Cc: sta...@vger.kernel.org #4.16+
Signed-off-by: Coly Li
Tested-by: Kai Krakow
Tested-by: Stefan Priebe
Cc: Michael Lyle
---
 drivers/md/bcache/bcache.h    | 10 ++--
 drivers/md/bcache/request.c   | 54 -
 drivers/md/bcache/super.c     |  4 ++
 drivers/md/bcache/sysfs.c     | 15 --
 drivers/md/bcache/util.c      |  2 +-
 drivers/md/bcache/util.h      |  2 +-
 drivers/md/bcache/writeback.c | 91 +++
 7 files changed, 134 insertions(+), 44 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5f7082aab1b0..97489573dedc 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -328,13 +328,6 @@ struct cached_dev {
 	 */
 	atomic_t		has_dirty;
 
-	/*
-	 * Set to zero by things that touch the backing volume-- except
-	 * writeback. Incremented by writeback. Used to determine when to
-	 * accelerate idle writeback.
-	 */
-	atomic_t		backing_idle;
-
 	struct bch_ratelimit	writeback_rate;
 	struct delayed_work	writeback_rate_update;
 
@@ -515,6 +508,8 @@ struct cache_set {
 	struct cache_accounting accounting;
 
 	unsigned long		flags;
+	atomic_t		idle_counter;
+	atomic_t		at_max_writeback_rate;
 
 	struct cache_sb		sb;
 
@@ -524,6 +519,7 @@ struct cache_set {
 
 	struct bcache_device	**devices;
 	unsigned		devices_max_used;
+	atomic_t		attached_dev_nr;
 	struct list_head	cached_devs;
 	uint64_t		cached_dev_sectors;
 	atomic_long_t		flash_dev_dirty_sectors;

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 91206f329971..86a977c2a176 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1105,6 +1105,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
 	generic_make_request(bio);
 }
 
+static void quit_max_writeback_rate(struct cache_set *c,
+				    struct cached_dev *this_dc)
+{
+	int i;
+	struct bcache_device *d;
+	struct cached_dev *dc;
+
+	/*
+	 * mutex bch_register_lock may compete with other parallel requesters,
+	 * or attach/detach operations on other backing devices. Waiting for
+	 * the mutex lock may increase I/O request latency for seconds or more.
+	 * To avoid such a situation, if mutex_trylock() failed, only the
+	 * writeback rate of the current cached device is set to 1, and
+	 * __update_write_back() will decide the writeback rate of other
+	 * cached devices (remember now c->idle_counter is 0 already).
+	 */
+	if (mutex_trylock(&bch_register_lock)) {
+		for (i = 0; i < c->devices_max_used; i++) {
+			if (!c->devices[i])
+				continue;
+
+			if (UUID_FLASH_ONLY(&c->uuids[i]))
+				continue;
+
+			d = c->devices[i];
+			dc = container_of(d, struct cached_dev, disk);
+			/*
+			 * set writeback rate to default minimum value,
+			 * then let update_writeback_rate() decide the
+			 * upcoming rate.
+			 */
+
[PATCH 5/9] bcache: fix mistaken code comments in struct cache
---
 drivers/md/bcache/bcache.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index a44bd427e5ba..5f7082aab1b0 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -423,8 +423,8 @@ struct cache {
 	/*
 	 * When allocating new buckets, prio_write() gets first dibs - since we
 	 * may not be allocate at all without writing priorities and gens.
-	 * prio_buckets[] contains the last buckets we wrote priorities to (so
-	 * gc can mark them as metadata), prio_next[] contains the buckets
+	 * prio_last_buckets[] contains the last buckets we wrote priorities to (so
+	 * gc can mark them as metadata), prio_buckets[] contains the buckets
 	 * allocated for the next prio write.
 	 */
 	uint64_t		*prio_buckets;
-- 
2.17.1
[GIT PULL] nvme fixes for 4.18
Two small fixes each for the FC code and the target.

The following changes since commit 8f3ea35929a0806ad1397db99a89ffee0140822a:

  nbd: handle unexpected replies better (2018-07-16 10:14:40 -0600)

are available in the Git repository at:

  git://git.infradead.org/nvme.git nvme-4.18

for you to fetch changes up to 405a7519607e7a364114896264440c0f87b325c0:

  nvmet: only check for filebacking on -ENOTBLK (2018-07-25 13:14:04 +0200)

Hannes Reinecke (2):
      nvmet: fixup crash on NULL device path
      nvmet: only check for filebacking on -ENOTBLK

James Smart (2):
      nvmet-fc: fix target sgl list on large transfers
      nvme: if_ready checks to fail io to deleting controller

 drivers/nvme/host/fabrics.c    | 10 +++---
 drivers/nvme/host/fabrics.h    |  3 ++-
 drivers/nvme/host/fc.c         |  2 +-
 drivers/nvme/host/rdma.c       |  2 +-
 drivers/nvme/target/configfs.c |  9 +++--
 drivers/nvme/target/core.c     |  2 +-
 drivers/nvme/target/fc.c       | 44 +-
 drivers/nvme/target/loop.c     |  2 +-
 8 files changed, 55 insertions(+), 19 deletions(-)
Re: [PATCH] blktests: Add '--outdir' to store results in a different directory
On 07/25/2018 11:23 PM, Omar Sandoval wrote:
> On Tue, Jul 17, 2018 at 03:27:50PM +0200, Hannes Reinecke wrote:
>> Adding an option '--outdir' to store results in a different
>> directory so as not to clutter the git repository itself.
>>
>> Signed-off-by: Hannes Reinecke
>> ---
>>  check | 14 ++
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/check b/check
>> index a635531..42d07f8 100755
>> --- a/check
>> +++ b/check
>> @@ -334,7 +334,7 @@ _call_test() {
>>      fi
>>
>>      trap _cleanup EXIT
>> -    if ! TMPDIR="$(mktemp --tmpdir -p "$PWD/results" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>> +    if ! TMPDIR="$(mktemp --tmpdir -p "$RESULTS_DIR" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>>          return
>>      fi
>>
>> @@ -415,7 +415,7 @@ _run_test() {
>>          return 0
>>      fi
>>
>> -    RESULTS_DIR="results/nodev"
>> +    RESULTS_DIR="${OUT_DIR}/results/nodev"
>>      _call_test test
>>  else
>>      if [[ ${#TEST_DEVS[@]} -eq 0 ]]; then
>> @@ -434,7 +434,7 @@ _run_test() {
>>              _output_notrun "$TEST_NAME => $(basename "$TEST_DEV")"
>>              continue
>>          fi
>> -        RESULTS_DIR="results/$(basename "$TEST_DEV")"
>> +        RESULTS_DIR="${OUT_DIR}/results/$(basename "$TEST_DEV")"
>>          if ! _call_test test_device; then
>>              ret=1
>>          fi
>> @@ -567,6 +567,7 @@ Test runs:
>>      tests to run
>>
>>  Miscellaneous:
>> +  -o, --outdir=OUTDIR  write results into the specified directory
>>    -h, --help           display this help message and exit"
>>
>>  case "$1" in
>> @@ -581,12 +582,13 @@ Miscellaneous:
>>  esac
>>  }
>>
>> -if ! TEMP=$(getopt -o 'dq::x:h' --long 'quick::,exclude:,help' -n "$0" -- "$@"); then
>> +if ! TEMP=$(getopt -o 'dq::o:x:h' --long 'quick::,exclude:,outdir:,help' -n "$0" -- "$@"); then
>>      exit 1
>>  fi
>>
>>  eval set -- "$TEMP"
>>  unset TEMP
>> +OUT_DIR="."
>
> This doesn't allow setting it from the config file. How about this?
>
> diff --git a/check b/check
> index 5f4461f..5e99415 100755
> --- a/check
> +++ b/check
> @@ -313,7 +313,7 @@ _call_test() {
>      local test_func="$1"
>      local seqres="${RESULTS_DIR}/${TEST_NAME}"
>      # shellcheck disable=SC2034
> -    FULL="$PWD/${seqres}.full"
> +    FULL="${seqres}.full"
>      declare -A TEST_DEV_QUEUE_SAVED
>
>      _read_last_test_run
> @@ -334,7 +334,7 @@ _call_test() {
>      fi
>
>      trap _cleanup EXIT
> -    if ! TMPDIR="$(mktemp --tmpdir -p "$PWD/results" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
> +    if ! TMPDIR="$(mktemp --tmpdir -p "$OUTPUT" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>          return
>      fi
>
> @@ -415,7 +415,7 @@ _run_test() {
>          return 0
>      fi
>
> -    RESULTS_DIR="results/nodev"
> +    RESULTS_DIR="$OUTPUT/nodev"
>      _call_test test
>  else
>      if [[ ${#TEST_DEVS[@]} -eq 0 ]]; then
> @@ -434,7 +434,7 @@ _run_test() {
>              _output_notrun "$TEST_NAME => $(basename "$TEST_DEV")"
>              continue
>          fi
> -        RESULTS_DIR="results/$(basename "$TEST_DEV")"
> +        RESULTS_DIR="$OUTPUT/$(basename "$TEST_DEV")"
>          if ! _call_test test_device; then
>              ret=1
>          fi
> @@ -559,6 +559,9 @@ Test runs:
>    -d, --device-only   only run tests which use a test device from the
>                        TEST_DEVS config setting
>
> +  -o, --output=DIR    output results to the given directory (the default is
> +                      ./results)
> +
>    -q, --quick=SECONDS do a quick run (only run quick tests and limit the
>                        runtime of longer tests to the given timeout,
>                        defaulting to 30 seconds)
> @@ -581,7 +584,7 @@ Miscellaneous:
>  esac
>  }
>
> -if ! TEMP=$(getopt -o 'dq::x:h' --long 'quick::,exclude:,help' -n "$0" -- "$@"); then
> +if ! TEMP=$(getopt -o 'do:q::x:h' --long 'quick::,exclude:,output:,help' -n "$0" -- "$@"); then
>      exit 1
>  fi
>
> @@ -596,6 +599,7 @@ fi
>  # Default configuration.
>  : "${DEVICE_ONLY:=0}"
>  : "${QUICK_RUN:=0}"
> +: "${OUTPUT:=results}"
>  if [[ -v EXCLUDE ]] && ! declare -p EXCLUDE | grep -q '^declare -a'; then
>      # If EXCLUDE was not defined as an array, convert it to one.
>      # shellcheck disable=SC2190,SC2206
> @@ -617,6 +621,10 @@ while true; do
>          DEVICE_ONLY=1
>          shift
>          ;;
> +    '-o'|'--output')
> +        OUTPUT="$2"
> +        shift 2
> +
Re: [PATCH v3] bcache: set max writeback rate when I/O request is idle
On 2018/7/26 7:49 PM, Stefan Priebe - Profihost AG wrote:
> Am 26.07.2018 um 12:42 schrieb Coly Li:
>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>> allows the writeback rate to be faster if there is no I/O request on a
>> bcache device. It works well if there is only one bcache device attached
>> to the cache set. If there are many bcache devices attached to a cache
>> set, it may introduce a performance regression, because multiple faster
>> writeback threads of the idle bcache devices will compete for the btree
>> level locks with the bcache device which has I/O requests coming.
>>
>> This patch fixes the above issue by only permitting fast writeback when
>> all bcache devices attached to the cache set are idle. If one of the
>> bcache devices has a new I/O request coming, all writeback throughput is
>> minimized immediately, and the PI controller __update_writeback_rate()
>> decides the upcoming writeback rate for each bcache device.
>>
>> Also, when all bcache devices are idle, limiting the writeback rate to a
>> small number wastes throughput, especially when the backing devices are
>> slower non-rotational devices (e.g. SATA SSD). This patch sets a max
>> writeback rate for each backing device if the whole cache set is idle.
>> A faster writeback rate in idle time means new I/Os may have more
>> available space for dirty data, and people may observe a better write
>> performance then.
>>
>> Please note bcache may change its cache mode at run time, and this patch
>> still works if the cache mode is switched from writeback mode and there
>> is still dirty data on cache.
>
> Tested-by: Stefan Priebe
>
> Working fine now.
>

Great news, thanks !

Coly Li

>> Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>> Cc: sta...@vger.kernel.org #4.16+
>> Signed-off-by: Coly Li
>> Tested-by: Kai Krakow
>> Cc: Michael Lyle
>> Cc: Stefan Priebe
>> ---
>> Changelog:
>> v3, Do not acquire bch_register_lock in set_at_max_writeback_rate().
>> v2, Fix a deadlock reported by Stefan Priebe.
>> v1, Initial version.
>>
>>  drivers/md/bcache/bcache.h    | 10 ++--
>>  drivers/md/bcache/request.c   | 54 -
>>  drivers/md/bcache/super.c     |  4 ++
>>  drivers/md/bcache/sysfs.c     | 14 --
>>  drivers/md/bcache/util.c      |  2 +-
>>  drivers/md/bcache/util.h      |  2 +-
>>  drivers/md/bcache/writeback.c | 91 +++
>>  7 files changed, 133 insertions(+), 44 deletions(-)

[snipped]
Re: [PATCH v3] bcache: set max writeback rate when I/O request is idle
Am 26.07.2018 um 12:42 schrieb Coly Li:
> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> allows the writeback rate to be faster if there is no I/O request on a
> bcache device. It works well if there is only one bcache device attached
> to the cache set. If there are many bcache devices attached to a cache
> set, it may introduce a performance regression, because multiple faster
> writeback threads of the idle bcache devices will compete for the btree
> level locks with the bcache device which has I/O requests coming.
>
> This patch fixes the above issue by only permitting fast writeback when
> all bcache devices attached to the cache set are idle. If one of the
> bcache devices has a new I/O request coming, all writeback throughput is
> minimized immediately, and the PI controller __update_writeback_rate()
> decides the upcoming writeback rate for each bcache device.
>
> Also, when all bcache devices are idle, limiting the writeback rate to a
> small number wastes throughput, especially when the backing devices are
> slower non-rotational devices (e.g. SATA SSD). This patch sets a max
> writeback rate for each backing device if the whole cache set is idle.
> A faster writeback rate in idle time means new I/Os may have more
> available space for dirty data, and people may observe a better write
> performance then.
>
> Please note bcache may change its cache mode at run time, and this patch
> still works if the cache mode is switched from writeback mode and there
> is still dirty data on cache.

Tested-by: Stefan Priebe

Working fine now.

> Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> Cc: sta...@vger.kernel.org #4.16+
> Signed-off-by: Coly Li
> Tested-by: Kai Krakow
> Cc: Michael Lyle
> Cc: Stefan Priebe
> ---
> Changelog:
> v3, Do not acquire bch_register_lock in set_at_max_writeback_rate().
> v2, Fix a deadlock reported by Stefan Priebe.
> v1, Initial version.
>
>  drivers/md/bcache/bcache.h    | 10 ++--
>  drivers/md/bcache/request.c   | 54 -
>  drivers/md/bcache/super.c     |  4 ++
>  drivers/md/bcache/sysfs.c     | 14 --
>  drivers/md/bcache/util.c      |  2 +-
>  drivers/md/bcache/util.h      |  2 +-
>  drivers/md/bcache/writeback.c | 91 +++
>  7 files changed, 133 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 872ef4d67711..13f908be42ba 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -328,13 +328,6 @@ struct cached_dev {
>  	 */
>  	atomic_t		has_dirty;
>
> -	/*
> -	 * Set to zero by things that touch the backing volume-- except
> -	 * writeback. Incremented by writeback. Used to determine when to
> -	 * accelerate idle writeback.
> -	 */
> -	atomic_t		backing_idle;
> -
>  	struct bch_ratelimit	writeback_rate;
>  	struct delayed_work	writeback_rate_update;
>
> @@ -515,6 +508,8 @@ struct cache_set {
>  	struct cache_accounting accounting;
>
>  	unsigned long		flags;
> +	atomic_t		idle_counter;
> +	atomic_t		at_max_writeback_rate;
>
>  	struct cache_sb		sb;
>
> @@ -524,6 +519,7 @@ struct cache_set {
>
>  	struct bcache_device	**devices;
>  	unsigned		devices_max_used;
> +	atomic_t		attached_dev_nr;
>  	struct list_head	cached_devs;
>  	uint64_t		cached_dev_sectors;
>  	atomic_long_t		flash_dev_dirty_sectors;
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 8eece9ef9f46..26f97acde403 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1105,6 +1105,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
>  	generic_make_request(bio);
>  }
>
> +static void quit_max_writeback_rate(struct cache_set *c,
> +				    struct cached_dev *this_dc)
> +{
> +	int i;
> +	struct bcache_device *d;
> +	struct cached_dev *dc;
> +
> +	/*
> +	 * mutex bch_register_lock may compete with other parallel requesters,
> +	 * or attach/detach operations on other backing devices. Waiting for
> +	 * the mutex lock may increase I/O request latency for seconds or more.
> +	 * To avoid such a situation, if mutex_trylock() failed, only the
> +	 * writeback rate of the current cached device is set to 1, and
> +	 * __update_write_back() will decide the writeback rate of other
> +	 * cached devices (remember now c->idle_counter is 0 already).
> +	 */
> +	if (mutex_trylock(&bch_register_lock)) {
> +		for (i = 0; i < c->devices_max_used; i++) {
> +			if (!c->devices[i])
> +				continue;
> +
> +			if (UUID_FLASH_ONLY(&c->uuids[i]))
> +
Re: [PATCH for v4.18] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER
I'm OK with the rename in general but not sure about doing it this late in the cycle. Anyways, Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH v3] bcache: set max writeback rate when I/O request is idle
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce a performance regression, because multiple faster
writeback threads of the idle bcache devices will compete for the btree
level locks with the bcache device which has I/O requests coming.

This patch fixes the above issue by only permitting fast writeback when
all bcache devices attached to the cache set are idle. If one of the
bcache devices has a new I/O request coming, all writeback throughput is
minimized immediately, and the PI controller __update_writeback_rate()
decides the upcoming writeback rate for each bcache device.

Also, when all bcache devices are idle, limiting the writeback rate to a
small number wastes throughput, especially when the backing devices are
slower non-rotational devices (e.g. SATA SSD). This patch sets a max
writeback rate for each backing device if the whole cache set is idle.
A faster writeback rate in idle time means new I/Os may have more
available space for dirty data, and people may observe a better write
performance then.

Please note bcache may change its cache mode at run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on cache.

Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle")
Cc: sta...@vger.kernel.org #4.16+
Signed-off-by: Coly Li
Tested-by: Kai Krakow
Cc: Michael Lyle
Cc: Stefan Priebe
---
Changelog:
v3, Do not acquire bch_register_lock in set_at_max_writeback_rate().
v2, Fix a deadlock reported by Stefan Priebe.
v1, Initial version.

 drivers/md/bcache/bcache.h    | 10 ++--
 drivers/md/bcache/request.c   | 54 -
 drivers/md/bcache/super.c     |  4 ++
 drivers/md/bcache/sysfs.c     | 14 --
 drivers/md/bcache/util.c      |  2 +-
 drivers/md/bcache/util.h      |  2 +-
 drivers/md/bcache/writeback.c | 91 +++
 7 files changed, 133 insertions(+), 44 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 872ef4d67711..13f908be42ba 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -328,13 +328,6 @@ struct cached_dev {
 	 */
 	atomic_t		has_dirty;
 
-	/*
-	 * Set to zero by things that touch the backing volume-- except
-	 * writeback. Incremented by writeback. Used to determine when to
-	 * accelerate idle writeback.
-	 */
-	atomic_t		backing_idle;
-
 	struct bch_ratelimit	writeback_rate;
 	struct delayed_work	writeback_rate_update;
 
@@ -515,6 +508,8 @@ struct cache_set {
 	struct cache_accounting accounting;
 
 	unsigned long		flags;
+	atomic_t		idle_counter;
+	atomic_t		at_max_writeback_rate;
 
 	struct cache_sb		sb;
 
@@ -524,6 +519,7 @@ struct cache_set {
 
 	struct bcache_device	**devices;
 	unsigned		devices_max_used;
+	atomic_t		attached_dev_nr;
 	struct list_head	cached_devs;
 	uint64_t		cached_dev_sectors;
 	atomic_long_t		flash_dev_dirty_sectors;

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 8eece9ef9f46..26f97acde403 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -1105,6 +1105,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
 	generic_make_request(bio);
 }
 
+static void quit_max_writeback_rate(struct cache_set *c,
+				    struct cached_dev *this_dc)
+{
+	int i;
+	struct bcache_device *d;
+	struct cached_dev *dc;
+
+	/*
+	 * mutex bch_register_lock may compete with other parallel requesters,
+	 * or attach/detach operations on other backing devices. Waiting for
+	 * the mutex lock may increase I/O request latency for seconds or more.
+	 * To avoid such a situation, if mutex_trylock() failed, only the
+	 * writeback rate of the current cached device is set to 1, and
+	 * __update_write_back() will decide the writeback rate of other
+	 * cached devices (remember now c->idle_counter is 0 already).
+	 */
+	if (mutex_trylock(&bch_register_lock)) {
+		for (i = 0; i < c->devices_max_used; i++) {
+			if (!c->devices[i])
+				continue;
+
+			if (UUID_FLASH_ONLY(&c->uuids[i]))
+				continue;
+
+			d = c->devices[i];
+			dc = container_of(d, struct cached_dev, disk);
+			/*
+			 * set writeback rate to default minimum value,
+
Re: [PATCH v5 3/3] block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
Both the changelog and the comments should use up all horizontal space (73 and 80 chars respectively), but that can be fixed up when applied.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig
Re: [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case
On Wed, Jul 25, 2018 at 11:15:08PM +0200, Martin Wilck wrote: > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > simplified bdev direct-io") > Reviewed-by: Ming Lei > Signed-off-by: Martin Wilck I'm pretty sure I already ACKed this last time, but here we go again: Reviewed-by: Christoph Hellwig
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
On 07/26/2018 03:52 PM, jianchao.wang wrote:
> In addition, .runtime_suspend is invoked under spinlock and irq-disabled.
> So sleep is forbidden here.
> Please refer to rpm_suspend:
>
> * This function must be called under dev->power.lock with interrupts disabled

__rpm_callback will unlock the spinlock before invoking the callback. Sorry for the noise.

Thanks
Jianchao
Re: [PATCH v2 4/5] block, scsi: Rework runtime power management
On 07/26/2018 10:45 AM, jianchao.wang wrote:
> Hi Bart
>
> On 07/26/2018 06:26 AM, Bart Van Assche wrote:
>> @@ -102,9 +109,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>  		return ret;
>>
>>  	blk_pm_runtime_lock(q);
>> +	blk_set_preempt_only(q);
>
> We only stop non-RQF_PM requests from entering when RPM_SUSPENDING and RPM_RESUMING.
> blk_pre_runtime_suspend should only _check_ whether runtime suspend is allowed.
> So we should not set preempt_only here.
>
>> +	percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);

In addition, .runtime_suspend is invoked under spinlock and irq-disabled, so sleep is forbidden here. Please refer to rpm_suspend:

 * This function must be called under dev->power.lock with interrupts disabled

Thanks
Jianchao

>>
>>  	spin_lock_irq(q->queue_lock);
>> -	if (q->nr_pending) {
>> +	if (!percpu_ref_is_zero(&q->q_usage_counter)) {
>>  		ret = -EBUSY;
>>  		pm_runtime_mark_last_busy(q->dev);
>>  	} else {
>> @@ -112,6 +121,7 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>  	}
>>  	spin_unlock_irq(q->queue_lock);
>>
>> +	percpu_ref_switch_to_percpu(&q->q_usage_counter);
>>  	blk_pm_runtime_unlock(q);
>
Re: [PATCH v2] bcache: set max writeback rate when I/O request is idle
On 2018/7/26 1:44 PM, Stefan Priebe - Profihost AG wrote:
>> Am 25.07.2018 um 08:16 schrieb Stefan Priebe - Profihost AG:
>>
>>> Am 24.07.2018 um 18:36 schrieb Coly Li:
On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote:
> Am 24.07.2018 um 10:28 schrieb Coly Li:
>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote:
>>> Am 24.07.2018 um 06:03 schrieb Coly Li:
>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>> allows the writeback rate to be faster if there is no I/O request on a
>>> bcache device. It works well if there is only one bcache device attached
>>> to the cache set. If there are many bcache devices attached to a cache
>>> set, it may introduce a performance regression, because multiple faster
>>> writeback threads of the idle bcache devices will compete for the btree
>>> level locks with the bcache device which has I/O requests coming.
>>>
>>> This patch fixes the above issue by only permitting fast writeback when
>>> all bcache devices attached to the cache set are idle. If one of the
>>> bcache devices has a new I/O request coming, all writeback throughput
>>> is minimized immediately, and the PI controller __update_writeback_rate()
>>> decides the upcoming writeback rate for each bcache device.
>>>
>>> Also, when all bcache devices are idle, limiting the writeback rate to a
>>> small number wastes throughput, especially when the backing devices are
>>> slower non-rotational devices (e.g. SATA SSD). This patch sets a max
>>> writeback rate for each backing device if the whole cache set is idle.
>>> A faster writeback rate in idle time means new I/Os may have more
>>> available space for dirty data, and people may observe a better write
>>> performance then.
>>>
>>> Please note bcache may change its cache mode in run time, and this patch
>>> still works if the cache mode is switched from writeback mode and there
>>> is still dirty data on cache.
>>>
>>> Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>> Cc: sta...@vger.kernel.org #4.16+
>>> Signed-off-by: Coly Li
>>> Tested-by: Kai Krakow
>>> Cc: Michael Lyle
>>> Cc: Stefan Priebe
>>
>> Hi Coly,
>>
>> I'm still experiencing the same bug as yesterday: while rebooting, every
>> two times I get a deadlock in cache_set_free.
>>
>> Sadly it's so late in the shutdown process that netconsole is already
>> unloaded, or at least I get no messages from when it happens. The traces
>> look the same as yesterday.
>
> Hi Stefan,
>
> Do you use the v2 patch on latest SLE15 kernel source?

Yes - I use the kernel code from here:
https://github.com/openSUSE/kernel-source/commits/SLE15

> Let me try to reproduce it on my machine. I assume when you reboot, the
> cache is still dirty and not cleaned up, am I right?

Yes, with around 15GB of dirty data - ceph is running on top of it and
generated many random writes.

> And which file system do you mount on top of the bcache device?

XFS

>>>
>>> Hi Stefan,
>>>
>>> Thanks for the information. I managed to reproduce a similar deadlock on
>>> my small box. Let me see how to fix it and post a new version ASAP :-)
>>
>> Great!
>
> Any news on this already?

I am testing the new version, and it also requires other patches I posted
earlier for 4.19 (which get rid of bch_register_lock in the writeback
thread). Hope I can make it today :-)

Coly Li
Re: [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case
On 07/25/2018 11:15 PM, Martin Wilck wrote: > Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for > simplified bdev direct-io") > Reviewed-by: Ming Lei > Signed-off-by: Martin Wilck > --- > fs/block_dev.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)