Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 4:52 PM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote: >> On 1/19/18 9:37 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: On 1/19/18 9:26 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: >> On 1/19/18 9:05 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: On 1/19/18 8:40 AM, Ming Lei wrote: Where does the dm STS_RESOURCE error usually come from - what exact resource are we running out of? >>> >>> It is from blk_get_request(underlying queue), see >>> multipath_clone_and_map(). >> >> That's what I thought. So for a low queue depth underlying queue, >> it's >> quite possible that this situation can happen. Two potential >> solutions >> I see: >> >> 1) As described earlier in this thread, having a mechanism for being >> notified when the scarce resource becomes available. It would not >> be hard to tap into the existing sbitmap wait queue for that. >> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >> allocation. I haven't read the dm code to know if this is a >> possibility or not. >> >> I'd probably prefer #1. It's a classic case of trying to get the >> request, and if it fails, add ourselves to the sbitmap tag wait >> queue head, retry, and bail if that also fails. Connecting the >> scarce resource and the consumer is the only way to really fix >> this, without bogus arbitrary delays. > > Right, as I have replied to Bart, using mod_delayed_work_on() with > returning BLK_STS_NO_DEV_RESOURCE (or sort of name) for the scarce > resource should fix this issue. It'll fix the forever stall, but it won't really fix it, as we'll slow down the dm device by some random amount. A simple test case would be to have a null_blk device with a queue depth of one, and dm on top of that. Start a fio job that runs two jobs: one that does IO to the underlying device, and one that does IO to the dm device.
If the job on the dm device runs substantially slower than the one to the underlying device, then the problem isn't really fixed. >>> >>> I remembered that I tried this test on scsi-debug & dm-mpath over >>> scsi-debug, >>> seems not observed this issue, could you explain a bit why IO over >>> dm-mpath >>> may be slower? Because both two IO contexts call same get_request(), and >>> in theory dm-mpath should be a bit quicker since it uses direct issue >>> for >>> underlying queue, without io scheduler involved. >> >> Because if you lose the race for getting the request, you'll have some >> arbitrary delay before trying again, potentially. Compared to the direct > > But the restart still works, one request is completed, then the queue > is return immediately because we use mod_delayed_work_on(0), so looks > no such issue. There are no pending requests for this case, nothing to restart the queue. When you fail that blk_get_request(), you are idle, nothing is pending. >>> >>> I think we needn't worry about that, once a device is attached to >>> dm-rq, it can't be mounted any more, and usually user don't use the device >>> directly and by dm-mpath at the same time. >> >> Here's an example of that, using my current block tree (merged into >> master). The setup is dm-mpath on top of null_blk, the latter having >> just a single request. Both are mq devices. >> >> Fio direct 4k random reads on dm_mq: ~250K iops >> >> Start dd on underlying device (or partition on same device), just doing >> sequential reads. >> >> Fio direct 4k random reads on dm_mq with dd running: 9 iops >> >> No schedulers involved. >> >> https://i.imgur.com/WTDnnwE.gif > > If null_blk's timer mode is used with a bit delay introduced, I guess > the effect from direct access to underlying queue shouldn't be so > serious. But it still won't be good as direct access. Doesn't matter if it's used at the default of 10usec completion latency, or inline complete. Same result, I ran both. 
> Another way may be to introduce a variant of blk_get_request(), such as > blk_get_request_with_notify(), then pass the current dm-rq's hctx to > it, and use the tag's waitqueue to handle that. But the change could be > a bit big. Yes, that's exactly the solution I suggested both yesterday and today. -- Jens Axboe
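[Editor's note] The try/register/retry/bail pattern Jens describes (which Ming's blk_get_request_with_notify() idea also amounts to) can be sketched as a userspace toy model. This is plain C with invented names, standing in for the kernel's sbitmap tag pool; it is not the actual blk-mq API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the tag pool behind blk_get_request(); the names
 * here are illustrative, not the kernel API. */
struct tag_pool {
    int free_tags;
    int waiters;   /* stands in for entries on the sbitmap wait queue */
};

/* Try once, without blocking. */
static bool pool_try_get(struct tag_pool *p)
{
    if (p->free_tags > 0) {
        p->free_tags--;
        return true;
    }
    return false;
}

/* The pattern described above: try to get a tag; on failure add
 * ourselves to the wait queue, retry once to close the race with a
 * concurrent free, and bail if that also fails.  The wait-queue entry
 * is what reruns the queue when a tag frees up, instead of an
 * arbitrary requeue delay. */
static bool pool_get_or_wait(struct tag_pool *p)
{
    if (pool_try_get(p))
        return true;
    p->waiters++;              /* conceptually: hook the wait queue */
    if (pool_try_get(p)) {     /* retry after registering */
        p->waiters--;
        return true;
    }
    return false;              /* bail; the wakeup will retry for us */
}

/* Freeing a tag wakes one waiter, in the spirit of the sbitmap
 * wait-queue wakeup. */
static void pool_put(struct tag_pool *p)
{
    p->free_tags++;
    if (p->waiters > 0)
        p->waiters--;          /* conceptually: wake + rerun the queue */
}
```

The retry after registering matters: without it, a tag freed between the first failed attempt and the wait-queue hookup would never generate a wakeup, and the consumer could stall forever.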
Re: [PATCH v2] bdi: show error log when fail to create bdi debugfs entry
2018-01-20 3:54 GMT+08:00 Jens Axboe: > On 1/19/18 10:36 AM, weiping zhang wrote: >> 2018-01-11 0:36 GMT+08:00 weiping zhang : >>> creation of the bdi debugfs dir/file may fail, add an error log here. >>> >>> Signed-off-by: weiping zhang >>> --- >>> V1->V2: >>> fix indentation and make log message more clear >>> >>> mm/backing-dev.c | 4 +++- >>> 1 file changed, 3 insertions(+), 1 deletion(-) >>> >>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c >>> index b5f940c..0a49665 100644 >>> --- a/mm/backing-dev.c >>> +++ b/mm/backing-dev.c >>> @@ -885,7 +885,9 @@ int bdi_register_va(struct backing_dev_info *bdi, const >>> char *fmt, va_list args) >>> cgwb_bdi_register(bdi); >>> bdi->dev = dev; >>> >>> - bdi_debug_register(bdi, dev_name(dev)); >>> + if (bdi_debug_register(bdi, dev_name(dev))) >>> + pr_warn("blkdev %s: creation of bdi debugfs entries >>> failed.\n", >>> + dev_name(dev)); >>> set_bit(WB_registered, &bdi->wb.state); >>> >>> spin_lock_bh(&bdi_lock); >>> -- >> >> Hi Jens, >> >> mdadm has no permission to create the debugfs entry if SELinux is enabled on >> Fedora and CentOS, >> and we have reverted 6d0e4827b72 Revert "bdi: add error handle for >> bdi_debug_register", that is to say >> bdi debugfs is not a key component of the block device, this patch just >> adds a warning log. > > Have we fixed the case where we know it will trigger? The reason is that mdadm has no permission to create a dir/file under /sys/kernel/debug/, so I think we can solve it in two possible ways. 1. Add a proper SELinux policy to allow mdadm to create /sys/kernel/debug/bdi/xxx, but not every user adds this allowance, so the kernel shows a warning for this case. 2. Split mdadm into two parts: first, the user process mdadm triggers a kworker and waits for an event; second, the kworker creates the gendisk and then wakes up the event. But this is more like a hack, and it may break the SELinux mechanism, so I gave up on this way. https://marc.info/?l=linux-mm&m=151456540928231&w=2 > -- > Jens Axboe >
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 09:23:35AM -0700, Jens Axboe wrote: > On 1/19/18 9:13 AM, Mike Snitzer wrote: > > On Fri, Jan 19 2018 at 10:48am -0500, > > Jens Axboe wrote: > > > >> On 1/19/18 8:40 AM, Ming Lei wrote: > >> Where does the dm STS_RESOURCE error usually come from - what exact > >> resource are we running out of? > > > > It is from blk_get_request(underlying queue), see > > multipath_clone_and_map(). > > That's what I thought. So for a low queue depth underlying queue, it's > quite possible that this situation can happen. Two potential solutions > I see: > > 1) As described earlier in this thread, having a mechanism for being > notified when the scarce resource becomes available. It would not > be hard to tap into the existing sbitmap wait queue for that. > > 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > allocation. I haven't read the dm code to know if this is a > possibility or not. > > > > Right, #2 is _not_ the way forward. Historically request-based DM used > > its own mempool for requests, this was to be able to have some measure > > of control and resiliency in the face of low memory conditions that > > might be affecting the broader system. > > > > Then Christoph switched over to adding per-request-data; which ushered > > in the use of blk_get_request using ATOMIC allocations. I like the > > result of that line of development. But taking the next step of setting > > BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this > > dm-mpath.c code is common to old .request_fn and blk-mq, at least the > > call to blk_get_request is). Ultimately dm-mpath would like to avoid blocking > > for a request because for this dm-mpath device we have multiple queues > > to allocate from if need be (provided we have an active-active storage > > network topology). > > If you can go to multiple devices, obviously it should not block on a > single device.
> That's only true for the case where you can only go to > one device; blocking at that point would probably be fine. Or if all > your paths are busy, then blocking would also be OK. Introducing one extra blocking point will hurt AIO performance, where there are usually far fewer jobs/processes submitting IO. -- Ming
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote: > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:26 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > On 1/19/18 9:05 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > >> On 1/19/18 8:40 AM, Ming Lei wrote: > >> Where does the dm STS_RESOURCE error usually come from - what's > >> exact > >> resource are we running out of? > > > > It is from blk_get_request(underlying queue), see > > multipath_clone_and_map(). > > That's what I thought. So for a low queue depth underlying queue, > it's > quite possible that this situation can happen. Two potential > solutions > I see: > > 1) As described earlier in this thread, having a mechanism for being > notified when the scarce resource becomes available. It would not > be hard to tap into the existing sbitmap wait queue for that. > > 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > allocation. I haven't read the dm code to know if this is a > possibility or not. > > I'd probably prefer #1. It's a classic case of trying to get the > request, and if it fails, add ourselves to the sbitmap tag wait > queue head, retry, and bail if that also fails. Connecting the > scarce resource and the consumer is the only way to really fix > this, without bogus arbitrary delays. > >>> > >>> Right, as I have replied to Bart, using mod_delayed_work_on() with > >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > >>> resource should fix this issue. > >> > >> It'll fix the forever stall, but it won't really fix it, as we'll slow > >> down the dm device by some random amount. > >> > >> A simple test case would be to have a null_blk device with a queue > >> depth > >> of one, and dm on top of that. 
Start a fio job that runs two jobs: one > >> that does IO to the underlying device, and one that does IO to the dm > >> device. If the job on the dm device runs substantially slower than the > >> one to the underlying device, then the problem isn't really fixed. > > > > I remembered that I tried this test on scsi-debug & dm-mpath over > > scsi-debug, > > seems not observed this issue, could you explain a bit why IO over > > dm-mpath > > may be slower? Because both two IO contexts call same get_request(), and > > in theory dm-mpath should be a bit quicker since it uses direct issue > > for > > underlying queue, without io scheduler involved. > > Because if you lose the race for getting the request, you'll have some > arbitrary delay before trying again, potentially. Compared to the direct > >>> > >>> But the restart still works, one request is completed, then the queue > >>> is return immediately because we use mod_delayed_work_on(0), so looks > >>> no such issue. > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the device > > directly and by dm-mpath at the same time. > > Here's an example of that, using my current block tree (merged into > master). The setup is dm-mpath on top of null_blk, the latter having > just a single request. Both are mq devices. > > Fio direct 4k random reads on dm_mq: ~250K iops > > Start dd on underlying device (or partition on same device), just doing > sequential reads. > > Fio direct 4k random reads on dm_mq with dd running: 9 iops > > No schedulers involved. > > https://i.imgur.com/WTDnnwE.gif If null_blk's timer mode is used with a bit delay introduced, I guess the effect from direct access to underlying queue shouldn't be so serious. But it still won't be good as direct access. 
Another way may be to introduce a variant of blk_get_request(), such as blk_get_request_with_notify(), then pass the current dm-rq's hctx to it and use the tag's waitqueue to handle that. But the change could be a bit big. -- Ming
Re: [PATCH] blk-throttle: use queue_is_rq_based
2018-01-20 3:58 GMT+08:00 Jens Axboe: > On 1/19/18 10:40 AM, weiping zhang wrote: >> use queue_is_rq_based instead of open code. >> >> Signed-off-by: weiping zhang >> --- >> block/blk-throttle.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/block/blk-throttle.c b/block/blk-throttle.c >> index 96ad326..457e985 100644 >> --- a/block/blk-throttle.c >> +++ b/block/blk-throttle.c >> @@ -2456,7 +2456,7 @@ void blk_throtl_register_queue(struct request_queue *q) >> td->throtl_slice = DFL_THROTL_SLICE_HD; >> #endif >> >> - td->track_bio_latency = !q->mq_ops && !q->request_fn; >> + td->track_bio_latency = !(queue_is_rq_based(q)); > > Kill the extra parenthesis here. OK, killed it in V2. > -- > Jens Axboe >
[PATCH v2] blk-throttle: use queue_is_rq_based
use queue_is_rq_based instead of open code. Signed-off-by: weiping zhang --- block/blk-throttle.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-throttle.c b/block/blk-throttle.c index e136f5e..c475f0f 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -2489,7 +2489,7 @@ void blk_throtl_register_queue(struct request_queue *q) td->throtl_slice = DFL_THROTL_SLICE_HD; #endif - td->track_bio_latency = !q->mq_ops && !q->request_fn; + td->track_bio_latency = !queue_is_rq_based(q); if (!td->track_bio_latency) blk_stat_enable_accounting(q); } -- 2.9.4
Re: [PATCH] blk-throttle: use queue_is_rq_based
On 1/19/18 10:40 AM, weiping zhang wrote: > use queue_is_rq_based instead of open code. > > Signed-off-by: weiping zhang> --- > block/blk-throttle.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c > index 96ad326..457e985 100644 > --- a/block/blk-throttle.c > +++ b/block/blk-throttle.c > @@ -2456,7 +2456,7 @@ void blk_throtl_register_queue(struct request_queue *q) > td->throtl_slice = DFL_THROTL_SLICE_HD; > #endif > > - td->track_bio_latency = !q->mq_ops && !q->request_fn; > + td->track_bio_latency = !(queue_is_rq_based(q)); Kill the extra parenthesis here. -- Jens Axboe
Re: [PATCH v2] bdi: show error log when fail to create bdi debugfs entry
On 1/19/18 10:36 AM, weiping zhang wrote: > 2018-01-11 0:36 GMT+08:00 weiping zhang: >> creation of the bdi debugfs dir/file may fail, add an error log here. >> >> Signed-off-by: weiping zhang >> --- >> V1->V2: >> fix indentation and make log message more clear >> >> mm/backing-dev.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/mm/backing-dev.c b/mm/backing-dev.c >> index b5f940c..0a49665 100644 >> --- a/mm/backing-dev.c >> +++ b/mm/backing-dev.c >> @@ -885,7 +885,9 @@ int bdi_register_va(struct backing_dev_info *bdi, const >> char *fmt, va_list args) >> cgwb_bdi_register(bdi); >> bdi->dev = dev; >> >> - bdi_debug_register(bdi, dev_name(dev)); >> + if (bdi_debug_register(bdi, dev_name(dev))) >> + pr_warn("blkdev %s: creation of bdi debugfs entries >> failed.\n", >> + dev_name(dev)); >> set_bit(WB_registered, &bdi->wb.state); >> >> spin_lock_bh(&bdi_lock); >> -- > > Hi Jens, > > mdadm has no permission to create the debugfs entry if SELinux is enabled on > Fedora and CentOS, > and we have reverted 6d0e4827b72 Revert "bdi: add error handle for > bdi_debug_register", that is to say > bdi debugfs is not a key component of the block device, this patch just > adds a warning log. Have we fixed the case where we know it will trigger? -- Jens Axboe
Re: [PATCH 0/3] Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
On 1/19/18 9:58 AM, Bart Van Assche wrote: > Hello Jens, > > Earlier today it was agreed that a blk_mq_delay_run_hw_queue() call followed > by a blk_mq_run_hw_queue() call should result in an immediate queue run. Hence > this patch series. Please consider this patch series for kernel v4.16. Thanks Bart, applied. -- Jens Axboe
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, 2018-01-19 at 15:34 +0800, Ming Lei wrote: > Could you explain a bit when SCSI target replies with BUSY very often? > > Inside initiator, we have limited the max per-LUN requests and per-host > requests already before calling .queue_rq(). That's correct. However, when a SCSI initiator and target system are communicating with each other there is no guarantee that initiator and target queue depth have been tuned properly. If the initiator queue depth and the number of requests that can be in flight according to the network protocol are both larger than the target queue depth and if the target system uses relatively slow storage (e.g. harddisks) then it can happen that the target replies with BUSY often. The Linux iSCSI initiator limits MaxOutstandingR2T (the number of requests an initiator may sent without having received an answer from the target) to one so I don't think this can happen when using iSCSI/TCP. With the SRP initiator however the maximum requests that can be in flight between initiator and target depends on the number of credits that were negotiated during login between initiator and target. Some time ago I modified the SRP initiator such that it limits the initiator queue depth to the number of SRP credits minus one (for task management). That resulted in a performance improvement due to fewer BUSY conditions at the initiator side (see also commit 7ade400aba9a ("IB/srp: Reduce number of BUSY conditions")). Another patch for the SCST SRP target driver limited the number of SRP credits to the queue depth of the block device at the target side. I'm referring to the following code: ch->rq_size = min(MAX_SRPT_RQ_SIZE, scst_get_max_lun_commands(NULL, 0)) (I have not yet had the time to port this change to LIO). Without such tuning across initiator and target it can happen often that the target system sends the reply "BUSY" back to the initiator. 
I think that's why there is code in the SCSI core to automatically adjust the initiator queue depth if the "BUSY" condition is encountered frequently. See also scsi_track_queue_full(). Bart.
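[Editor's note] The ramp-down behavior Bart refers to can be sketched as a userspace toy model. The thresholds and halving step below are invented for illustration; this is the spirit of what scsi_track_queue_full() does, not the SCSI core's actual logic:

```c
#include <assert.h>

/* Toy model of initiator queue-depth ramp-down on frequent BUSY
 * replies.  All fields and constants are illustrative. */
struct depth_tracker {
    int depth;      /* current initiator queue depth */
    int min_depth;  /* never throttle below this */
    int busy_seen;  /* BUSY replies since the last adjustment */
    int threshold;  /* BUSYs tolerated before stepping down */
};

/* Called per completion; returns the (possibly reduced) queue depth. */
static int track_busy(struct depth_tracker *t, int was_busy)
{
    if (!was_busy) {
        t->busy_seen = 0;          /* a clean completion resets the count */
        return t->depth;
    }
    if (++t->busy_seen >= t->threshold) {
        t->busy_seen = 0;
        if (t->depth / 2 >= t->min_depth)
            t->depth /= 2;         /* halve the depth, floored at min */
        else
            t->depth = t->min_depth;
    }
    return t->depth;
}
```

The design point is the same one made above: when the initiator cannot know the target's effective queue depth in advance, it has to converge on it from the BUSY feedback, whereas explicit tuning (as with the SRP credits) avoids the BUSYs in the first place.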
Re: [PATCH] lib/scatterlist: Fix chaining support in sgl_alloc_order()
On 1/19/18 12:00 PM, Bart Van Assche wrote: > This patch avoids that workloads with large block sizes (megabytes) > can trigger the following call stack with the ib_srpt driver (that > driver is the only driver that chains scatterlists allocated by > sgl_alloc_order()): > > BUG: Bad page state in process kworker/0:1H pfn:2423a78 > page:fb03d08e9e00 count:-3 mapcount:0 mapping: (null) index:0x0 > flags: 0x57c000() > raw: 0057c000 fffd > raw: dead0100 dead0200 > page dumped because: nonzero _count > CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G I > 4.15.0-rc7.bart+ #1 > Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015 > Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] > Call Trace: > dump_stack+0x5c/0x83 > bad_page+0xf5/0x10f > get_page_from_freelist+0xa46/0x11b0 > __alloc_pages_nodemask+0x103/0x290 > sgl_alloc_order+0x101/0x180 > target_alloc_sgl+0x2c/0x40 [target_core_mod] > srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt] > srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt] > __ib_process_cq+0x55/0xa0 [ib_core] > ib_cq_poll_work+0x1b/0x60 [ib_core] > process_one_work+0x141/0x340 > worker_thread+0x47/0x3e0 > kthread+0xf5/0x130 > ret_from_fork+0x1f/0x30 Applied, thanks. -- Jens Axboe
Re: [GIT PULL] additional nvme fixes and cleanups for Linux 4.16
On 1/19/18 12:02 PM, Christoph Hellwig wrote: > The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993: > > blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() > (2018-01-14 10:46:24 -0700) > > are available in the git repository at: > > git://git.infradead.org/nvme.git nvme-4.16 > > for you to fetch changes up to 88de4598bca84e27b261685c06fff816b8d932a1: > > nvme-pci: clean up SMBSZ bit definitions (2018-01-17 17:55:14 +0100) > > > Christoph Hellwig (2): > nvme-pci: clean up CMB initialization > nvme-pci: clean up SMBSZ bit definitions > > James Smart (2): > nvme-fc: fix rogue admin cmds stalling teardown > nvme-fc: correct hang in nvme_ns_remove() > > Minwoo Im (1): > nvme: fix comment typos in nvme_create_io_queues > > Roland Dreier (1): > nvme-fabrics: fix memory leak when parsing host ID option > > Roy Shterman (1): > nvme: host delete_work and reset_work on separate workqueues > > Sagi Grimberg (3): > nvme-pci: serialize pci resets > nvme-pci: allocate device queues storage space at probe > nvmet: release a ns reference in nvmet_req_uninit if needed > > drivers/nvme/host/core.c| 47 ++-- > drivers/nvme/host/fabrics.c | 4 +- > drivers/nvme/host/fc.c | 6 ++ > drivers/nvme/host/nvme.h| 3 + > drivers/nvme/host/pci.c | 133 > > drivers/nvme/host/rdma.c| 2 +- > drivers/nvme/target/core.c | 3 + > drivers/nvme/target/loop.c | 2 +- > include/linux/nvme.h| 22 +--- > 9 files changed, 132 insertions(+), 90 deletions(-) Pulled, thanks. -- Jens Axboe
Re: [PATCH] lib/scatterlist: Fix chaining support in sgl_alloc_order()
On Fri, 2018-01-19 at 11:00 -0800, Bart Van Assche wrote: > This patch avoids that workloads with large block sizes (megabytes) > can trigger the following call stack with the ib_srpt driver (that > driver is the only driver that chains scatterlists allocated by > sgl_alloc_order()): > > BUG: Bad page state in process kworker/0:1H pfn:2423a78 > page:fb03d08e9e00 count:-3 mapcount:0 mapping: (null) > index:0x0 > flags: 0x57c000() > raw: 0057c000 > fffd > raw: dead0100 dead0200 > > page dumped because: nonzero _count > CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G I 4.15.0- > rc7.bart+ #1 > Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015 > Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] > Call Trace: > dump_stack+0x5c/0x83 > bad_page+0xf5/0x10f > get_page_from_freelist+0xa46/0x11b0 > __alloc_pages_nodemask+0x103/0x290 > sgl_alloc_order+0x101/0x180 > target_alloc_sgl+0x2c/0x40 [target_core_mod] > srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt] > srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt] > __ib_process_cq+0x55/0xa0 [ib_core] > ib_cq_poll_work+0x1b/0x60 [ib_core] > process_one_work+0x141/0x340 > worker_thread+0x47/0x3e0 > kthread+0xf5/0x130 > ret_from_fork+0x1f/0x30 > > Fixes: e80a0af4759a ("lib/scatterlist: Introduce sgl_alloc() and > sgl_free()") > Reported-by: Laurence Oberman> Signed-off-by: Bart Van Assche > Cc: Nicholas A. 
Bellinger > Cc: Laurence Oberman > --- > drivers/target/target_core_transport.c | 2 +- > include/linux/scatterlist.h| 1 + > lib/scatterlist.c | 32 > +++- > 3 files changed, 29 insertions(+), 6 deletions(-) > > diff --git a/drivers/target/target_core_transport.c > b/drivers/target/target_core_transport.c > index a001ba711cca..c03a78ee26cd 100644 > --- a/drivers/target/target_core_transport.c > +++ b/drivers/target/target_core_transport.c > @@ -2300,7 +2300,7 @@ static void target_complete_ok_work(struct > work_struct *work) > > void target_free_sgl(struct scatterlist *sgl, int nents) > { > - sgl_free(sgl); > + sgl_free_n_order(sgl, nents, 0); > } > EXPORT_SYMBOL(target_free_sgl); > > diff --git a/include/linux/scatterlist.h > b/include/linux/scatterlist.h > index b8a7c1d1dbe3..22b2131bcdcd 100644 > --- a/include/linux/scatterlist.h > +++ b/include/linux/scatterlist.h > @@ -282,6 +282,7 @@ struct scatterlist *sgl_alloc_order(unsigned long > long length, > gfp_t gfp, unsigned int > *nent_p); > struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, > unsigned int *nent_p); > +void sgl_free_n_order(struct scatterlist *sgl, int nents, int > order); > void sgl_free_order(struct scatterlist *sgl, int order); > void sgl_free(struct scatterlist *sgl); > #endif /* CONFIG_SGL_ALLOC */ > diff --git a/lib/scatterlist.c b/lib/scatterlist.c > index 9afc9b432083..53728d391d3a 100644 > --- a/lib/scatterlist.c > +++ b/lib/scatterlist.c > @@ -512,7 +512,7 @@ struct scatterlist *sgl_alloc_order(unsigned long > long length, > if (!sgl) > return NULL; > > - sg_init_table(sgl, nent); > + sg_init_table(sgl, nalloc); > sg = sgl; > while (length) { > elem_len = min_t(u64, length, PAGE_SIZE << order); > @@ -526,7 +526,7 @@ struct scatterlist *sgl_alloc_order(unsigned long > long length, > length -= elem_len; > sg = sg_next(sg); > } > - WARN_ON_ONCE(sg); > + WARN_ONCE(length, "length = %lld\n", length); > if (nent_p) > *nent_p = nent; > return sgl; > @@ -549,22 +549,44 @@ struct 
scatterlist *sgl_alloc(unsigned long > long length, gfp_t gfp, > EXPORT_SYMBOL(sgl_alloc); > > /** > - * sgl_free_order - free a scatterlist and its pages > + * sgl_free_n_order - free a scatterlist and its pages > * @sgl: Scatterlist with one or more elements > + * @nents: Maximum number of elements to free > * @order: Second argument for __free_pages() > + * > + * Notes: > + * - If several scatterlists have been chained and each chain > element is > + * freed separately then it's essential to set nents correctly to > avoid that a > + * page would get freed twice. > + * - All pages in a chained scatterlist can be freed at once by > setting @nents > + * to a high number. > */ > -void sgl_free_order(struct scatterlist *sgl, int order) > +void sgl_free_n_order(struct scatterlist *sgl, int nents, int order) > { > struct scatterlist *sg; > struct page *page; > + int i; > > - for (sg = sgl; sg; sg = sg_next(sg)) { > + for_each_sg(sgl, sg, nents, i) { > + if (!sg) > + break; > page = sg_page(sg); > if (page) >
[GIT PULL] additional nvme fixes and cleanups for Linux 4.16
The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993: blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() (2018-01-14 10:46:24 -0700) are available in the git repository at: git://git.infradead.org/nvme.git nvme-4.16 for you to fetch changes up to 88de4598bca84e27b261685c06fff816b8d932a1: nvme-pci: clean up SMBSZ bit definitions (2018-01-17 17:55:14 +0100) Christoph Hellwig (2): nvme-pci: clean up CMB initialization nvme-pci: clean up SMBSZ bit definitions James Smart (2): nvme-fc: fix rogue admin cmds stalling teardown nvme-fc: correct hang in nvme_ns_remove() Minwoo Im (1): nvme: fix comment typos in nvme_create_io_queues Roland Dreier (1): nvme-fabrics: fix memory leak when parsing host ID option Roy Shterman (1): nvme: host delete_work and reset_work on separate workqueues Sagi Grimberg (3): nvme-pci: serialize pci resets nvme-pci: allocate device queues storage space at probe nvmet: release a ns reference in nvmet_req_uninit if needed drivers/nvme/host/core.c| 47 ++-- drivers/nvme/host/fabrics.c | 4 +- drivers/nvme/host/fc.c | 6 ++ drivers/nvme/host/nvme.h| 3 + drivers/nvme/host/pci.c | 133 drivers/nvme/host/rdma.c| 2 +- drivers/nvme/target/core.c | 3 + drivers/nvme/target/loop.c | 2 +- include/linux/nvme.h| 22 +--- 9 files changed, 132 insertions(+), 90 deletions(-)
[PATCH] lib/scatterlist: Fix chaining support in sgl_alloc_order()
This patch avoids that workloads with large block sizes (megabytes) can trigger the following call stack with the ib_srpt driver (that driver is the only driver that chains scatterlists allocated by sgl_alloc_order()): BUG: Bad page state in process kworker/0:1H pfn:2423a78 page:fb03d08e9e00 count:-3 mapcount:0 mapping: (null) index:0x0 flags: 0x57c000() raw: 0057c000 fffd raw: dead0100 dead0200 page dumped because: nonzero _count CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G I 4.15.0-rc7.bart+ #1 Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015 Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] Call Trace: dump_stack+0x5c/0x83 bad_page+0xf5/0x10f get_page_from_freelist+0xa46/0x11b0 __alloc_pages_nodemask+0x103/0x290 sgl_alloc_order+0x101/0x180 target_alloc_sgl+0x2c/0x40 [target_core_mod] srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt] srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt] __ib_process_cq+0x55/0xa0 [ib_core] ib_cq_poll_work+0x1b/0x60 [ib_core] process_one_work+0x141/0x340 worker_thread+0x47/0x3e0 kthread+0xf5/0x130 ret_from_fork+0x1f/0x30 Fixes: e80a0af4759a ("lib/scatterlist: Introduce sgl_alloc() and sgl_free()") Reported-by: Laurence ObermanSigned-off-by: Bart Van Assche Cc: Nicholas A. 
Bellinger Cc: Laurence Oberman --- drivers/target/target_core_transport.c | 2 +- include/linux/scatterlist.h| 1 + lib/scatterlist.c | 32 +++- 3 files changed, 29 insertions(+), 6 deletions(-) diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index a001ba711cca..c03a78ee26cd 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -2300,7 +2300,7 @@ static void target_complete_ok_work(struct work_struct *work) void target_free_sgl(struct scatterlist *sgl, int nents) { - sgl_free(sgl); + sgl_free_n_order(sgl, nents, 0); } EXPORT_SYMBOL(target_free_sgl); diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index b8a7c1d1dbe3..22b2131bcdcd 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -282,6 +282,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, gfp_t gfp, unsigned int *nent_p); struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); +void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 9afc9b432083..53728d391d3a 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -512,7 +512,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, if (!sgl) return NULL; - sg_init_table(sgl, nent); + sg_init_table(sgl, nalloc); sg = sgl; while (length) { elem_len = min_t(u64, length, PAGE_SIZE << order); @@ -526,7 +526,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, length -= elem_len; sg = sg_next(sg); } - WARN_ON_ONCE(sg); + WARN_ONCE(length, "length = %lld\n", length); if (nent_p) *nent_p = nent; return sgl; @@ -549,22 +549,44 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, EXPORT_SYMBOL(sgl_alloc); /** - * sgl_free_order - free a 
scatterlist and its pages + * sgl_free_n_order - free a scatterlist and its pages * @sgl: Scatterlist with one or more elements + * @nents: Maximum number of elements to free * @order: Second argument for __free_pages() + * + * Notes: + * - If several scatterlists have been chained and each chain element is + * freed separately then it's essential to set nents correctly to avoid that a + * page would get freed twice. + * - All pages in a chained scatterlist can be freed at once by setting @nents + * to a high number. */ -void sgl_free_order(struct scatterlist *sgl, int order) +void sgl_free_n_order(struct scatterlist *sgl, int nents, int order) { struct scatterlist *sg; struct page *page; + int i; - for (sg = sgl; sg; sg = sg_next(sg)) { + for_each_sg(sgl, sg, nents, i) { + if (!sg) + break; page = sg_page(sg); if (page) __free_pages(page, order); } kfree(sgl); } +EXPORT_SYMBOL(sgl_free_n_order); + +/** + * sgl_free_order - free a scatterlist and its pages + * @sgl: Scatterlist with one or more elements + * @order: Second argument for __free_pages() +
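[Editor's note] The reason the patch caps the free walk at @nents can be shown with a toy model: when two lists are chained, the tail is owned by (and freed through) the second list, so walking past the first list's own entries would free pages twice. This uses a plain linked list standing in for chained scatterlists, not the kernel's scatterlist layout:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy segment: one payload allocation per entry, chained via next. */
struct seg {
    struct seg *next;
    int        *payload;
};

/* Free at most @nents segments' payloads, mirroring the patch's
 * for_each_sg() walk bounded by nents; returns how many were freed. */
static int free_n_segs(struct seg *sg, int nents)
{
    int freed = 0;
    for (int i = 0; sg && i < nents; i++, sg = sg->next) {
        if (sg->payload) {
            free(sg->payload);
            sg->payload = NULL;   /* guard against a second walk */
            freed++;
        }
    }
    return freed;
}
```

With nents capped, the chained tail's payload survives the first free and can safely be freed by its owner later; an unbounded walk (the old `for (sg = sgl; sg; sg = sg_next(sg))`) would have freed it twice.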
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19 2018 at 12:38pm -0500, Jens Axboe wrote: > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the device > > directly and by dm-mpath at the same time. > > Here's an example of that, using my current block tree (merged into > master). The setup is dm-mpath on top of null_blk, the latter having > just a single request. Both are mq devices. > > Fio direct 4k random reads on dm_mq: ~250K iops > > Start dd on underlying device (or partition on same device), just doing > sequential reads. > > Fio direct 4k random reads on dm_mq with dd running: 9 iops > > No schedulers involved. > > https://i.imgur.com/WTDnnwE.gif FYI, your tree doesn't have these dm-4.16 changes (which are needed to make Ming's blk-mq changes that are in your tree, commit 396eaf21e et al, have Ming's desired effect on DM): https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=050af08ffb1b62af69196d61c22a0755f9a3cdbd https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=459b54019cfeb7330ed4863ad40f78489e0ff23d https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec Fact that you're seeing such shit results without dm-4.16 commit ec3eaf9a67 (that reverts older commit 6077c2d706) means: yeap, this is really awful, let's fix it!
But it is a different flavor of awful because the dm-rq.c:map_request() will handle the DM_MAPIO_DELAY_REQUEUE result from the null_blk's blk_get_request() failure using dm_mq_delay_requeue_request() against the dm-mq mpath device: blk_mq_requeue_request(rq, false); __dm_mq_kick_requeue_list(rq->q, msecs); So begs the question: why are we stalling _exactly_? (you may have it all figured out, as your gif implies.. but I'm not there yet). Might be interesting to see how your same test behaves without all of the churn we've staged for 4.16, e.g. against v4.15-rc8 Mike
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote: > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:26 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > On 1/19/18 9:05 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > >> On 1/19/18 8:40 AM, Ming Lei wrote: > >> Where does the dm STS_RESOURCE error usually come from - what's > >> exact > >> resource are we running out of? > > > > It is from blk_get_request(underlying queue), see > > multipath_clone_and_map(). > > That's what I thought. So for a low queue depth underlying queue, > it's > quite possible that this situation can happen. Two potential > solutions > I see: > > 1) As described earlier in this thread, having a mechanism for being > notified when the scarce resource becomes available. It would not > be hard to tap into the existing sbitmap wait queue for that. > > 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > allocation. I haven't read the dm code to know if this is a > possibility or not. > > I'd probably prefer #1. It's a classic case of trying to get the > request, and if it fails, add ourselves to the sbitmap tag wait > queue head, retry, and bail if that also fails. Connecting the > scarce resource and the consumer is the only way to really fix > this, without bogus arbitrary delays. > >>> > >>> Right, as I have replied to Bart, using mod_delayed_work_on() with > >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > >>> resource should fix this issue. > >> > >> It'll fix the forever stall, but it won't really fix it, as we'll slow > >> down the dm device by some random amount. > >> > >> A simple test case would be to have a null_blk device with a queue > >> depth > >> of one, and dm on top of that. 
Start a fio job that runs two jobs: one > >> that does IO to the underlying device, and one that does IO to the dm > >> device. If the job on the dm device runs substantially slower than the > >> one to the underlying device, then the problem isn't really fixed. > > > > I remembered that I tried this test on scsi-debug & dm-mpath over > > scsi-debug, > > seems not observed this issue, could you explain a bit why IO over > > dm-mpath > > may be slower? Because both two IO contexts call same get_request(), and > > in theory dm-mpath should be a bit quicker since it uses direct issue > > for > > underlying queue, without io scheduler involved. > > Because if you lose the race for getting the request, you'll have some > arbitrary delay before trying again, potentially. Compared to the direct > >>> > >>> But the restart still works, one request is completed, then the queue > >>> is return immediately because we use mod_delayed_work_on(0), so looks > >>> no such issue. > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the device > > directly and by dm-mpath at the same time. > > Here's an example of that, using my current block tree (merged into > master). The setup is dm-mpath on top of null_blk, the latter having > just a single request. Both are mq devices. > > Fio direct 4k random reads on dm_mq: ~250K iops > > Start dd on underlying device (or partition on same device), just doing > sequential reads. > > Fio direct 4k random reads on dm_mq with dd running: 9 iops > > No schedulers involved. > > https://i.imgur.com/WTDnnwE.gif This DM specific issue might be addressed by applying notifier_chain (or similar mechanism)between the two queues, will think about the details tomorrow. -- Ming
[PATCH] blk-throttle: use queue_is_rq_based
use queue_is_rq_based instead of open code.

Signed-off-by: weiping zhang
---
 block/blk-throttle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 96ad326..457e985 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2456,7 +2456,7 @@ void blk_throtl_register_queue(struct request_queue *q)
 		td->throtl_slice = DFL_THROTL_SLICE_HD;
 #endif
 
-	td->track_bio_latency = !q->mq_ops && !q->request_fn;
+	td->track_bio_latency = !(queue_is_rq_based(q));
 	if (!td->track_bio_latency)
 		blk_stat_enable_accounting(q);
 }
-- 
2.9.4
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 9:37 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: >> On 1/19/18 9:26 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: On 1/19/18 9:05 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: >> On 1/19/18 8:40 AM, Ming Lei wrote: >> Where does the dm STS_RESOURCE error usually come from - what's exact >> resource are we running out of? > > It is from blk_get_request(underlying queue), see > multipath_clone_and_map(). That's what I thought. So for a low queue depth underlying queue, it's quite possible that this situation can happen. Two potential solutions I see: 1) As described earlier in this thread, having a mechanism for being notified when the scarce resource becomes available. It would not be hard to tap into the existing sbitmap wait queue for that. 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource allocation. I haven't read the dm code to know if this is a possibility or not. I'd probably prefer #1. It's a classic case of trying to get the request, and if it fails, add ourselves to the sbitmap tag wait queue head, retry, and bail if that also fails. Connecting the scarce resource and the consumer is the only way to really fix this, without bogus arbitrary delays. >>> >>> Right, as I have replied to Bart, using mod_delayed_work_on() with >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce >>> resource should fix this issue. >> >> It'll fix the forever stall, but it won't really fix it, as we'll slow >> down the dm device by some random amount. >> >> A simple test case would be to have a null_blk device with a queue depth >> of one, and dm on top of that. Start a fio job that runs two jobs: one >> that does IO to the underlying device, and one that does IO to the dm >> device. If the job on the dm device runs substantially slower than the >> one to the underlying device, then the problem isn't really fixed. 
> > I remembered that I tried this test on scsi-debug & dm-mpath over > scsi-debug, > seems not observed this issue, could you explain a bit why IO over > dm-mpath > may be slower? Because both two IO contexts call same get_request(), and > in theory dm-mpath should be a bit quicker since it uses direct issue for > underlying queue, without io scheduler involved. Because if you lose the race for getting the request, you'll have some arbitrary delay before trying again, potentially. Compared to the direct >>> >>> But the restart still works, one request is completed, then the queue >>> is return immediately because we use mod_delayed_work_on(0), so looks >>> no such issue. >> >> There are no pending requests for this case, nothing to restart the >> queue. When you fail that blk_get_request(), you are idle, nothing >> is pending. > > I think we needn't worry about that, once a device is attached to > dm-rq, it can't be mounted any more, and usually user don't use the device > directly and by dm-mpath at the same time. Here's an example of that, using my current block tree (merged into master). The setup is dm-mpath on top of null_blk, the latter having just a single request. Both are mq devices. Fio direct 4k random reads on dm_mq: ~250K iops Start dd on underlying device (or partition on same device), just doing sequential reads. Fio direct 4k random reads on dm_mq with dd running: 9 iops No schedulers involved. https://i.imgur.com/WTDnnwE.gif -- Jens Axboe
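As a concrete sketch of the reproduction Jens describes, a two-job fio file might look roughly like the following. The device names (/dev/nullb0, /dev/dm-0) are assumptions, and null_blk with queue depth 1 plus dm-mpath (dm-mq) stacked on top of it must be set up beforehand:

```ini
; Sketch only: device paths are assumed, not taken from the thread.
; Prerequisite: modprobe null_blk with hw_queue_depth=1 and a dm-mq
; multipath device layered on /dev/nullb0.
[global]
direct=1
ioengine=libaio
bs=4k
rw=randread
runtime=30
time_based

[underlying]
filename=/dev/nullb0

[dm-mq]
filename=/dev/dm-0
```

If the [dm-mq] job reports substantially fewer iops than [underlying], the requeue-delay problem discussed above is being hit.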
Re: [PATCH v2] bdi: show error log when fail to create bdi debugfs entry
2018-01-11 0:36 GMT+08:00 weiping zhang:
> bdi debugfs dir/file may create fail, add error log here.
>
> Signed-off-by: weiping zhang
> ---
> V1->V2:
> fix indentation and make log message more clear
>
>  mm/backing-dev.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index b5f940c..0a49665 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -885,7 +885,9 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
>  	cgwb_bdi_register(bdi);
>  	bdi->dev = dev;
>
> -	bdi_debug_register(bdi, dev_name(dev));
> +	if (bdi_debug_register(bdi, dev_name(dev)))
> +		pr_warn("blkdev %s: creation of bdi debugfs entries failed.\n",
> +			dev_name(dev));
>  	set_bit(WB_registered, &bdi->wb.state);
>
>  	spin_lock_bh(&bdi_lock);
> --

Hi Jens,

mdadm has no permission to create a debugfs entry if SELinux is enabled on Fedora and CentOS, and we already have commit 6d0e4827b72 (Revert "bdi: add error handle for bdi_debug_register"); that is to say, bdi debugfs is not a key component of the block device, and this patch just adds a warning log.

Thanks
Re: [PATCH 3/3] block: Remove kblockd_schedule_delayed_work{,_on}()
On Fri, Jan 19 2018 at 11:58am -0500, Bart Van Assche wrote: > The previous patch removed all users of these two functions. Hence > also remove the functions themselves. > > Signed-off-by: Bart Van Assche Reviewed-by: Mike Snitzer
Re: [PATCH 2/3] blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
On Fri, Jan 19 2018 at 11:58am -0500, Bart Van Assche wrote: > Make sure that calling blk_mq_run_hw_queue() or > blk_mq_kick_requeue_list() triggers a queue run without delay even > if blk_mq_delay_run_hw_queue() has been called recently and if its > delay has not yet expired. > > Signed-off-by: Bart Van Assche Reviewed-by: Mike Snitzer
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 10:09:11AM -0700, Jens Axboe wrote: > On 1/19/18 10:05 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:47 AM, Mike Snitzer wrote: > >>> On Fri, Jan 19 2018 at 11:41am -0500, > >>> Jens Axboewrote: > >>> > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:26 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the > > device > > directly and by dm-mpath at the same time. > > Even if it doesn't happen for a normal dm setup, it is a case that > needs to be handled. The request allocation is just one example of > a wider scope resource that can be unavailable. If the driver returns > NO_DEV_RESOURCE (or whatever name), it will be a possibility that > the device itself is currently idle. > >>> > >>> How would a driver's resources be exhausted yet the device is idle (so > >>> as not to be able to benefit from RESTART)? > >> > >> I've outlined a number of these examples already. Another case might be: > >> > >> 1) Device is idle > >> 2) Device gets request > >> 3) Device attempts to DMA map > >> 4) DMA map fails because the IOMMU is out of space (nic is using it all) > >> 5) Device returns STS_RESOURCE > >> 6) Queue is marked as needing a restart > >> > >> All's well, except there is no IO on this device that will notice the > >> restart bit and retry the operation. > >> > >> Replace IOMMU failure with any other resource that the driver might need > >> for an IO, which isn't tied to a device specific resource. 
> >> blk_get_request() on dm is an example, as is any allocation failure > >> occurring in the queue IO path for the driver. > > > > Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all > > the current STS_RESOURCE for non-device resource allocation (kmalloc, dma > > map, get_request, ...) should be converted to NO_DEV_RESOURCE. Or simply introduce BLK_STS_DEV_RESOURCE and convert the cases that run out of a real device resource into it; this way may be much easier to do. -- Ming
Re: [PATCH 1/3] blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
On Fri, Jan 19 2018 at 11:58am -0500, Bart Van Assche wrote: > Most blk-mq functions have a name that follows the pattern blk_mq_${action}. > However, the function name blk_mq_request_direct_issue is an exception. > Hence rename this function. This patch does not change any functionality. > > Signed-off-by: Bart Van Assche Reviewed-by: Mike Snitzer
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote: > On 1/19/18 9:47 AM, Mike Snitzer wrote: > > On Fri, Jan 19 2018 at 11:41am -0500, > > Jens Axboewrote: > > > >> On 1/19/18 9:37 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > On 1/19/18 9:26 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > > There are no pending requests for this case, nothing to restart the > queue. When you fail that blk_get_request(), you are idle, nothing > is pending. > >>> > >>> I think we needn't worry about that, once a device is attached to > >>> dm-rq, it can't be mounted any more, and usually user don't use the device > >>> directly and by dm-mpath at the same time. > >> > >> Even if it doesn't happen for a normal dm setup, it is a case that > >> needs to be handled. The request allocation is just one example of > >> a wider scope resource that can be unavailable. If the driver returns > >> NO_DEV_RESOURCE (or whatever name), it will be a possibility that > >> the device itself is currently idle. > > > > How would a driver's resources be exhausted yet the device is idle (so > > as not to be able to benefit from RESTART)? > > I've outlined a number of these examples already. Another case might be: > > 1) Device is idle > 2) Device gets request > 3) Device attempts to DMA map > 4) DMA map fails because the IOMMU is out of space (nic is using it all) > 5) Device returns STS_RESOURCE > 6) Queue is marked as needing a restart > > All's well, except there is no IO on this device that will notice the > restart bit and retry the operation. > > Replace IOMMU failure with any other resource that the driver might need > for an IO, which isn't tied to a device specific resource. > blk_get_request() on dm is an example, as is any allocation failure > occurring in the queue IO path for the driver. 
Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all the current STS_RESOURCE for non-device resource allocation (kmalloc, dma map, get_request, ...) should be converted to NO_DEV_RESOURCE. And it is a generic issue, which needs a generic solution. Seems running the queue after an arbitrary delay in this case is the only way we thought of, or are there other solutions? If the decision is made, let's make/review the patch, :-) -- Ming
[PATCH 3/3] block: Remove kblockd_schedule_delayed_work{,_on}()
The previous patch removed all users of these two functions. Hence also
remove the functions themselves.

Signed-off-by: Bart Van Assche
---
 block/blk-core.c       | 14 --
 include/linux/blkdev.h |  2 --
 2 files changed, 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5c21d6b97985..a2005a485335 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3446,20 +3446,6 @@ int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
 }
 EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
 
-int kblockd_schedule_delayed_work(struct delayed_work *dwork,
-				  unsigned long delay)
-{
-	return queue_delayed_work(kblockd_workqueue, dwork, delay);
-}
-EXPORT_SYMBOL(kblockd_schedule_delayed_work);
-
-int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
-				     unsigned long delay)
-{
-	return queue_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
-}
-EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
-
 /**
  * blk_start_plug - initialize blk_plug and track it inside the task_struct
  * @plug: The blk_plug that needs to be initialized
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a4c7aafb686d..4f3df807cf8f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1817,8 +1817,6 @@ static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
 
 int kblockd_schedule_work(struct work_struct *work);
 int kblockd_schedule_work_on(int cpu, struct work_struct *work);
-int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long delay);
-int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay);
 int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay);
 
 #ifdef CONFIG_BLK_CGROUP
-- 
2.15.1
[PATCH 0/3] Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
Hello Jens,

Earlier today it was agreed that a blk_mq_delay_run_hw_queue() call
followed by a blk_mq_run_hw_queue() call should result in an immediate
queue run. Hence this patch series. Please consider this patch series
for kernel v4.16.

Thanks,

Bart.

Bart Van Assche (3):
  blk-mq: Rename blk_mq_request_direct_issue() into
    blk_mq_request_issue_directly()
  blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended
    delays
  block: Remove kblockd_schedule_delayed_work{,_on}()

 block/blk-core.c       | 16 +---
 block/blk-mq.c         | 11 +--
 block/blk-mq.h         |  2 +-
 include/linux/blkdev.h |  2 --
 4 files changed, 7 insertions(+), 24 deletions(-)

-- 
2.15.1
[PATCH 2/3] blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
Make sure that calling blk_mq_run_hw_queue() or
blk_mq_kick_requeue_list() triggers a queue run without delay even
if blk_mq_delay_run_hw_queue() has been called recently and if its
delay has not yet expired.

Signed-off-by: Bart Van Assche
---
 block/blk-mq.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6859b5492509..01f271d40825 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -787,7 +787,7 @@ EXPORT_SYMBOL(blk_mq_add_to_requeue_list);
 
 void blk_mq_kick_requeue_list(struct request_queue *q)
 {
-	kblockd_schedule_delayed_work(&q->requeue_work, 0);
+	kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
 }
 EXPORT_SYMBOL(blk_mq_kick_requeue_list);
 
@@ -1403,9 +1403,8 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
 		put_cpu();
 	}
 
-	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
-					 &hctx->run_work,
-					 msecs_to_jiffies(msecs));
+	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
+				    msecs_to_jiffies(msecs));
 }
 
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
-- 
2.15.1
[PATCH 1/3] blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
Most blk-mq functions have a name that follows the pattern blk_mq_${action}.
However, the function name blk_mq_request_direct_issue is an exception.
Hence rename this function. This patch does not change any functionality.

Signed-off-by: Bart Van Assche
---
 block/blk-core.c | 2 +-
 block/blk-mq.c   | 4 ++--
 block/blk-mq.h   | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ae295c1be2d..5c21d6b97985 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2525,7 +2525,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 		 * bypass a potential scheduler on the bottom device for
 		 * insert.
 		 */
-		return blk_mq_request_direct_issue(rq);
+		return blk_mq_request_issue_directly(rq);
 	}
 
 	spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7515dd95a36..6859b5492509 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1787,7 +1787,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	 * RCU or SRCU read lock is needed before checking quiesced flag.
 	 *
 	 * When queue is stopped or quiesced, ignore 'bypass_insert' from
-	 * blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
+	 * blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
 	 * and avoid driver to try to dispatch again.
 	 */
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
@@ -1835,7 +1835,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	hctx_unlock(hctx, srcu_idx);
 }
 
-blk_status_t blk_mq_request_direct_issue(struct request *rq)
+blk_status_t blk_mq_request_issue_directly(struct request *rq)
 {
 	blk_status_t ret;
 	int srcu_idx;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e3ebc93646ca..88c558f71819 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -75,7 +75,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 				struct list_head *list);
 
 /* Used by blk_insert_cloned_request() to issue request directly */
-blk_status_t blk_mq_request_direct_issue(struct request *rq);
+blk_status_t blk_mq_request_issue_directly(struct request *rq);
 
 /*
  * CPU -> queue mappings
-- 
2.15.1
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 9:47 AM, Mike Snitzer wrote: > On Fri, Jan 19 2018 at 11:41am -0500, > Jens Axboewrote: > >> On 1/19/18 9:37 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: On 1/19/18 9:26 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: There are no pending requests for this case, nothing to restart the queue. When you fail that blk_get_request(), you are idle, nothing is pending. >>> >>> I think we needn't worry about that, once a device is attached to >>> dm-rq, it can't be mounted any more, and usually user don't use the device >>> directly and by dm-mpath at the same time. >> >> Even if it doesn't happen for a normal dm setup, it is a case that >> needs to be handled. The request allocation is just one example of >> a wider scope resource that can be unavailable. If the driver returns >> NO_DEV_RESOURCE (or whatever name), it will be a possibility that >> the device itself is currently idle. > > How would a driver's resources be exhausted yet the device is idle (so > as not to be able to benefit from RESTART)? I've outlined a number of these examples already. Another case might be: 1) Device is idle 2) Device gets request 3) Device attempts to DMA map 4) DMA map fails because the IOMMU is out of space (nic is using it all) 5) Device returns STS_RESOURCE 6) Queue is marked as needing a restart All's well, except there is no IO on this device that will notice the restart bit and retry the operation. Replace IOMMU failure with any other resource that the driver might need for an IO, which isn't tied to a device specific resource. blk_get_request() on dm is an example, as is any allocation failure occurring in the queue IO path for the driver. -- Jens Axboe
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19 2018 at 11:41am -0500, Jens Axboewrote: > On 1/19/18 9:37 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:26 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >> > >> There are no pending requests for this case, nothing to restart the > >> queue. When you fail that blk_get_request(), you are idle, nothing > >> is pending. > > > > I think we needn't worry about that, once a device is attached to > > dm-rq, it can't be mounted any more, and usually user don't use the device > > directly and by dm-mpath at the same time. > > Even if it doesn't happen for a normal dm setup, it is a case that > needs to be handled. The request allocation is just one example of > a wider scope resource that can be unavailable. If the driver returns > NO_DEV_RESOURCE (or whatever name), it will be a possibility that > the device itself is currently idle. How would a driver's resources be exhausted yet the device is idle (so as not to be able to benefit from RESTART)?
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 9:37 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: >> On 1/19/18 9:26 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: On 1/19/18 9:05 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: >> On 1/19/18 8:40 AM, Ming Lei wrote: >> Where does the dm STS_RESOURCE error usually come from - what's exact >> resource are we running out of? > > It is from blk_get_request(underlying queue), see > multipath_clone_and_map(). That's what I thought. So for a low queue depth underlying queue, it's quite possible that this situation can happen. Two potential solutions I see: 1) As described earlier in this thread, having a mechanism for being notified when the scarce resource becomes available. It would not be hard to tap into the existing sbitmap wait queue for that. 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource allocation. I haven't read the dm code to know if this is a possibility or not. I'd probably prefer #1. It's a classic case of trying to get the request, and if it fails, add ourselves to the sbitmap tag wait queue head, retry, and bail if that also fails. Connecting the scarce resource and the consumer is the only way to really fix this, without bogus arbitrary delays. >>> >>> Right, as I have replied to Bart, using mod_delayed_work_on() with >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce >>> resource should fix this issue. >> >> It'll fix the forever stall, but it won't really fix it, as we'll slow >> down the dm device by some random amount. >> >> A simple test case would be to have a null_blk device with a queue depth >> of one, and dm on top of that. Start a fio job that runs two jobs: one >> that does IO to the underlying device, and one that does IO to the dm >> device. If the job on the dm device runs substantially slower than the >> one to the underlying device, then the problem isn't really fixed. 
> > I remembered that I tried this test on scsi-debug & dm-mpath over > scsi-debug, > seems not observed this issue, could you explain a bit why IO over > dm-mpath > may be slower? Because both two IO contexts call same get_request(), and > in theory dm-mpath should be a bit quicker since it uses direct issue for > underlying queue, without io scheduler involved. Because if you lose the race for getting the request, you'll have some arbitrary delay before trying again, potentially. Compared to the direct >>> >>> But the restart still works, one request is completed, then the queue >>> is return immediately because we use mod_delayed_work_on(0), so looks >>> no such issue. >> >> There are no pending requests for this case, nothing to restart the >> queue. When you fail that blk_get_request(), you are idle, nothing >> is pending. > > I think we needn't worry about that, once a device is attached to > dm-rq, it can't be mounted any more, and usually user don't use the device > directly and by dm-mpath at the same time. Even if it doesn't happen for a normal dm setup, it is a case that needs to be handled. The request allocation is just one example of a wider scope resource that can be unavailable. If the driver returns NO_DEV_RESOURCE (or whatever name), it will be a possibility that the device itself is currently idle. -- Jens Axboe
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote: > On 1/19/18 9:26 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > >> On 1/19/18 9:05 AM, Ming Lei wrote: > >>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > On 1/19/18 8:40 AM, Ming Lei wrote: > Where does the dm STS_RESOURCE error usually come from - what's exact > resource are we running out of? > >>> > >>> It is from blk_get_request(underlying queue), see > >>> multipath_clone_and_map(). > >> > >> That's what I thought. So for a low queue depth underlying queue, it's > >> quite possible that this situation can happen. Two potential solutions > >> I see: > >> > >> 1) As described earlier in this thread, having a mechanism for being > >>notified when the scarce resource becomes available. It would not > >>be hard to tap into the existing sbitmap wait queue for that. > >> > >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >>allocation. I haven't read the dm code to know if this is a > >>possibility or not. > >> > >> I'd probably prefer #1. It's a classic case of trying to get the > >> request, and if it fails, add ourselves to the sbitmap tag wait > >> queue head, retry, and bail if that also fails. Connecting the > >> scarce resource and the consumer is the only way to really fix > >> this, without bogus arbitrary delays. > > > > Right, as I have replied to Bart, using mod_delayed_work_on() with > > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > > resource should fix this issue. > > It'll fix the forever stall, but it won't really fix it, as we'll slow > down the dm device by some random amount. > > A simple test case would be to have a null_blk device with a queue depth > of one, and dm on top of that. Start a fio job that runs two jobs: one > that does IO to the underlying device, and one that does IO to the dm > device. 
If the job on the dm device runs substantially slower than the > one to the underlying device, then the problem isn't really fixed. > >>> > >>> I remembered that I tried this test on scsi-debug & dm-mpath over > >>> scsi-debug, > >>> seems not observed this issue, could you explain a bit why IO over > >>> dm-mpath > >>> may be slower? Because both two IO contexts call same get_request(), and > >>> in theory dm-mpath should be a bit quicker since it uses direct issue for > >>> underlying queue, without io scheduler involved. > >> > >> Because if you lose the race for getting the request, you'll have some > >> arbitrary delay before trying again, potentially. Compared to the direct > > > > But the restart still works, one request is completed, then the queue > > is return immediately because we use mod_delayed_work_on(0), so looks > > no such issue. > > There are no pending requests for this case, nothing to restart the > queue. When you fail that blk_get_request(), you are idle, nothing > is pending. I think we needn't worry about that, once a device is attached to dm-rq, it can't be mounted any more, and usually user don't use the device directly and by dm-mpath at the same time. -- Ming
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 9:26 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: >> On 1/19/18 9:05 AM, Ming Lei wrote: >>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: On 1/19/18 8:40 AM, Ming Lei wrote: Where does the dm STS_RESOURCE error usually come from - what's exact resource are we running out of? >>> >>> It is from blk_get_request(underlying queue), see >>> multipath_clone_and_map(). >> >> That's what I thought. So for a low queue depth underlying queue, it's >> quite possible that this situation can happen. Two potential solutions >> I see: >> >> 1) As described earlier in this thread, having a mechanism for being >>notified when the scarce resource becomes available. It would not >>be hard to tap into the existing sbitmap wait queue for that. >> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >>allocation. I haven't read the dm code to know if this is a >>possibility or not. >> >> I'd probably prefer #1. It's a classic case of trying to get the >> request, and if it fails, add ourselves to the sbitmap tag wait >> queue head, retry, and bail if that also fails. Connecting the >> scarce resource and the consumer is the only way to really fix >> this, without bogus arbitrary delays. > > Right, as I have replied to Bart, using mod_delayed_work_on() with > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > resource should fix this issue. It'll fix the forever stall, but it won't really fix it, as we'll slow down the dm device by some random amount. A simple test case would be to have a null_blk device with a queue depth of one, and dm on top of that. Start a fio job that runs two jobs: one that does IO to the underlying device, and one that does IO to the dm device. If the job on the dm device runs substantially slower than the one to the underlying device, then the problem isn't really fixed. 
>>> >>> I remembered that I tried this test on scsi-debug & dm-mpath over >>> scsi-debug, >>> seems not observed this issue, could you explain a bit why IO over dm-mpath >>> may be slower? Because both two IO contexts call same get_request(), and >>> in theory dm-mpath should be a bit quicker since it uses direct issue for >>> underlying queue, without io scheduler involved. >> >> Because if you lose the race for getting the request, you'll have some >> arbitrary delay before trying again, potentially. Compared to the direct > > But the restart still works, one request is completed, then the queue > is return immediately because we use mod_delayed_work_on(0), so looks > no such issue. There are no pending requests for this case, nothing to restart the queue. When you fail that blk_get_request(), you are idle, nothing is pending. -- Jens Axboe
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote: > On 1/19/18 9:05 AM, Ming Lei wrote: > > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > >> On 1/19/18 8:40 AM, Ming Lei wrote: > >> Where does the dm STS_RESOURCE error usually come from - what's exact > >> resource are we running out of? > > > > It is from blk_get_request(underlying queue), see > > multipath_clone_and_map(). > > That's what I thought. So for a low queue depth underlying queue, it's > quite possible that this situation can happen. Two potential solutions > I see: > > 1) As described earlier in this thread, having a mechanism for being > notified when the scarce resource becomes available. It would not > be hard to tap into the existing sbitmap wait queue for that. > > 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > allocation. I haven't read the dm code to know if this is a > possibility or not. > > I'd probably prefer #1. It's a classic case of trying to get the > request, and if it fails, add ourselves to the sbitmap tag wait > queue head, retry, and bail if that also fails. Connecting the > scarce resource and the consumer is the only way to really fix > this, without bogus arbitrary delays. > >>> > >>> Right, as I have replied to Bart, using mod_delayed_work_on() with > >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > >>> resource should fix this issue. > >> > >> It'll fix the forever stall, but it won't really fix it, as we'll slow > >> down the dm device by some random amount. > >> > >> A simple test case would be to have a null_blk device with a queue depth > >> of one, and dm on top of that. Start a fio job that runs two jobs: one > >> that does IO to the underlying device, and one that does IO to the dm > >> device. If the job on the dm device runs substantially slower than the > >> one to the underlying device, then the problem isn't really fixed. 
> > > > I remembered that I tried this test on scsi-debug & dm-mpath over > > scsi-debug, > > seems not observed this issue, could you explain a bit why IO over dm-mpath > > may be slower? Because both two IO contexts call same get_request(), and > > in theory dm-mpath should be a bit quicker since it uses direct issue for > > underlying queue, without io scheduler involved. > > Because if you lose the race for getting the request, you'll have some > arbitrary delay before trying again, potentially. Compared to the direct But the restart still works: one request is completed, then the queue is rerun immediately because we use mod_delayed_work_on(0), so there looks to be no such issue. -- Ming
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 9:13 AM, Mike Snitzer wrote: > On Fri, Jan 19 2018 at 10:48am -0500, > Jens Axboewrote: > >> On 1/19/18 8:40 AM, Ming Lei wrote: >> Where does the dm STS_RESOURCE error usually come from - what's exact >> resource are we running out of? > > It is from blk_get_request(underlying queue), see > multipath_clone_and_map(). That's what I thought. So for a low queue depth underlying queue, it's quite possible that this situation can happen. Two potential solutions I see: 1) As described earlier in this thread, having a mechanism for being notified when the scarce resource becomes available. It would not be hard to tap into the existing sbitmap wait queue for that. 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource allocation. I haven't read the dm code to know if this is a possibility or not. > > Right, #2 is _not_ the way forward. Historically request-based DM used > its own mempool for requests, this was to be able to have some measure > of control and resiliency in the face of low memory conditions that > might be affecting the broader system. > > Then Christoph switched over to adding per-request-data; which ushered > in the use of blk_get_request using ATOMIC allocations. I like the > result of that line of development. But taking the next step of setting > BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this > dm-mpath.c code is common to old .request_fn and blk-mq, at least the > call to blk_get_request is). Ultimately dm-mpath like to avoid blocking > for a request because for this dm-mpath device we have multiple queues > to allocate from if need be (provided we have an active-active storage > network topology). If you can go to multiple devices, obviously it should not block on a single device. That's only true for the case where you can only go to one device, blocking at that point would probably be fine. Or if all your paths are busy, then blocking would also be OK. 
But it's a much larger change, and would entail changing more than just the actual call to blk_get_request(). >> A simple test case would be to have a null_blk device with a queue depth >> of one, and dm on top of that. Start a fio job that runs two jobs: one >> that does IO to the underlying device, and one that does IO to the dm >> device. If the job on the dm device runs substantially slower than the >> one to the underlying device, then the problem isn't really fixed. > > Not sure DM will allow the underlying device to be opened (due to > master/slave ownership that is part of loading a DM table)? There are many ways it could be setup - just partition the underlying device then, and have one partition be part of the dm setup and the other used directly. >> That said, I'm fine with ensuring that we make forward progress always >> first, and then we can come up with a proper solution to the issue. The >> forward progress guarantee will be needed for the more rare failure >> cases, like allocation failures. nvme needs that too, for instance, for >> the discard range struct allocation. > > Yeap, I'd be OK with that too. We'd be better for revisted this and > then have some time to develop the ultimate robust fix (#1, callback > from above). Yeah, we need the quick and dirty sooner, which just brings us back to what we had before, essentially. -- Jens Axboe
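The two-job test Jens describes could be expressed as a fio job file along these lines. This is a hypothetical sketch: the device paths and the dm device name are placeholders, and it assumes null_blk was loaded with a queue depth of one (e.g. `modprobe null_blk hw_queue_depth=1`) with a dm target stacked on one partition, per the partitioning suggestion above:

```ini
; hypothetical job file for the depth-1 contention test
; /dev/nullb0p1 is used directly; /dev/mapper/test-dm sits on /dev/nullb0p2
[global]
direct=1
rw=randread
bs=4k
ioengine=libaio
iodepth=16
runtime=30
time_based

[underlying]
filename=/dev/nullb0p1

[dm]
filename=/dev/mapper/test-dm
```

If the [dm] job reports substantially lower bandwidth than [underlying], the stall is merely being papered over by arbitrary delays rather than fixed.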
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 9:05 AM, Ming Lei wrote: > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: >> On 1/19/18 8:40 AM, Ming Lei wrote: >> Where does the dm STS_RESOURCE error usually come from - what's exact >> resource are we running out of? > > It is from blk_get_request(underlying queue), see > multipath_clone_and_map(). That's what I thought. So for a low queue depth underlying queue, it's quite possible that this situation can happen. Two potential solutions I see: 1) As described earlier in this thread, having a mechanism for being notified when the scarce resource becomes available. It would not be hard to tap into the existing sbitmap wait queue for that. 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource allocation. I haven't read the dm code to know if this is a possibility or not. I'd probably prefer #1. It's a classic case of trying to get the request, and if it fails, add ourselves to the sbitmap tag wait queue head, retry, and bail if that also fails. Connecting the scarce resource and the consumer is the only way to really fix this, without bogus arbitrary delays. >>> >>> Right, as I have replied to Bart, using mod_delayed_work_on() with >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce >>> resource should fix this issue. >> >> It'll fix the forever stall, but it won't really fix it, as we'll slow >> down the dm device by some random amount. >> >> A simple test case would be to have a null_blk device with a queue depth >> of one, and dm on top of that. Start a fio job that runs two jobs: one >> that does IO to the underlying device, and one that does IO to the dm >> device. If the job on the dm device runs substantially slower than the >> one to the underlying device, then the problem isn't really fixed. > > I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug, > seems not observed this issue, could you explain a bit why IO over dm-mpath > may be slower? 
Because both two IO contexts call same get_request(), and > in theory dm-mpath should be a bit quicker since it uses direct issue for > underlying queue, without io scheduler involved. Because if you lose the race for getting the request, you'll have some arbitrary delay before trying again, potentially. Compared to the direct user of the underlying device, who will simply sleep on the resource and get woken the instant it's available. -- Jens Axboe
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19 2018 at 10:48am -0500, Jens Axboe wrote: > On 1/19/18 8:40 AM, Ming Lei wrote: > Where does the dm STS_RESOURCE error usually come from - what's exact > resource are we running out of? > >>> > >>> It is from blk_get_request(underlying queue), see > >>> multipath_clone_and_map(). > >> > >> That's what I thought. So for a low queue depth underlying queue, it's > >> quite possible that this situation can happen. Two potential solutions > >> I see: > >> > >> 1) As described earlier in this thread, having a mechanism for being > >>notified when the scarce resource becomes available. It would not > >>be hard to tap into the existing sbitmap wait queue for that. > >> > >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >>allocation. I haven't read the dm code to know if this is a > >>possibility or not. Right, #2 is _not_ the way forward. Historically request-based DM used its own mempool for requests; this was to be able to have some measure of control and resiliency in the face of low memory conditions that might be affecting the broader system. Then Christoph switched over to adding per-request data, which ushered in the use of blk_get_request using ATOMIC allocations. I like the result of that line of development. But taking the next step of setting BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this dm-mpath.c code is common to old .request_fn and blk-mq, at least the call to blk_get_request is). Ultimately dm-mpath would like to avoid blocking for a request, because for this dm-mpath device we have multiple queues to allocate from if need be (provided we have an active-active storage network topology). > >> I'd probably prefer #1. It's a classic case of trying to get the > >> request, and if it fails, add ourselves to the sbitmap tag wait > >> queue head, retry, and bail if that also fails. Connecting the > >> scarce resource and the consumer is the only way to really fix > >> this, without bogus arbitrary delays. 
> > > > Right, as I have replied to Bart, using mod_delayed_work_on() with > > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > > resource should fix this issue. > > It'll fix the forever stall, but it won't really fix it, as we'll slow > down the dm device by some random amount. Agreed. > A simple test case would be to have a null_blk device with a queue depth > of one, and dm on top of that. Start a fio job that runs two jobs: one > that does IO to the underlying device, and one that does IO to the dm > device. If the job on the dm device runs substantially slower than the > one to the underlying device, then the problem isn't really fixed. Not sure DM will allow the underlying device to be opened (due to master/slave ownership that is part of loading a DM table)? > That said, I'm fine with ensuring that we make forward progress always > first, and then we can come up with a proper solution to the issue. The > forward progress guarantee will be needed for the more rare failure > cases, like allocation failures. nvme needs that too, for instance, for > the discard range struct allocation. Yeap, I'd be OK with that too. We'd be better for revisiting this and then having some time to develop the ultimate robust fix (#1, callback from above). Mike
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote: > On 1/19/18 8:40 AM, Ming Lei wrote: > Where does the dm STS_RESOURCE error usually come from - what's exact > resource are we running out of? > >>> > >>> It is from blk_get_request(underlying queue), see > >>> multipath_clone_and_map(). > >> > >> That's what I thought. So for a low queue depth underlying queue, it's > >> quite possible that this situation can happen. Two potential solutions > >> I see: > >> > >> 1) As described earlier in this thread, having a mechanism for being > >>notified when the scarce resource becomes available. It would not > >>be hard to tap into the existing sbitmap wait queue for that. > >> > >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource > >>allocation. I haven't read the dm code to know if this is a > >>possibility or not. > >> > >> I'd probably prefer #1. It's a classic case of trying to get the > >> request, and if it fails, add ourselves to the sbitmap tag wait > >> queue head, retry, and bail if that also fails. Connecting the > >> scarce resource and the consumer is the only way to really fix > >> this, without bogus arbitrary delays. > > > > Right, as I have replied to Bart, using mod_delayed_work_on() with > > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > > resource should fix this issue. > > It'll fix the forever stall, but it won't really fix it, as we'll slow > down the dm device by some random amount. > > A simple test case would be to have a null_blk device with a queue depth > of one, and dm on top of that. Start a fio job that runs two jobs: one > that does IO to the underlying device, and one that does IO to the dm > device. If the job on the dm device runs substantially slower than the > one to the underlying device, then the problem isn't really fixed. 
I remember that I tried this test on scsi-debug & dm-mpath over scsi-debug, and didn't observe this issue; could you explain a bit why IO over dm-mpath may be slower? Both IO contexts call the same get_request(), and in theory dm-mpath should be a bit quicker, since it uses direct issue for the underlying queue without an io scheduler involved. -- Ming
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, 2018-01-19 at 23:33 +0800, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 03:20:13PM +, Bart Van Assche wrote:
> > On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
> > > Please see queue_delayed_work_on(), hctx->run_work is shared by all
> > > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > > scheduling can make progress during the 100ms.
> >
> > How about addressing that as follows:
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f7515dd95a36..57f8379a476d 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
> > 		put_cpu();
> > 	}
> >
> > -	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> > -					 &hctx->run_work,
> > -					 msecs_to_jiffies(msecs));
> > +	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> > +				    &hctx->run_work,
> > +				    msecs_to_jiffies(msecs));
> > }
> >
> > void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
> >
> > Bart.
>
> Yes, this one together with Jen's suggestion with returning
> BLK_STS_NO_DEV_RESOURCE should fix this issue.
>
> Could you cook a fix for this issue? Otherwise I am happy to do
> that.

Hello Ming,

I will look further into this.

Bart.
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 8:40 AM, Ming Lei wrote: Where does the dm STS_RESOURCE error usually come from - what's exact resource are we running out of? >>> >>> It is from blk_get_request(underlying queue), see >>> multipath_clone_and_map(). >> >> That's what I thought. So for a low queue depth underlying queue, it's >> quite possible that this situation can happen. Two potential solutions >> I see: >> >> 1) As described earlier in this thread, having a mechanism for being >>notified when the scarce resource becomes available. It would not >>be hard to tap into the existing sbitmap wait queue for that. >> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource >>allocation. I haven't read the dm code to know if this is a >>possibility or not. >> >> I'd probably prefer #1. It's a classic case of trying to get the >> request, and if it fails, add ourselves to the sbitmap tag wait >> queue head, retry, and bail if that also fails. Connecting the >> scarce resource and the consumer is the only way to really fix >> this, without bogus arbitrary delays. > > Right, as I have replied to Bart, using mod_delayed_work_on() with > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce > resource should fix this issue. It'll fix the forever stall, but it won't really fix it, as we'll slow down the dm device by some random amount. A simple test case would be to have a null_blk device with a queue depth of one, and dm on top of that. Start a fio job that runs two jobs: one that does IO to the underlying device, and one that does IO to the dm device. If the job on the dm device runs substantially slower than the one to the underlying device, then the problem isn't really fixed. That said, I'm fine with ensuring that we make forward progress always first, and then we can come up with a proper solution to the issue. The forward progress guarantee will be needed for the more rare failure cases, like allocation failures. 
nvme needs that too, for instance, for the discard range struct allocation. -- Jens Axboe
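The try/wait/retry pattern of option #1 can be modeled in plain userspace C. This is only a sketch, not blk-mq code: a mutex-protected counter stands in for the underlying queue's depth-1 tag pool, and a condition variable stands in for the sbitmap tag wait queue. In the kernel, the waiter would be a wake callback that re-runs the hardware queue rather than a blocking wait:

```c
#include <pthread.h>
#include <stdbool.h>

/* Depth-1 "tag pool" standing in for the underlying queue's requests. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tag_waitq = PTHREAD_COND_INITIALIZER;
static int free_tags = 1; /* queue depth of one */

/* Non-blocking attempt; failure maps to returning BLK_STS_RESOURCE. */
static bool try_get_tag(void)
{
	bool ok = false;

	pthread_mutex_lock(&lock);
	if (free_tags > 0) {
		free_tags--;
		ok = true;
	}
	pthread_mutex_unlock(&lock);
	return ok;
}

/* Completion path: free the tag and wake a waiter the instant the
 * scarce resource becomes available, with no arbitrary delay. */
static void put_tag(void)
{
	pthread_mutex_lock(&lock);
	free_tags++;
	pthread_cond_signal(&tag_waitq);
	pthread_mutex_unlock(&lock);
}

/* Try once; on failure, park on the wait queue and retry when woken.
 * Here we simply block; the kernel variant would instead register a
 * wake callback and bail, letting the wakeup re-run the queue. */
static bool get_tag_or_wait(void)
{
	if (try_get_tag())
		return true;

	pthread_mutex_lock(&lock);
	while (free_tags == 0)
		pthread_cond_wait(&tag_waitq, &lock);
	free_tags--;
	pthread_mutex_unlock(&lock);
	return true;
}
```

The point of the design is the direct edge from `put_tag()` to the waiter: the consumer is woken exactly when the resource frees up, instead of polling again after a random delay.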
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 08:24:06AM -0700, Jens Axboe wrote: > On 1/19/18 12:26 AM, Ming Lei wrote: > > On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote: > >> On 1/18/18 7:32 PM, Ming Lei wrote: > >>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: > On 1/18/18 11:47 AM, Bart Van Assche wrote: > >> This is all very tiresome. > > > > Yes, this is tiresome. It is very annoying to me that others keep > > introducing so many regressions in such important parts of the kernel. > > It is also annoying to me that I get blamed if I report a regression > > instead of seeing that the regression gets fixed. > > I agree, it sucks that any change there introduces the regression. I'm > fine with doing the delay insert again until a new patch is proven to be > better. > >>> > >>> That way is still buggy as I explained, since rerun queue before adding > >>> request to hctx->dispatch_list isn't correct. Who can make sure the > >>> request > >>> is visible when __blk_mq_run_hw_queue() is called? > >> > >> That race basically doesn't exist for a 10ms gap. > >> > >>> Not mention this way will cause performance regression again. > >> > >> How so? It's _exactly_ the same as what you are proposing, except mine > >> will potentially run the queue when it need not do so. But given that > >> these are random 10ms queue kicks because we are screwed, it should not > >> matter. The key point is that it only should be if we have NO better > >> options. If it's a frequently occurring event that we have to return > >> BLK_STS_RESOURCE, then we need to get a way to register an event for > >> when that condition clears. That event will then kick the necessary > >> queue(s). > > > > Please see queue_delayed_work_on(), hctx->run_work is shared by all > > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new > > scheduling can make progress during the 100ms. > > That's a bug, plain and simple. 
If someone does "run this queue in > 100ms" and someone else comes in and says "run this queue now", the > correct outcome is running this queue now. > > From the original topic of this email, we have conditions that can cause > the driver to not be able to submit an IO. A set of those conditions can > only happen if IO is in flight, and those cases we have covered just > fine. Another set can potentially trigger without IO being in flight. > These are cases where a non-device resource is unavailable at the time > of submission. This might be iommu running out of space, for instance, > or it might be a memory allocation of some sort. For these cases, we > don't get any notification when the shortage clears. All we can do is > ensure that we restart operations at some point in the future. We're SOL > at that point, but we have to ensure that we make forward progress. > >>> > >>> Right, it is a generic issue, not DM-specific one, almost all drivers > >>> call kmalloc(GFP_ATOMIC) in IO path. > >> > >> GFP_ATOMIC basically never fails, unless we are out of memory. The > > > > I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure might be > > possible, and it is mentioned[1] there is such code in mm allocation > > path, also OOM can happen too. > > > > if (some randomly generated condition) && (request is atomic) > > return NULL; > > > > [1] https://lwn.net/Articles/276731/ > > That article is 10 years old. Once you run large scale production, you > see what the real failures are. Fact is, for zero order allocation, if > the atomic alloc fails the shit has really hit the fan. In that case, a > delay of 10ms is not your main issue. It's a total red herring when you > compare to the frequency of what Bart is seeing. It's noise, and > irrelevant here. For an atomic zero order allocation failure, doing a > short random sleep is perfectly fine. > > >> exception is higher order allocations. 
If a driver has a higher order > >> atomic allocation in its IO path, the device driver writer needs to be > >> taken out behind the barn and shot. Simple as that. It will NEVER work > >> well in a production environment. Witness the disaster that so many NIC > >> driver writers have learned. > >> > >> This is NOT the case we care about here. It's resources that are more > >> readily depleted because other devices are using them. If it's a high > >> frequency or generally occurring event, then we simply must have a > >> callback to restart the queue from that. The condition then becomes > >> identical to device private starvation, the only difference being from > >> where we restart the queue. > >> > >>> IMO, there is enough time for figuring out a generic solution before > >>> 4.16 release. > >> > >> I would hope so, but the proposed solutions have not filled me with > >> a lot of confidence in the end result so far. > >> > That last set of conditions better not be a a common occurence, since > performance is down
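The zero-order versus higher-order distinction above can be made concrete. This is an illustrative userspace sketch, not kernel code: the "order" of an allocation is the power-of-two number of pages the page allocator must find as one physically contiguous block, and it is the higher orders that get fragile under memory pressure:

```c
#include <stddef.h>

/* Assumed 4K pages for illustration. */
#define PAGE_SIZE_BYTES 4096UL

/* Smallest order such that (1 << order) pages cover `size` bytes.
 * Order 0 (a single page) is what GFP_ATOMIC almost always satisfies;
 * each higher order demands contiguity that fragmentation can deny. */
static unsigned int alloc_order(size_t size)
{
	unsigned long pages = (size + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}
```

A small per-request struct is order 0 and is effectively always available atomically; a half-megabyte contiguous buffer is order 7, which is exactly the kind of atomic IO-path allocation being warned against here.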
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On Fri, Jan 19, 2018 at 03:20:13PM +, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
> > Please see queue_delayed_work_on(), hctx->run_work is shared by all
> > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > scheduling can make progress during the 100ms.
>
> How about addressing that as follows:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7515dd95a36..57f8379a476d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
> 		put_cpu();
> 	}
>
> -	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> -					 &hctx->run_work,
> -					 msecs_to_jiffies(msecs));
> +	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +				    &hctx->run_work,
> +				    msecs_to_jiffies(msecs));
> }
>
> void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>
> Bart.

Yes, this one together with Jens' suggestion of returning
BLK_STS_NO_DEV_RESOURCE should fix this issue.

Could you cook a fix for this issue? Otherwise I am happy to do
that.

-- 
Ming
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 8:20 AM, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
>> Please see queue_delayed_work_on(), hctx->run_work is shared by all
>> scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new
>> scheduling can make progress during the 100ms.
>
> How about addressing that as follows:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7515dd95a36..57f8379a476d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
> 		put_cpu();
> 	}
>
> -	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> -					 &hctx->run_work,
> -					 msecs_to_jiffies(msecs));
> +	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +				    &hctx->run_work,
> +				    msecs_to_jiffies(msecs));
> }

Exactly. That's why I said it was just a bug in my previous email; not
honoring a newer run is just stupid. Only other thing you have to be
careful with here is the STOPPED bit.

-- 
Jens Axboe
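The semantic difference the one-line patch relies on can be shown with a toy userspace model (a sketch, not the workqueue implementation): for an already-pending delayed work item, queue_delayed_work() is a no-op, so an earlier 100ms kick swallows a later "run now" request, while mod_delayed_work() rearms the timer to the most recent delay:

```c
#include <stdbool.h>

/* Toy model of a delayed work item: just a pending flag and an expiry. */
struct delayed_work_model {
	bool pending;
	unsigned long expires_ms;
};

/* queue_delayed_work() semantics: if already pending, do nothing,
 * so a newer (possibly shorter) delay is silently lost. */
static bool queue_delayed(struct delayed_work_model *w, unsigned long delay_ms)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires_ms = delay_ms;
	return true;
}

/* mod_delayed_work() semantics: always rearm the timer with the new
 * delay, whether or not the work was already pending. */
static bool mod_delayed(struct delayed_work_model *w, unsigned long delay_ms)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires_ms = delay_ms;
	return !was_pending;
}
```

With the schedule variant, a blk_mq_delay_run_hw_queue(hctx, 100) followed by a "run now" leaves the queue idle for the full 100ms; with the mod variant, the newer request wins.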
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 1/19/18 12:26 AM, Ming Lei wrote: > On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote: >> On 1/18/18 7:32 PM, Ming Lei wrote: >>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote: On 1/18/18 11:47 AM, Bart Van Assche wrote: >> This is all very tiresome. > > Yes, this is tiresome. It is very annoying to me that others keep > introducing so many regressions in such important parts of the kernel. > It is also annoying to me that I get blamed if I report a regression > instead of seeing that the regression gets fixed. I agree, it sucks that any change there introduces the regression. I'm fine with doing the delay insert again until a new patch is proven to be better. >>> >>> That way is still buggy as I explained, since rerun queue before adding >>> request to hctx->dispatch_list isn't correct. Who can make sure the request >>> is visible when __blk_mq_run_hw_queue() is called? >> >> That race basically doesn't exist for a 10ms gap. >> >>> Not mention this way will cause performance regression again. >> >> How so? It's _exactly_ the same as what you are proposing, except mine >> will potentially run the queue when it need not do so. But given that >> these are random 10ms queue kicks because we are screwed, it should not >> matter. The key point is that it only should be if we have NO better >> options. If it's a frequently occurring event that we have to return >> BLK_STS_RESOURCE, then we need to get a way to register an event for >> when that condition clears. That event will then kick the necessary >> queue(s). > > Please see queue_delayed_work_on(), hctx->run_work is shared by all > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new > scheduling can make progress during the 100ms. That's a bug, plain and simple. If someone does "run this queue in 100ms" and someone else comes in and says "run this queue now", the correct outcome is running this queue now. 
From the original topic of this email, we have conditions that can cause the driver to not be able to submit an IO. A set of those conditions can only happen if IO is in flight, and those cases we have covered just fine. Another set can potentially trigger without IO being in flight. These are cases where a non-device resource is unavailable at the time of submission. This might be iommu running out of space, for instance, or it might be a memory allocation of some sort. For these cases, we don't get any notification when the shortage clears. All we can do is ensure that we restart operations at some point in the future. We're SOL at that point, but we have to ensure that we make forward progress. >>> >>> Right, it is a generic issue, not DM-specific one, almost all drivers >>> call kmalloc(GFP_ATOMIC) in IO path. >> >> GFP_ATOMIC basically never fails, unless we are out of memory. The > > I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure might be > possible, and it is mentioned[1] there is such code in mm allocation > path, also OOM can happen too. > > if (some randomly generated condition) && (request is atomic) > return NULL; > > [1] https://lwn.net/Articles/276731/ That article is 10 years old. Once you run large scale production, you see what the real failures are. Fact is, for zero order allocation, if the atomic alloc fails the shit has really hit the fan. In that case, a delay of 10ms is not your main issue. It's a total red herring when you compare to the frequency of what Bart is seeing. It's noise, and irrelevant here. For an atomic zero order allocation failure, doing a short random sleep is perfectly fine. >> exception is higher order allocations. If a driver has a higher order >> atomic allocation in its IO path, the device driver writer needs to be >> taken out behind the barn and shot. Simple as that. It will NEVER work >> well in a production environment. Witness the disaster that so many NIC >> driver writers have learned. 
>> >> This is NOT the case we care about here. It's resources that are more >> readily depleted because other devices are using them. If it's a high >> frequency or generally occurring event, then we simply must have a >> callback to restart the queue from that. The condition then becomes >> identical to device private starvation, the only difference being from >> where we restart the queue. >> >>> IMO, there is enough time for figuring out a generic solution before >>> 4.16 release. >> >> I would hope so, but the proposed solutions have not filled me with >> a lot of confidence in the end result so far. >> That last set of conditions better not be a common occurrence, since performance is down the toilet at that point. I don't want to introduce hot path code to rectify it. Have the driver return if that happens in a way that is DIFFERENT from needing a normal restart. The driver knows if this is a resource that will become available