Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 4:52 PM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:37 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
 On 1/19/18 9:26 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:05 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
 On 1/19/18 8:40 AM, Ming Lei wrote:
 Where does the dm STS_RESOURCE error usually come from - what's 
 exact
 resource are we running out of?
>>>
>>> It is from blk_get_request(underlying queue), see
>>> multipath_clone_and_map().
>>
>> That's what I thought. So for a low queue depth underlying queue, 
>> it's
>> quite possible that this situation can happen. Two potential 
>> solutions
>> I see:
>>
>> 1) As described earlier in this thread, having a mechanism for being
>>notified when the scarce resource becomes available. It would not
>>be hard to tap into the existing sbitmap wait queue for that.
>>
>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>allocation. I haven't read the dm code to know if this is a
>>possibility or not.
>>
>> I'd probably prefer #1. It's a classic case of trying to get the
>> request, and if it fails, add ourselves to the sbitmap tag wait
>> queue head, retry, and bail if that also fails. Connecting the
>> scarce resource and the consumer is the only way to really fix
>> this, without bogus arbitrary delays.
>
> Right, as I have replied to Bart, using mod_delayed_work_on() with
> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> resource should fix this issue.

 It'll fix the forever stall, but it won't really fix it, as we'll slow
 down the dm device by some random amount.

 A simple test case would be to have a null_blk device with a queue 
 depth
 of one, and dm on top of that. Start a fio job that runs two jobs: one
 that does IO to the underlying device, and one that does IO to the dm
 device. If the job on the dm device runs substantially slower than the
 one to the underlying device, then the problem isn't really fixed.
>>>
>>> I remembered that I tried this test on scsi-debug & dm-mpath over 
>>> scsi-debug,
>>> seems not observed this issue, could you explain a bit why IO over 
>>> dm-mpath
>>> may be slower? Because both two IO contexts call same get_request(), and
>>> in theory dm-mpath should be a bit quicker since it uses direct issue 
>>> for
>>> underlying queue, without io scheduler involved.
>>
>> Because if you lose the race for getting the request, you'll have some
>> arbitrary delay before trying again, potentially. Compared to the direct
>
> But the restart still works, one request is completed, then the queue
> is return immediately because we use mod_delayed_work_on(0), so looks
> no such issue.

 There are no pending requests for this case, nothing to restart the
 queue. When you fail that blk_get_request(), you are idle, nothing
 is pending.
>>>
>>> I think we needn't worry about that, once a device is attached to
>>> dm-rq, it can't be mounted any more, and usually user don't use the device
>>> directly and by dm-mpath at the same time.
>>
>> Here's an example of that, using my current block tree (merged into
>> master).  The setup is dm-mpath on top of null_blk, the latter having
>> just a single request. Both are mq devices.
>>
>> Fio direct 4k random reads on dm_mq: ~250K iops
>>
>> Start dd on underlying device (or partition on same device), just doing
>> sequential reads.
>>
>> Fio direct 4k random reads on dm_mq with dd running: 9 iops
>>
>> No schedulers involved.
>>
>> https://i.imgur.com/WTDnnwE.gif
> 
> If null_blk's timer mode is used with a bit delay introduced, I guess
> the effect from direct access to underlying queue shouldn't be so
> serious. But it still won't be good as direct access.

Doesn't matter if it's used at the default of 10usec completion
latency, or inline complete. Same result, I ran both.

> Another way may be to introduce a variants blk_get_request(), such as
> blk_get_request_with_notify(), then pass the current dm-rq's hctx to
> it, and use the tag's waitqueue to handle that. But the change can be
> a bit big.

Yes, that's exactly the solution I suggested both yesterday and today.

-- 
Jens Axboe
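
For illustration, a minimal sketch of the approach discussed above: waiting on the
underlying queue's tag space instead of requeueing after an arbitrary delay. The
helper name blk_get_request_with_notify() is taken from the proposal in this
thread; underlying_tag_waitqueue() and the exact waitqueue plumbing are
hypothetical and only meant to show the shape of the idea, not existing API:

/*
 * Hypothetical sketch: try to get a request on the underlying queue and,
 * on failure, hook the dm-rq hctx into that queue's tag waitqueue so a
 * completion there re-runs the dm queue.
 */
static struct request *blk_get_request_with_notify(struct request_queue *q,
					unsigned int op,
					struct blk_mq_hw_ctx *dm_hctx)
{
	struct request *rq;

	rq = blk_get_request(q, op, GFP_ATOMIC);
	if (!IS_ERR(rq))
		return rq;

	/* Register on the scarce resource's waitqueue... */
	add_wait_queue(underlying_tag_waitqueue(q), &dm_hctx->dispatch_wait);

	/* ...retry once, in case a tag was freed meanwhile... */
	rq = blk_get_request(q, op, GFP_ATOMIC);
	if (!IS_ERR(rq))
		remove_wait_queue(underlying_tag_waitqueue(q),
				  &dm_hctx->dispatch_wait);

	/* ...and bail if that also fails: the wakeup from the underlying
	 * queue's tag-free path then re-runs the dm hw queue. */
	return rq;
}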



Re: [PATCH v2] bdi: show error log when fail to create bdi debugfs entry

2018-01-19 Thread weiping zhang
2018-01-20 3:54 GMT+08:00 Jens Axboe :
> On 1/19/18 10:36 AM, weiping zhang wrote:
>> 2018-01-11 0:36 GMT+08:00 weiping zhang :
>>> bdi debugfs dir/file may create fail, add error log here.
>>>
>>> Signed-off-by: weiping zhang 
>>> ---
>>> V1->V2:
>>> fix indentation and make log message more clear
>>>
>>>  mm/backing-dev.c | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>>> index b5f940c..0a49665 100644
>>> --- a/mm/backing-dev.c
>>> +++ b/mm/backing-dev.c
>>> @@ -885,7 +885,9 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
>>> char *fmt, va_list args)
>>> cgwb_bdi_register(bdi);
>>> bdi->dev = dev;
>>>
>>> -   bdi_debug_register(bdi, dev_name(dev));
>>> +   if (bdi_debug_register(bdi, dev_name(dev)))
>>> +   pr_warn("blkdev %s: creation of bdi debugfs entries 
>>> failed.\n",
>>> +   dev_name(dev));
>>> set_bit(WB_registered, &bdi->wb.state);
>>>
>>> spin_lock_bh(&bdi_lock);
>>> --
>>
>> Hi Jens,
>>
>> mdadm has no permission to create a debugfs entry if SELinux is enabled on
>> Fedora and CentOS,
>> and we already have 6d0e4827b72 (Revert "bdi: add error handle for
>> bdi_debug_register"), that is to say
>> bdi debugfs is not a key component of the block device; this patch just
>> adds a warning log.
>
> Have we fixed the case where we know it will trigger?

The reason is that mdadm has no permission to create a dir/file under
/sys/kernel/debug/. I think we can solve it in two possible ways.

1. Add a proper SELinux policy that allows mdadm to create
/sys/kernel/debug/bdi/xxx. Not every user will add this allowance, though,
so the kernel should show a warning for this case.

2. Split mdadm into two parts: first, the user process mdadm triggers a
kworker and waits for an event; second, the kworker creates the gendisk
and then wakes up the event. But this is more of a hack, and it may break
the SELinux mechanism, so I gave up on this way.
https://marc.info/?l=linux-mm=151456540928231=2

> --
> Jens Axboe
>


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 09:23:35AM -0700, Jens Axboe wrote:
> On 1/19/18 9:13 AM, Mike Snitzer wrote:
> > On Fri, Jan 19 2018 at 10:48am -0500,
> > Jens Axboe  wrote:
> > 
> >> On 1/19/18 8:40 AM, Ming Lei wrote:
> >> Where does the dm STS_RESOURCE error usually come from - what's exact
> >> resource are we running out of?
> >
> > It is from blk_get_request(underlying queue), see
> > multipath_clone_and_map().
> 
>  That's what I thought. So for a low queue depth underlying queue, it's
>  quite possible that this situation can happen. Two potential solutions
>  I see:
> 
>  1) As described earlier in this thread, having a mechanism for being
> notified when the scarce resource becomes available. It would not
> be hard to tap into the existing sbitmap wait queue for that.
> 
>  2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> allocation. I haven't read the dm code to know if this is a
> possibility or not.
> > 
> > Right, #2 is _not_ the way forward.  Historically request-based DM used
> > its own mempool for requests, this was to be able to have some measure
> > of control and resiliency in the face of low memory conditions that
> > might be affecting the broader system.
> > 
> > Then Christoph switched over to adding per-request-data; which ushered
> > in the use of blk_get_request using ATOMIC allocations.  I like the
> > result of that line of development.  But taking the next step of setting
> > BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
> > dm-mpath.c code is common to old .request_fn and blk-mq, at least the
> > call to blk_get_request is).  Ultimately dm-mpath like to avoid blocking
> > for a request because for this dm-mpath device we have multiple queues
> > to allocate from if need be (provided we have an active-active storage
> > network topology).
> 
> If you can go to multiple devices, obviously it should not block on a
> single device. That's only true for the case where you can only go to
> one device, blocking at that point would probably be fine. Or if all
> your paths are busy, then blocking would also be OK.

Introducing one extra blocking point will hurt AIO performance, where
there are usually far fewer jobs/processes submitting IO.

-- 
Ming


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote:
> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>  On 1/19/18 9:05 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> >> On 1/19/18 8:40 AM, Ming Lei wrote:
> >> Where does the dm STS_RESOURCE error usually come from - what's 
> >> exact
> >> resource are we running out of?
> >
> > It is from blk_get_request(underlying queue), see
> > multipath_clone_and_map().
> 
>  That's what I thought. So for a low queue depth underlying queue, 
>  it's
>  quite possible that this situation can happen. Two potential 
>  solutions
>  I see:
> 
>  1) As described earlier in this thread, having a mechanism for being
> notified when the scarce resource becomes available. It would not
> be hard to tap into the existing sbitmap wait queue for that.
> 
>  2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> allocation. I haven't read the dm code to know if this is a
> possibility or not.
> 
>  I'd probably prefer #1. It's a classic case of trying to get the
>  request, and if it fails, add ourselves to the sbitmap tag wait
>  queue head, retry, and bail if that also fails. Connecting the
>  scarce resource and the consumer is the only way to really fix
>  this, without bogus arbitrary delays.
> >>>
> >>> Right, as I have replied to Bart, using mod_delayed_work_on() with
> >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> >>> resource should fix this issue.
> >>
> >> It'll fix the forever stall, but it won't really fix it, as we'll slow
> >> down the dm device by some random amount.
> >>
> >> A simple test case would be to have a null_blk device with a queue 
> >> depth
> >> of one, and dm on top of that. Start a fio job that runs two jobs: one
> >> that does IO to the underlying device, and one that does IO to the dm
> >> device. If the job on the dm device runs substantially slower than the
> >> one to the underlying device, then the problem isn't really fixed.
> >
> > I remembered that I tried this test on scsi-debug & dm-mpath over 
> > scsi-debug,
> > seems not observed this issue, could you explain a bit why IO over 
> > dm-mpath
> > may be slower? Because both two IO contexts call same get_request(), and
> > in theory dm-mpath should be a bit quicker since it uses direct issue 
> > for
> > underlying queue, without io scheduler involved.
> 
>  Because if you lose the race for getting the request, you'll have some
>  arbitrary delay before trying again, potentially. Compared to the direct
> >>>
> >>> But the restart still works, one request is completed, then the queue
> >>> is return immediately because we use mod_delayed_work_on(0), so looks
> >>> no such issue.
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> > 
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
> 
> Here's an example of that, using my current block tree (merged into
> master).  The setup is dm-mpath on top of null_blk, the latter having
> just a single request. Both are mq devices.
> 
> Fio direct 4k random reads on dm_mq: ~250K iops
> 
> Start dd on underlying device (or partition on same device), just doing
> sequential reads.
> 
> Fio direct 4k random reads on dm_mq with dd running: 9 iops
> 
> No schedulers involved.
> 
> https://i.imgur.com/WTDnnwE.gif

If null_blk's timer mode is used with a bit of delay introduced, I guess
the effect of direct access to the underlying queue shouldn't be so
serious. But it still won't be as good as direct access.

Another way may be to introduce a variant of blk_get_request(), such as
blk_get_request_with_notify(), then pass the current dm-rq's hctx to
it, and use the tag's waitqueue to handle that. But the change could be
a bit big.

-- 
Ming


Re: [PATCH] blk-throttle: use queue_is_rq_based

2018-01-19 Thread weiping zhang
2018-01-20 3:58 GMT+08:00 Jens Axboe :
> On 1/19/18 10:40 AM, weiping zhang wrote:
>> use queue_is_rq_based instead of open code.
>>
>> Signed-off-by: weiping zhang 
>> ---
>>  block/blk-throttle.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index 96ad326..457e985 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -2456,7 +2456,7 @@ void blk_throtl_register_queue(struct request_queue *q)
>>   td->throtl_slice = DFL_THROTL_SLICE_HD;
>>  #endif
>>
>> - td->track_bio_latency = !q->mq_ops && !q->request_fn;
>> + td->track_bio_latency = !(queue_is_rq_based(q));
>
> Kill the extra parenthesis here.

Oh, I'll kill it in V2.
> --
> Jens Axboe
>


[PATCH v2] blk-throttle: use queue_is_rq_based

2018-01-19 Thread weiping zhang
Use queue_is_rq_based() instead of open-coding the check.

Signed-off-by: weiping zhang 
---
 block/blk-throttle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e136f5e..c475f0f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2489,7 +2489,7 @@ void blk_throtl_register_queue(struct request_queue *q)
td->throtl_slice = DFL_THROTL_SLICE_HD;
 #endif
 
-   td->track_bio_latency = !q->mq_ops && !q->request_fn;
+   td->track_bio_latency = !queue_is_rq_based(q);
if (!td->track_bio_latency)
blk_stat_enable_accounting(q);
 }
-- 
2.9.4
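
For reference, queue_is_rq_based() is simply a helper for the test being removed
here; roughly (as defined in include/linux/blkdev.h around this time):

static inline bool queue_is_rq_based(struct request_queue *q)
{
	/* true for both legacy request_fn queues and blk-mq queues */
	return q->request_fn || q->mq_ops;
}

so !queue_is_rq_based(q) is, by De Morgan, the same condition as the removed
!q->mq_ops && !q->request_fn: bio latency tracking stays enabled only for
bio-based queues.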



Re: [PATCH] blk-throttle: use queue_is_rq_based

2018-01-19 Thread Jens Axboe
On 1/19/18 10:40 AM, weiping zhang wrote:
> use queue_is_rq_based instead of open code.
> 
> Signed-off-by: weiping zhang 
> ---
>  block/blk-throttle.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 96ad326..457e985 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -2456,7 +2456,7 @@ void blk_throtl_register_queue(struct request_queue *q)
>   td->throtl_slice = DFL_THROTL_SLICE_HD;
>  #endif
>  
> - td->track_bio_latency = !q->mq_ops && !q->request_fn;
> + td->track_bio_latency = !(queue_is_rq_based(q));

Kill the extra parenthesis here.

-- 
Jens Axboe



Re: [PATCH v2] bdi: show error log when fail to create bdi debugfs entry

2018-01-19 Thread Jens Axboe
On 1/19/18 10:36 AM, weiping zhang wrote:
> 2018-01-11 0:36 GMT+08:00 weiping zhang :
>> bdi debugfs dir/file may create fail, add error log here.
>>
>> Signed-off-by: weiping zhang 
>> ---
>> V1->V2:
>> fix indentation and make log message more clear
>>
>>  mm/backing-dev.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index b5f940c..0a49665 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -885,7 +885,9 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
>> char *fmt, va_list args)
>> cgwb_bdi_register(bdi);
>> bdi->dev = dev;
>>
>> -   bdi_debug_register(bdi, dev_name(dev));
>> +   if (bdi_debug_register(bdi, dev_name(dev)))
>> +   pr_warn("blkdev %s: creation of bdi debugfs entries 
>> failed.\n",
>> +   dev_name(dev));
>> set_bit(WB_registered, &bdi->wb.state);
>>
>> spin_lock_bh(&bdi_lock);
>> --
> 
> Hi Jens,
> 
> mdadm has no permission to create a debugfs entry if SELinux is enabled on
> Fedora and CentOS,
> and we already have 6d0e4827b72 (Revert "bdi: add error handle for
> bdi_debug_register"), that is to say
> bdi debugfs is not a key component of the block device; this patch just
> adds a warning log.

Have we fixed the case where we know it will trigger?

-- 
Jens Axboe



Re: [PATCH 0/3] Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays

2018-01-19 Thread Jens Axboe
On 1/19/18 9:58 AM, Bart Van Assche wrote:
> Hello Jens,
> 
> Earlier today it was agreed that a blk_mq_delay_run_hw_queue() call followed
> by a blk_mq_run_hw_queue() call should result in an immediate queue run. Hence
> this patch series. Please consider this patch series for kernel v4.16.

Thanks Bart, applied.

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Bart Van Assche
On Fri, 2018-01-19 at 15:34 +0800, Ming Lei wrote:
> Could you explain a bit when SCSI target replies with BUSY very often?
> 
> Inside initiator, we have limited the max per-LUN requests and per-host
> requests already before calling .queue_rq().

That's correct. However, when a SCSI initiator and target system are
communicating with each other there is no guarantee that initiator and target
queue depth have been tuned properly. If the initiator queue depth and the
number of requests that can be in flight according to the network protocol
are both larger than the target queue depth and if the target system uses
relatively slow storage (e.g. hard disks) then it can happen that the target
replies with BUSY often.

The Linux iSCSI initiator limits MaxOutstandingR2T (the number of requests
an initiator may send without having received an answer from the target) to
one so I don't think this can happen when using iSCSI/TCP.

With the SRP initiator however the maximum requests that can be in flight
between initiator and target depends on the number of credits that were
negotiated during login between initiator and target. Some time ago I modified
the SRP initiator such that it limits the initiator queue depth to the number
of SRP credits minus one (for task management). That resulted in a performance
improvement due to fewer BUSY conditions at the initiator side (see also commit
7ade400aba9a ("IB/srp: Reduce number of BUSY conditions")). Another patch for
the SCST SRP target driver limited the number of SRP credits to the queue depth
of the block device at the target side. I'm referring to the following code:
ch->rq_size = min(MAX_SRPT_RQ_SIZE, scst_get_max_lun_commands(NULL, 0)) (I have
not yet had the time to port this change to LIO).

Without such tuning across initiator and target it can happen often that the
target system sends the reply "BUSY" back to the initiator. I think that's why
there is code in the SCSI core to automatically adjust the initiator queue
depth if the "BUSY" condition is encountered frequently. See also
scsi_track_queue_full().

Bart.

Re: [PATCH] lib/scatterlist: Fix chaining support in sgl_alloc_order()

2018-01-19 Thread Jens Axboe
On 1/19/18 12:00 PM, Bart Van Assche wrote:
> This patch avoids that workloads with large block sizes (megabytes)
> can trigger the following call stack with the ib_srpt driver (that
> driver is the only driver that chains scatterlists allocated by
> sgl_alloc_order()):
> 
> BUG: Bad page state in process kworker/0:1H  pfn:2423a78
> page:fb03d08e9e00 count:-3 mapcount:0 mapping:  (null) index:0x0
> flags: 0x57c000()
> raw: 0057c000   fffd
> raw: dead0100 dead0200  
> page dumped because: nonzero _count
> CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G  I  
> 4.15.0-rc7.bart+ #1
> Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> Call Trace:
>  dump_stack+0x5c/0x83
>  bad_page+0xf5/0x10f
>  get_page_from_freelist+0xa46/0x11b0
>  __alloc_pages_nodemask+0x103/0x290
>  sgl_alloc_order+0x101/0x180
>  target_alloc_sgl+0x2c/0x40 [target_core_mod]
>  srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt]
>  srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt]
>  __ib_process_cq+0x55/0xa0 [ib_core]
>  ib_cq_poll_work+0x1b/0x60 [ib_core]
>  process_one_work+0x141/0x340
>  worker_thread+0x47/0x3e0
>  kthread+0xf5/0x130
>  ret_from_fork+0x1f/0x30

Applied, thanks.

-- 
Jens Axboe



Re: [GIT PULL] additional nvme fixes and cleanups for Linux 4.16

2018-01-19 Thread Jens Axboe
On 1/19/18 12:02 PM, Christoph Hellwig wrote:
> The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993:
> 
>   blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() 
> (2018-01-14 10:46:24 -0700)
> 
> are available in the git repository at:
> 
>   git://git.infradead.org/nvme.git nvme-4.16
> 
> for you to fetch changes up to 88de4598bca84e27b261685c06fff816b8d932a1:
> 
>   nvme-pci: clean up SMBSZ bit definitions (2018-01-17 17:55:14 +0100)
> 
> 
> Christoph Hellwig (2):
>   nvme-pci: clean up CMB initialization
>   nvme-pci: clean up SMBSZ bit definitions
> 
> James Smart (2):
>   nvme-fc: fix rogue admin cmds stalling teardown
>   nvme-fc: correct hang in nvme_ns_remove()
> 
> Minwoo Im (1):
>   nvme: fix comment typos in nvme_create_io_queues
> 
> Roland Dreier (1):
>   nvme-fabrics: fix memory leak when parsing host ID option
> 
> Roy Shterman (1):
>   nvme: host delete_work and reset_work on separate workqueues
> 
> Sagi Grimberg (3):
>   nvme-pci: serialize pci resets
>   nvme-pci: allocate device queues storage space at probe
>   nvmet: release a ns reference in nvmet_req_uninit if needed
> 
>  drivers/nvme/host/core.c|  47 ++--
>  drivers/nvme/host/fabrics.c |   4 +-
>  drivers/nvme/host/fc.c  |   6 ++
>  drivers/nvme/host/nvme.h|   3 +
>  drivers/nvme/host/pci.c | 133 
> 
>  drivers/nvme/host/rdma.c|   2 +-
>  drivers/nvme/target/core.c  |   3 +
>  drivers/nvme/target/loop.c  |   2 +-
>  include/linux/nvme.h|  22 +---
>  9 files changed, 132 insertions(+), 90 deletions(-)

Pulled, thanks.

-- 
Jens Axboe



Re: [PATCH] lib/scatterlist: Fix chaining support in sgl_alloc_order()

2018-01-19 Thread Laurence Oberman
On Fri, 2018-01-19 at 11:00 -0800, Bart Van Assche wrote:
> This patch avoids that workloads with large block sizes (megabytes)
> can trigger the following call stack with the ib_srpt driver (that
> driver is the only driver that chains scatterlists allocated by
> sgl_alloc_order()):
> 
> BUG: Bad page state in process kworker/0:1H  pfn:2423a78
> page:fb03d08e9e00 count:-3 mapcount:0 mapping:  (null)
> index:0x0
> flags: 0x57c000()
> raw: 0057c000  
> fffd
> raw: dead0100 dead0200 
> 
> page dumped because: nonzero _count
> CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G  I  4.15.0-
> rc7.bart+ #1
> Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> Call Trace:
>  dump_stack+0x5c/0x83
>  bad_page+0xf5/0x10f
>  get_page_from_freelist+0xa46/0x11b0
>  __alloc_pages_nodemask+0x103/0x290
>  sgl_alloc_order+0x101/0x180
>  target_alloc_sgl+0x2c/0x40 [target_core_mod]
>  srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt]
>  srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt]
>  __ib_process_cq+0x55/0xa0 [ib_core]
>  ib_cq_poll_work+0x1b/0x60 [ib_core]
>  process_one_work+0x141/0x340
>  worker_thread+0x47/0x3e0
>  kthread+0xf5/0x130
>  ret_from_fork+0x1f/0x30
> 
> Fixes: e80a0af4759a ("lib/scatterlist: Introduce sgl_alloc() and
> sgl_free()")
> Reported-by: Laurence Oberman 
> Signed-off-by: Bart Van Assche 
> Cc: Nicholas A. Bellinger 
> Cc: Laurence Oberman 
> ---
>  drivers/target/target_core_transport.c |  2 +-
>  include/linux/scatterlist.h|  1 +
>  lib/scatterlist.c  | 32
> +++-
>  3 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/target/target_core_transport.c
> b/drivers/target/target_core_transport.c
> index a001ba711cca..c03a78ee26cd 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -2300,7 +2300,7 @@ static void target_complete_ok_work(struct
> work_struct *work)
>  
>  void target_free_sgl(struct scatterlist *sgl, int nents)
>  {
> - sgl_free(sgl);
> + sgl_free_n_order(sgl, nents, 0);
>  }
>  EXPORT_SYMBOL(target_free_sgl);
>  
> diff --git a/include/linux/scatterlist.h
> b/include/linux/scatterlist.h
> index b8a7c1d1dbe3..22b2131bcdcd 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -282,6 +282,7 @@ struct scatterlist *sgl_alloc_order(unsigned long
> long length,
>   gfp_t gfp, unsigned int
> *nent_p);
>  struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
>     unsigned int *nent_p);
> +void sgl_free_n_order(struct scatterlist *sgl, int nents, int
> order);
>  void sgl_free_order(struct scatterlist *sgl, int order);
>  void sgl_free(struct scatterlist *sgl);
>  #endif /* CONFIG_SGL_ALLOC */
> diff --git a/lib/scatterlist.c b/lib/scatterlist.c
> index 9afc9b432083..53728d391d3a 100644
> --- a/lib/scatterlist.c
> +++ b/lib/scatterlist.c
> @@ -512,7 +512,7 @@ struct scatterlist *sgl_alloc_order(unsigned long
> long length,
>   if (!sgl)
>   return NULL;
>  
> - sg_init_table(sgl, nent);
> + sg_init_table(sgl, nalloc);
>   sg = sgl;
>   while (length) {
>   elem_len = min_t(u64, length, PAGE_SIZE << order);
> @@ -526,7 +526,7 @@ struct scatterlist *sgl_alloc_order(unsigned long
> long length,
>   length -= elem_len;
>   sg = sg_next(sg);
>   }
> - WARN_ON_ONCE(sg);
> + WARN_ONCE(length, "length = %lld\n", length);
>   if (nent_p)
>   *nent_p = nent;
>   return sgl;
> @@ -549,22 +549,44 @@ struct scatterlist *sgl_alloc(unsigned long
> long length, gfp_t gfp,
>  EXPORT_SYMBOL(sgl_alloc);
>  
>  /**
> - * sgl_free_order - free a scatterlist and its pages
> + * sgl_free_n_order - free a scatterlist and its pages
>   * @sgl: Scatterlist with one or more elements
> + * @nents: Maximum number of elements to free
>   * @order: Second argument for __free_pages()
> + *
> + * Notes:
> + * - If several scatterlists have been chained and each chain
> element is
> + *   freed separately then it's essential to set nents correctly to
> avoid that a
> + *   page would get freed twice.
> + * - All pages in a chained scatterlist can be freed at once by
> setting @nents
> + *   to a high number.
>   */
> -void sgl_free_order(struct scatterlist *sgl, int order)
> +void sgl_free_n_order(struct scatterlist *sgl, int nents, int order)
>  {
>   struct scatterlist *sg;
>   struct page *page;
> + int i;
>  
> - for (sg = sgl; sg; sg = sg_next(sg)) {
> + for_each_sg(sgl, sg, nents, i) {
> + if (!sg)
> + break;
>   page = sg_page(sg);
>   if (page)
>     

[GIT PULL] additional nvme fixes and cleanups for Linux 4.16

2018-01-19 Thread Christoph Hellwig
The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993:

  blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() (2018-01-14 
10:46:24 -0700)

are available in the git repository at:

  git://git.infradead.org/nvme.git nvme-4.16

for you to fetch changes up to 88de4598bca84e27b261685c06fff816b8d932a1:

  nvme-pci: clean up SMBSZ bit definitions (2018-01-17 17:55:14 +0100)


Christoph Hellwig (2):
  nvme-pci: clean up CMB initialization
  nvme-pci: clean up SMBSZ bit definitions

James Smart (2):
  nvme-fc: fix rogue admin cmds stalling teardown
  nvme-fc: correct hang in nvme_ns_remove()

Minwoo Im (1):
  nvme: fix comment typos in nvme_create_io_queues

Roland Dreier (1):
  nvme-fabrics: fix memory leak when parsing host ID option

Roy Shterman (1):
  nvme: host delete_work and reset_work on separate workqueues

Sagi Grimberg (3):
  nvme-pci: serialize pci resets
  nvme-pci: allocate device queues storage space at probe
  nvmet: release a ns reference in nvmet_req_uninit if needed

 drivers/nvme/host/core.c|  47 ++--
 drivers/nvme/host/fabrics.c |   4 +-
 drivers/nvme/host/fc.c  |   6 ++
 drivers/nvme/host/nvme.h|   3 +
 drivers/nvme/host/pci.c | 133 
 drivers/nvme/host/rdma.c|   2 +-
 drivers/nvme/target/core.c  |   3 +
 drivers/nvme/target/loop.c  |   2 +-
 include/linux/nvme.h|  22 +---
 9 files changed, 132 insertions(+), 90 deletions(-)


[PATCH] lib/scatterlist: Fix chaining support in sgl_alloc_order()

2018-01-19 Thread Bart Van Assche
This patch avoids that workloads with large block sizes (megabytes)
can trigger the following call stack with the ib_srpt driver (that
driver is the only driver that chains scatterlists allocated by
sgl_alloc_order()):

BUG: Bad page state in process kworker/0:1H  pfn:2423a78
page:fb03d08e9e00 count:-3 mapcount:0 mapping:  (null) index:0x0
flags: 0x57c000()
raw: 0057c000   fffd
raw: dead0100 dead0200  
page dumped because: nonzero _count
CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G  I  4.15.0-rc7.bart+ 
#1
Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
Call Trace:
 dump_stack+0x5c/0x83
 bad_page+0xf5/0x10f
 get_page_from_freelist+0xa46/0x11b0
 __alloc_pages_nodemask+0x103/0x290
 sgl_alloc_order+0x101/0x180
 target_alloc_sgl+0x2c/0x40 [target_core_mod]
 srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt]
 srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt]
 __ib_process_cq+0x55/0xa0 [ib_core]
 ib_cq_poll_work+0x1b/0x60 [ib_core]
 process_one_work+0x141/0x340
 worker_thread+0x47/0x3e0
 kthread+0xf5/0x130
 ret_from_fork+0x1f/0x30

Fixes: e80a0af4759a ("lib/scatterlist: Introduce sgl_alloc() and sgl_free()")
Reported-by: Laurence Oberman 
Signed-off-by: Bart Van Assche 
Cc: Nicholas A. Bellinger 
Cc: Laurence Oberman 
---
 drivers/target/target_core_transport.c |  2 +-
 include/linux/scatterlist.h|  1 +
 lib/scatterlist.c  | 32 +++-
 3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/drivers/target/target_core_transport.c 
b/drivers/target/target_core_transport.c
index a001ba711cca..c03a78ee26cd 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2300,7 +2300,7 @@ static void target_complete_ok_work(struct work_struct 
*work)
 
 void target_free_sgl(struct scatterlist *sgl, int nents)
 {
-   sgl_free(sgl);
+   sgl_free_n_order(sgl, nents, 0);
 }
 EXPORT_SYMBOL(target_free_sgl);
 
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index b8a7c1d1dbe3..22b2131bcdcd 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -282,6 +282,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long 
length,
gfp_t gfp, unsigned int *nent_p);
 struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
  unsigned int *nent_p);
+void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
 void sgl_free_order(struct scatterlist *sgl, int order);
 void sgl_free(struct scatterlist *sgl);
 #endif /* CONFIG_SGL_ALLOC */
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 9afc9b432083..53728d391d3a 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -512,7 +512,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long 
length,
if (!sgl)
return NULL;
 
-   sg_init_table(sgl, nent);
+   sg_init_table(sgl, nalloc);
sg = sgl;
while (length) {
elem_len = min_t(u64, length, PAGE_SIZE << order);
@@ -526,7 +526,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long 
length,
length -= elem_len;
sg = sg_next(sg);
}
-   WARN_ON_ONCE(sg);
+   WARN_ONCE(length, "length = %lld\n", length);
if (nent_p)
*nent_p = nent;
return sgl;
@@ -549,22 +549,44 @@ struct scatterlist *sgl_alloc(unsigned long long length, 
gfp_t gfp,
 EXPORT_SYMBOL(sgl_alloc);
 
 /**
- * sgl_free_order - free a scatterlist and its pages
+ * sgl_free_n_order - free a scatterlist and its pages
  * @sgl: Scatterlist with one or more elements
+ * @nents: Maximum number of elements to free
  * @order: Second argument for __free_pages()
+ *
+ * Notes:
+ * - If several scatterlists have been chained and each chain element is
+ *   freed separately then it's essential to set nents correctly to avoid that 
a
+ *   page would get freed twice.
+ * - All pages in a chained scatterlist can be freed at once by setting @nents
+ *   to a high number.
  */
-void sgl_free_order(struct scatterlist *sgl, int order)
+void sgl_free_n_order(struct scatterlist *sgl, int nents, int order)
 {
struct scatterlist *sg;
struct page *page;
+   int i;
 
-   for (sg = sgl; sg; sg = sg_next(sg)) {
+   for_each_sg(sgl, sg, nents, i) {
+   if (!sg)
+   break;
page = sg_page(sg);
if (page)
__free_pages(page, order);
}
kfree(sgl);
 }
+EXPORT_SYMBOL(sgl_free_n_order);
+
+/**
+ * sgl_free_order - free a scatterlist and its pages
+ * @sgl: Scatterlist with one or more elements
+ * @order: Second argument for __free_pages()
+ 
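
A minimal usage sketch of the new sgl_free_n_order() helper, assuming the
sgl_alloc_order() signature introduced by commit e80a0af4759a ("lib/scatterlist:
Introduce sgl_alloc() and sgl_free()"); this is illustration only, not the
in-tree caller:

	unsigned int nents;
	struct scatterlist *sgl;

	/* Allocate a (possibly chained) order-0 scatterlist for 1 MiB. */
	sgl = sgl_alloc_order(SZ_1M, 0 /* order */, true /* chainable */,
			      GFP_KERNEL, &nents);
	if (!sgl)
		return -ENOMEM;

	/* ... map for DMA and perform the transfer ... */

	/* Free exactly the elements that were allocated; passing the
	 * recorded nents is what prevents pages of a chained list from
	 * being freed twice. */
	sgl_free_n_order(sgl, nents, 0 /* order */);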

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 12:38pm -0500,
Jens Axboe  wrote:

> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> > 
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
> 
> Here's an example of that, using my current block tree (merged into
> master).  The setup is dm-mpath on top of null_blk, the latter having
> just a single request. Both are mq devices.
> 
> Fio direct 4k random reads on dm_mq: ~250K iops
> 
> Start dd on underlying device (or partition on same device), just doing
> sequential reads.
> 
> Fio direct 4k random reads on dm_mq with dd running: 9 iops
> 
> No schedulers involved.
> 
> https://i.imgur.com/WTDnnwE.gif

FYI, your tree doesn't have these dm-4.16 changes (which are needed to
make Ming's blk-mq changes that are in your tree, commit 396eaf21e et
al, have Ming's desired effect on DM):

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=050af08ffb1b62af69196d61c22a0755f9a3cdbd
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=459b54019cfeb7330ed4863ad40f78489e0ff23d
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=ec3eaf9a673106f66606896aed6ddd20180b02ec

The fact that you're seeing such shit results without dm-4.16 commit
ec3eaf9a67 (which reverts the older commit 6077c2d706) means: yeap, this
is really awful, let's fix it!  But it is a different flavor of awful,
because dm-rq.c:map_request() will handle the DM_MAPIO_DELAY_REQUEUE
result from null_blk's blk_get_request() failure using
dm_mq_delay_requeue_request() against the dm-mq mpath device:

blk_mq_requeue_request(rq, false);
__dm_mq_kick_requeue_list(rq->q, msecs);

So that begs the question: why are we stalling, _exactly_?  (You may have
it all figured out, as your gif implies, but I'm not there yet.)

It might be interesting to see how the same test behaves without all of
the churn we've staged for 4.16, e.g. against v4.15-rc8.

Mike


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote:
> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>  On 1/19/18 9:05 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> >> On 1/19/18 8:40 AM, Ming Lei wrote:
> >> Where does the dm STS_RESOURCE error usually come from - what's 
> >> exact
> >> resource are we running out of?
> >
> > It is from blk_get_request(underlying queue), see
> > multipath_clone_and_map().
> 
>  That's what I thought. So for a low queue depth underlying queue, 
>  it's
>  quite possible that this situation can happen. Two potential 
>  solutions
>  I see:
> 
>  1) As described earlier in this thread, having a mechanism for being
> notified when the scarce resource becomes available. It would not
> be hard to tap into the existing sbitmap wait queue for that.
> 
>  2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> allocation. I haven't read the dm code to know if this is a
> possibility or not.
> 
>  I'd probably prefer #1. It's a classic case of trying to get the
>  request, and if it fails, add ourselves to the sbitmap tag wait
>  queue head, retry, and bail if that also fails. Connecting the
>  scarce resource and the consumer is the only way to really fix
>  this, without bogus arbitrary delays.
> >>>
> >>> Right, as I have replied to Bart, using mod_delayed_work_on() with
> >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> >>> resource should fix this issue.
> >>
> >> It'll fix the forever stall, but it won't really fix it, as we'll slow
> >> down the dm device by some random amount.
> >>
> >> A simple test case would be to have a null_blk device with a queue 
> >> depth
> >> of one, and dm on top of that. Start a fio job that runs two jobs: one
> >> that does IO to the underlying device, and one that does IO to the dm
> >> device. If the job on the dm device runs substantially slower than the
> >> one to the underlying device, then the problem isn't really fixed.
> >
> > I remembered that I tried this test on scsi-debug & dm-mpath over 
> > scsi-debug,
> > seems not observed this issue, could you explain a bit why IO over 
> > dm-mpath
> > may be slower? Because both two IO contexts call same get_request(), and
> > in theory dm-mpath should be a bit quicker since it uses direct issue 
> > for
> > underlying queue, without io scheduler involved.
> 
>  Because if you lose the race for getting the request, you'll have some
>  arbitrary delay before trying again, potentially. Compared to the direct
> >>>
> >>> But the restart still works, one request is completed, then the queue
> >>> is return immediately because we use mod_delayed_work_on(0), so looks
> >>> no such issue.
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> > 
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
> 
> Here's an example of that, using my current block tree (merged into
> master).  The setup is dm-mpath on top of null_blk, the latter having
> just a single request. Both are mq devices.
> 
> Fio direct 4k random reads on dm_mq: ~250K iops
> 
> Start dd on underlying device (or partition on same device), just doing
> sequential reads.
> 
> Fio direct 4k random reads on dm_mq with dd running: 9 iops
> 
> No schedulers involved.
> 
> https://i.imgur.com/WTDnnwE.gif

This DM-specific issue might be addressed by applying a notifier_chain
(or similar mechanism) between the two queues; I will think about the
details tomorrow.


-- 
Ming
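
A rough sketch of what such a notifier hookup between the two queues could look
like; everything except the core notifier API (struct notifier_block,
atomic_notifier_chain_register(), atomic_notifier_call_chain()) is a made-up
name used only to illustrate the idea:

/* dm-rq side: rerun our queues when the underlying queue frees a tag. */
static int dm_rq_underlying_tag_freed(struct notifier_block *nb,
				      unsigned long action, void *data)
{
	struct mapped_device *md =
		container_of(nb, struct mapped_device, tag_freed_nb);

	dm_mq_kick_requeue_list(md);	/* retry the stalled clone-and-map */
	return NOTIFY_OK;
}

/* At multipath table load time (rq_freed_notifier is hypothetical): */
md->tag_freed_nb.notifier_call = dm_rq_underlying_tag_freed;
atomic_notifier_chain_register(&underlying_q->rq_freed_notifier,
			       &md->tag_freed_nb);

/* And in the underlying queue's request-free path: */
atomic_notifier_call_chain(&q->rq_freed_notifier, 0, q);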


[PATCH] blk-throttle: use queue_is_rq_based

2018-01-19 Thread weiping zhang
Use queue_is_rq_based() instead of open-coding the check.

Signed-off-by: weiping zhang 
---
 block/blk-throttle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 96ad326..457e985 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2456,7 +2456,7 @@ void blk_throtl_register_queue(struct request_queue *q)
td->throtl_slice = DFL_THROTL_SLICE_HD;
 #endif
 
-   td->track_bio_latency = !q->mq_ops && !q->request_fn;
+   td->track_bio_latency = !(queue_is_rq_based(q));
if (!td->track_bio_latency)
blk_stat_enable_accounting(q);
 }
-- 
2.9.4



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 9:37 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:26 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
 On 1/19/18 9:05 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>> On 1/19/18 8:40 AM, Ming Lei wrote:
>> Where does the dm STS_RESOURCE error usually come from - what's exact
>> resource are we running out of?
>
> It is from blk_get_request(underlying queue), see
> multipath_clone_and_map().

 That's what I thought. So for a low queue depth underlying queue, it's
 quite possible that this situation can happen. Two potential solutions
 I see:

 1) As described earlier in this thread, having a mechanism for being
notified when the scarce resource becomes available. It would not
be hard to tap into the existing sbitmap wait queue for that.

 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
allocation. I haven't read the dm code to know if this is a
possibility or not.

 I'd probably prefer #1. It's a classic case of trying to get the
 request, and if it fails, add ourselves to the sbitmap tag wait
 queue head, retry, and bail if that also fails. Connecting the
 scarce resource and the consumer is the only way to really fix
 this, without bogus arbitrary delays.
>>>
>>> Right, as I have replied to Bart, using mod_delayed_work_on() with
>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
>>> resource should fix this issue.
>>
>> It'll fix the forever stall, but it won't really fix it, as we'll slow
>> down the dm device by some random amount.
>>
>> A simple test case would be to have a null_blk device with a queue depth
>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>> that does IO to the underlying device, and one that does IO to the dm
>> device. If the job on the dm device runs substantially slower than the
>> one to the underlying device, then the problem isn't really fixed.
>
> I remembered that I tried this test on scsi-debug & dm-mpath over 
> scsi-debug,
> seems not observed this issue, could you explain a bit why IO over 
> dm-mpath
> may be slower? Because both two IO contexts call same get_request(), and
> in theory dm-mpath should be a bit quicker since it uses direct issue for
> underlying queue, without io scheduler involved.

 Because if you lose the race for getting the request, you'll have some
 arbitrary delay before trying again, potentially. Compared to the direct
>>>
>>> But the restart still works, one request is completed, then the queue
>>> is return immediately because we use mod_delayed_work_on(0), so looks
>>> no such issue.
>>
>> There are no pending requests for this case, nothing to restart the
>> queue. When you fail that blk_get_request(), you are idle, nothing
>> is pending.
> 
> I think we needn't worry about that, once a device is attached to
> dm-rq, it can't be mounted any more, and usually user don't use the device
> directly and by dm-mpath at the same time.

Here's an example of that, using my current block tree (merged into
master).  The setup is dm-mpath on top of null_blk, the latter having
just a single request. Both are mq devices.

Fio direct 4k random reads on dm_mq: ~250K iops

Start dd on underlying device (or partition on same device), just doing
sequential reads.

Fio direct 4k random reads on dm_mq with dd running: 9 iops

No schedulers involved.

https://i.imgur.com/WTDnnwE.gif

-- 
Jens Axboe



Re: [PATCH v2] bdi: show error log when fail to create bdi debugfs entry

2018-01-19 Thread weiping zhang
2018-01-11 0:36 GMT+08:00 weiping zhang :
> bdi debugfs dir/file may create fail, add error log here.
>
> Signed-off-by: weiping zhang 
> ---
> V1->V2:
> fix indentation and make log message more clear
>
>  mm/backing-dev.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index b5f940c..0a49665 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -885,7 +885,9 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
> char *fmt, va_list args)
> cgwb_bdi_register(bdi);
> bdi->dev = dev;
>
> -   bdi_debug_register(bdi, dev_name(dev));
> +   if (bdi_debug_register(bdi, dev_name(dev)))
> +   pr_warn("blkdev %s: creation of bdi debugfs entries 
> failed.\n",
> +   dev_name(dev));
> set_bit(WB_registered, &bdi->wb.state);
>
> spin_lock_bh(&bdi_lock);
> --

Hi Jens,

mdadm has no permission to create a debugfs entry if SELinux is enabled on
Fedora and CentOS,
and we already have 6d0e4827b72 (Revert "bdi: add error handle for
bdi_debug_register"), that is to say
bdi debugfs is not a key component of the block device; this patch just
adds a warning log.

Thanks


Re: [PATCH 3/3] block: Remove kblockd_schedule_delayed_work{,_on}()

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 11:58am -0500,
Bart Van Assche  wrote:

> The previous patch removed all users of these two functions. Hence
> also remove the functions themselves.
> 
> Signed-off-by: Bart Van Assche 

Reviewed-by: Mike Snitzer 


Re: [PATCH 2/3] blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 11:58am -0500,
Bart Van Assche  wrote:

> Make sure that calling blk_mq_run_hw_queue() or
> blk_mq_kick_requeue_list() triggers a queue run without delay even
> if blk_mq_delay_run_hw_queue() has been called recently and if its
> delay has not yet expired.
> 
> Signed-off-by: Bart Van Assche 

Reviewed-by: Mike Snitzer 


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 10:09:11AM -0700, Jens Axboe wrote:
> On 1/19/18 10:05 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:47 AM, Mike Snitzer wrote:
> >>> On Fri, Jan 19 2018 at 11:41am -0500,
> >>> Jens Axboe  wrote:
> >>>
>  On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> >
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the 
> > device
> > directly and by dm-mpath at the same time.
> 
>  Even if it doesn't happen for a normal dm setup, it is a case that
>  needs to be handled. The request allocation is just one example of
>  a wider scope resource that can be unavailable. If the driver returns
>  NO_DEV_RESOURCE (or whatever name), it will be a possibility that
>  the device itself is currently idle.
> >>>
> >>> How would a driver's resources be exhausted yet the device is idle (so
> >>> as not to be able to benefit from RESTART)?
> >>
> >> I've outlined a number of these examples already. Another case might be:
> >>
> >> 1) Device is idle
> >> 2) Device gets request
> >> 3) Device attempts to DMA map
> >> 4) DMA map fails because the IOMMU is out of space (nic is using it all)
> >> 5) Device returns STS_RESOURCE
> >> 6) Queue is marked as needing a restart
> >>
> >> All's well, except there is no IO on this device that will notice the
> >> restart bit and retry the operation.
> >>
> >> Replace IOMMU failure with any other resource that the driver might need
> >> for an IO, which isn't tied to a device specific resource.
> >> blk_get_request() on dm is an example, as is any allocation failure
> >> occurring in the queue IO path for the driver.
> > 
> > Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all
> > the current STS_RESOURCE for non-device resource allocation(kmalloc, dma
> > map, get_request, ...) should be converted to NO_DEV_RESOURCE.

Or we could simply introduce BLK_STS_DEV_RESOURCE and convert only the cases
that run out of a real device resource to it; that way may be much easier to do.

-- 
Ming
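
For illustration, the split being floated here could end up looking roughly like
this in the dispatch path; BLK_STS_DEV_RESOURCE and BLK_MQ_RESOURCE_DELAY do not
exist at the time of this thread and are shown only as a sketch of the proposal:

	if (ret == BLK_STS_DEV_RESOURCE) {
		/* Resource owned by this device: an in-flight request on this
		 * queue is guaranteed to complete, so the RESTART machinery
		 * will rerun the queue for us. */
		blk_mq_sched_mark_restart_hctx(hctx);
	} else if (ret == BLK_STS_RESOURCE) {
		/* Non-device resource (kmalloc, IOMMU space, a request on some
		 * other queue as in dm-mpath): nothing is guaranteed to rerun
		 * this queue, so fall back to a short delayed run instead of
		 * stalling forever. */
		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY /* msecs */);
	}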


Re: [PATCH 1/3] blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 11:58am -0500,
Bart Van Assche  wrote:

> Most blk-mq functions have a name that follows the pattern blk_mq_${action}.
> However, the function name blk_mq_request_direct_issue is an exception.
> Hence rename this function. This patch does not change any functionality.
> 
> Signed-off-by: Bart Van Assche 

Reviewed-by: Mike Snitzer 


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote:
> On 1/19/18 9:47 AM, Mike Snitzer wrote:
> > On Fri, Jan 19 2018 at 11:41am -0500,
> > Jens Axboe  wrote:
> > 
> >> On 1/19/18 9:37 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
>  On 1/19/18 9:26 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> 
>  There are no pending requests for this case, nothing to restart the
>  queue. When you fail that blk_get_request(), you are idle, nothing
>  is pending.
> >>>
> >>> I think we needn't worry about that, once a device is attached to
> >>> dm-rq, it can't be mounted any more, and usually user don't use the device
> >>> directly and by dm-mpath at the same time.
> >>
> >> Even if it doesn't happen for a normal dm setup, it is a case that
> >> needs to be handled. The request allocation is just one example of
> >> a wider scope resource that can be unavailable. If the driver returns
> >> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
> >> the device itself is currently idle.
> > 
> > How would a driver's resources be exhausted yet the device is idle (so
> > as not to be able to benefit from RESTART)?
> 
> I've outlined a number of these examples already. Another case might be:
> 
> 1) Device is idle
> 2) Device gets request
> 3) Device attempts to DMA map
> 4) DMA map fails because the IOMMU is out of space (nic is using it all)
> 5) Device returns STS_RESOURCE
> 6) Queue is marked as needing a restart
> 
> All's well, except there is no IO on this device that will notice the
> restart bit and retry the operation.
> 
> Replace IOMMU failure with any other resource that the driver might need
> for an IO, which isn't tied to a device specific resource.
> blk_get_request() on dm is an example, as is any allocation failure
> occurring in the queue IO path for the driver.

Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all
the current STS_RESOURCE returns for non-device resource allocations (kmalloc,
dma map, get_request, ...) should be converted to NO_DEV_RESOURCE.

And it is a generic issue, which needs a generic solution.

It seems that running the queue after an arbitrary delay is the only way we
have thought of for this case; are there other solutions?

If the decision is made, let's make/review the patch, :-)

-- 
Ming


[PATCH 3/3] block: Remove kblockd_schedule_delayed_work{,_on}()

2018-01-19 Thread Bart Van Assche
The previous patch removed all users of these two functions. Hence
also remove the functions themselves.

Signed-off-by: Bart Van Assche 
---
 block/blk-core.c   | 14 --
 include/linux/blkdev.h |  2 --
 2 files changed, 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5c21d6b97985..a2005a485335 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3446,20 +3446,6 @@ int kblockd_mod_delayed_work_on(int cpu, struct 
delayed_work *dwork,
 }
 EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
 
-int kblockd_schedule_delayed_work(struct delayed_work *dwork,
- unsigned long delay)
-{
-   return queue_delayed_work(kblockd_workqueue, dwork, delay);
-}
-EXPORT_SYMBOL(kblockd_schedule_delayed_work);
-
-int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
-unsigned long delay)
-{
-   return queue_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
-}
-EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
-
 /**
  * blk_start_plug - initialize blk_plug and track it inside the task_struct
  * @plug:  The  blk_plug that needs to be initialized
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a4c7aafb686d..4f3df807cf8f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1817,8 +1817,6 @@ static inline bool req_gap_front_merge(struct request 
*req, struct bio *bio)
 
 int kblockd_schedule_work(struct work_struct *work);
 int kblockd_schedule_work_on(int cpu, struct work_struct *work);
-int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long 
delay);
-int kblockd_schedule_delayed_work_on(int cpu, struct delayed_work *dwork, 
unsigned long delay);
 int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned 
long delay);
 
 #ifdef CONFIG_BLK_CGROUP
-- 
2.15.1



[PATCH 0/3] Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays

2018-01-19 Thread Bart Van Assche
Hello Jens,

Earlier today it was agreed that a blk_mq_delay_run_hw_queue() call followed
by a blk_mq_run_hw_queue() call should result in an immediate queue run. Hence
this patch series. Please consider this patch series for kernel v4.16.

Thanks,

Bart.

Bart Van Assche (3):
  blk-mq: Rename blk_mq_request_direct_issue() into
blk_mq_request_issue_directly()
  blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended
delays
  block: Remove kblockd_schedule_delayed_work{,_on}()

 block/blk-core.c   | 16 +---
 block/blk-mq.c | 11 +--
 block/blk-mq.h |  2 +-
 include/linux/blkdev.h |  2 --
 4 files changed, 7 insertions(+), 24 deletions(-)

-- 
2.15.1



[PATCH 2/3] blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays

2018-01-19 Thread Bart Van Assche
Make sure that calling blk_mq_run_hw_queue() or
blk_mq_kick_requeue_list() triggers a queue run without delay even
if blk_mq_delay_run_hw_queue() has been called recently and if its
delay has not yet expired.

Signed-off-by: Bart Van Assche 
---
 block/blk-mq.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6859b5492509..01f271d40825 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -787,7 +787,7 @@ EXPORT_SYMBOL(blk_mq_add_to_requeue_list);
 
 void blk_mq_kick_requeue_list(struct request_queue *q)
 {
-   kblockd_schedule_delayed_work(&q->requeue_work, 0);
+   kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
 }
 EXPORT_SYMBOL(blk_mq_kick_requeue_list);
 
@@ -1403,9 +1403,8 @@ static void __blk_mq_delay_run_hw_queue(struct 
blk_mq_hw_ctx *hctx, bool async,
put_cpu();
}
 
-   kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
-    &hctx->run_work,
-    msecs_to_jiffies(msecs));
+   kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
+   msecs_to_jiffies(msecs));
 }
 
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
-- 
2.15.1
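
The behavioural difference this change relies on, spelled out below (per the
accompanying patches, kblockd_schedule_delayed_work_on() and
kblockd_mod_delayed_work_on() are thin wrappers around queue_delayed_work_on()
and mod_delayed_work_on() on the kblockd workqueue); a sketch, with "cpu"
standing in for blk_mq_hctx_next_cpu(hctx):

	blk_mq_delay_run_hw_queue(hctx, 100);	/* arms hctx->run_work with a 100 ms delay */

	/* Old path: queue_delayed_work_on() is a no-op while the work item is
	 * still pending, so a later "run the queue now" request would still
	 * sit out the remaining 100 ms. */
	kblockd_schedule_delayed_work_on(cpu, &hctx->run_work, 0);

	/* New path: mod_delayed_work_on() re-arms the pending item's timer,
	 * so the queue run happens immediately. */
	kblockd_mod_delayed_work_on(cpu, &hctx->run_work, 0);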



[PATCH 1/3] blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()

2018-01-19 Thread Bart Van Assche
Most blk-mq functions have a name that follows the pattern blk_mq_${action}.
However, the function name blk_mq_request_direct_issue is an exception.
Hence rename this function. This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
---
 block/blk-core.c | 2 +-
 block/blk-mq.c   | 4 ++--
 block/blk-mq.h   | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ae295c1be2d..5c21d6b97985 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2525,7 +2525,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   return blk_mq_request_direct_issue(rq);
+   return blk_mq_request_issue_directly(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7515dd95a36..6859b5492509 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1787,7 +1787,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 * RCU or SRCU read lock is needed before checking quiesced flag.
 *
 * When queue is stopped or quiesced, ignore 'bypass_insert' from
-* blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
+* blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
 * and avoid driver to try to dispatch again.
 */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
@@ -1835,7 +1835,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
hctx_unlock(hctx, srcu_idx);
 }
 
-blk_status_t blk_mq_request_direct_issue(struct request *rq)
+blk_status_t blk_mq_request_issue_directly(struct request *rq)
 {
blk_status_t ret;
int srcu_idx;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e3ebc93646ca..88c558f71819 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -75,7 +75,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
struct list_head *list);
 
 /* Used by blk_insert_cloned_request() to issue request directly */
-blk_status_t blk_mq_request_direct_issue(struct request *rq);
+blk_status_t blk_mq_request_issue_directly(struct request *rq);
 
 /*
  * CPU -> queue mappings
-- 
2.15.1



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 9:47 AM, Mike Snitzer wrote:
> On Fri, Jan 19 2018 at 11:41am -0500,
> Jens Axboe  wrote:
> 
>> On 1/19/18 9:37 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
 On 1/19/18 9:26 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:

 There are no pending requests for this case, nothing to restart the
 queue. When you fail that blk_get_request(), you are idle, nothing
 is pending.
>>>
>>> I think we needn't worry about that, once a device is attached to
>>> dm-rq, it can't be mounted any more, and usually user don't use the device
>>> directly and by dm-mpath at the same time.
>>
>> Even if it doesn't happen for a normal dm setup, it is a case that
>> needs to be handled. The request allocation is just one example of
>> a wider scope resource that can be unavailable. If the driver returns
>> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
>> the device itself is currently idle.
> 
> How would a driver's resources be exhausted yet the device is idle (so
> as not to be able to benefit from RESTART)?

I've outlined a number of these examples already. Another case might be:

1) Device is idle
2) Device gets request
3) Device attempts to DMA map
4) DMA map fails because the IOMMU is out of space (nic is using it all)
5) Device returns STS_RESOURCE
6) Queue is marked as needing a restart

All's well, except there is no IO on this device that will notice the
restart bit and retry the operation.

Replace IOMMU failure with any other resource that the driver might need
for an IO, which isn't tied to a device specific resource.
blk_get_request() on dm is an example, as is any allocation failure
occurring in the queue IO path for the driver.
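
A rough sketch of how that failure surfaces in a driver's ->queue_rq()
(hypothetical foo_* helpers, assuming the usual blk-mq entry point; only
the failure path matters here):

    static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
                                     const struct blk_mq_queue_data *bd)
    {
            struct foo_dev *dev = hctx->queue->queuedata;
            struct request *rq = bd->rq;

            blk_mq_start_request(rq);

            /* Steps 3-4 above: the mapping needs IOMMU space we don't have. */
            if (foo_dma_map(dev, rq) < 0)
                    return BLK_STS_RESOURCE; /* steps 5-6: queue marked for restart */

            foo_submit(dev, rq);
            return BLK_STS_OK;
    }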

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 11:41am -0500,
Jens Axboe  wrote:

> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> > 
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
> 
> Even if it doesn't happen for a normal dm setup, it is a case that
> needs to be handled. The request allocation is just one example of
> a wider scope resource that can be unavailable. If the driver returns
> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
> the device itself is currently idle.

How would a driver's resources be exhausted yet the device is idle (so
as not to be able to benefit from RESTART)?


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 9:37 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:26 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
 On 1/19/18 9:05 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>> On 1/19/18 8:40 AM, Ming Lei wrote:
>> Where does the dm STS_RESOURCE error usually come from - what's exact
>> resource are we running out of?
>
> It is from blk_get_request(underlying queue), see
> multipath_clone_and_map().

 That's what I thought. So for a low queue depth underlying queue, it's
 quite possible that this situation can happen. Two potential solutions
 I see:

 1) As described earlier in this thread, having a mechanism for being
notified when the scarce resource becomes available. It would not
be hard to tap into the existing sbitmap wait queue for that.

 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
allocation. I haven't read the dm code to know if this is a
possibility or not.

 I'd probably prefer #1. It's a classic case of trying to get the
 request, and if it fails, add ourselves to the sbitmap tag wait
 queue head, retry, and bail if that also fails. Connecting the
 scarce resource and the consumer is the only way to really fix
 this, without bogus arbitrary delays.
>>>
>>> Right, as I have replied to Bart, using mod_delayed_work_on() with
>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
>>> resource should fix this issue.
>>
>> It'll fix the forever stall, but it won't really fix it, as we'll slow
>> down the dm device by some random amount.
>>
>> A simple test case would be to have a null_blk device with a queue depth
>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>> that does IO to the underlying device, and one that does IO to the dm
>> device. If the job on the dm device runs substantially slower than the
>> one to the underlying device, then the problem isn't really fixed.
>
> I remembered that I tried this test on scsi-debug & dm-mpath over 
> scsi-debug,
> seems not observed this issue, could you explain a bit why IO over 
> dm-mpath
> may be slower? Because both two IO contexts call same get_request(), and
> in theory dm-mpath should be a bit quicker since it uses direct issue for
> underlying queue, without io scheduler involved.

 Because if you lose the race for getting the request, you'll have some
 arbitrary delay before trying again, potentially. Compared to the direct
>>>
>>> But the restart still works, one request is completed, then the queue
>>> is return immediately because we use mod_delayed_work_on(0), so looks
>>> no such issue.
>>
>> There are no pending requests for this case, nothing to restart the
>> queue. When you fail that blk_get_request(), you are idle, nothing
>> is pending.
> 
> I think we needn't worry about that, once a device is attached to
> dm-rq, it can't be mounted any more, and usually user don't use the device
> directly and by dm-mpath at the same time.

Even if it doesn't happen for a normal dm setup, it is a case that
needs to be handled. The request allocation is just one example of
a wider scope resource that can be unavailable. If the driver returns
NO_DEV_RESOURCE (or whatever name), it will be a possibility that
the device itself is currently idle.
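
Roughly what the distinction could look like at the dispatch site
(BLK_STS_NO_DEV_RESOURCE is only the name floated in this thread, not an
existing status; a sketch, not the actual dispatch code):

    ret = q->mq_ops->queue_rq(hctx, &bd);
    if (ret == BLK_STS_RESOURCE)
            blk_mq_sched_mark_restart_hctx(hctx);   /* in-flight IO will rerun us */
    else if (ret == BLK_STS_NO_DEV_RESOURCE)
            blk_mq_delay_run_hw_queue(hctx, 10);    /* nothing in flight, rerun ourselves */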

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> On 1/19/18 9:26 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:05 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>  On 1/19/18 8:40 AM, Ming Lei wrote:
>  Where does the dm STS_RESOURCE error usually come from - what's exact
>  resource are we running out of?
> >>>
> >>> It is from blk_get_request(underlying queue), see
> >>> multipath_clone_and_map().
> >>
> >> That's what I thought. So for a low queue depth underlying queue, it's
> >> quite possible that this situation can happen. Two potential solutions
> >> I see:
> >>
> >> 1) As described earlier in this thread, having a mechanism for being
> >>notified when the scarce resource becomes available. It would not
> >>be hard to tap into the existing sbitmap wait queue for that.
> >>
> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>allocation. I haven't read the dm code to know if this is a
> >>possibility or not.
> >>
> >> I'd probably prefer #1. It's a classic case of trying to get the
> >> request, and if it fails, add ourselves to the sbitmap tag wait
> >> queue head, retry, and bail if that also fails. Connecting the
> >> scarce resource and the consumer is the only way to really fix
> >> this, without bogus arbitrary delays.
> >
> > Right, as I have replied to Bart, using mod_delayed_work_on() with
> > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> > resource should fix this issue.
> 
>  It'll fix the forever stall, but it won't really fix it, as we'll slow
>  down the dm device by some random amount.
> 
>  A simple test case would be to have a null_blk device with a queue depth
>  of one, and dm on top of that. Start a fio job that runs two jobs: one
>  that does IO to the underlying device, and one that does IO to the dm
>  device. If the job on the dm device runs substantially slower than the
>  one to the underlying device, then the problem isn't really fixed.
> >>>
> >>> I remembered that I tried this test on scsi-debug & dm-mpath over 
> >>> scsi-debug,
> >>> seems not observed this issue, could you explain a bit why IO over 
> >>> dm-mpath
> >>> may be slower? Because both two IO contexts call same get_request(), and
> >>> in theory dm-mpath should be a bit quicker since it uses direct issue for
> >>> underlying queue, without io scheduler involved.
> >>
> >> Because if you lose the race for getting the request, you'll have some
> >> arbitrary delay before trying again, potentially. Compared to the direct
> > 
> > But the restart still works, one request is completed, then the queue
> > is return immediately because we use mod_delayed_work_on(0), so looks
> > no such issue.
> 
> There are no pending requests for this case, nothing to restart the
> queue. When you fail that blk_get_request(), you are idle, nothing
> is pending.

I think we needn't worry about that: once a device is attached to
dm-rq it can't be mounted any more, and users usually don't use the
device directly and via dm-mpath at the same time.

-- 
Ming


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 9:26 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:05 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
 On 1/19/18 8:40 AM, Ming Lei wrote:
 Where does the dm STS_RESOURCE error usually come from - what's exact
 resource are we running out of?
>>>
>>> It is from blk_get_request(underlying queue), see
>>> multipath_clone_and_map().
>>
>> That's what I thought. So for a low queue depth underlying queue, it's
>> quite possible that this situation can happen. Two potential solutions
>> I see:
>>
>> 1) As described earlier in this thread, having a mechanism for being
>>notified when the scarce resource becomes available. It would not
>>be hard to tap into the existing sbitmap wait queue for that.
>>
>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>allocation. I haven't read the dm code to know if this is a
>>possibility or not.
>>
>> I'd probably prefer #1. It's a classic case of trying to get the
>> request, and if it fails, add ourselves to the sbitmap tag wait
>> queue head, retry, and bail if that also fails. Connecting the
>> scarce resource and the consumer is the only way to really fix
>> this, without bogus arbitrary delays.
>
> Right, as I have replied to Bart, using mod_delayed_work_on() with
> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> resource should fix this issue.

 It'll fix the forever stall, but it won't really fix it, as we'll slow
 down the dm device by some random amount.

 A simple test case would be to have a null_blk device with a queue depth
 of one, and dm on top of that. Start a fio job that runs two jobs: one
 that does IO to the underlying device, and one that does IO to the dm
 device. If the job on the dm device runs substantially slower than the
 one to the underlying device, then the problem isn't really fixed.
>>>
>>> I remembered that I tried this test on scsi-debug & dm-mpath over 
>>> scsi-debug,
>>> seems not observed this issue, could you explain a bit why IO over dm-mpath
>>> may be slower? Because both two IO contexts call same get_request(), and
>>> in theory dm-mpath should be a bit quicker since it uses direct issue for
>>> underlying queue, without io scheduler involved.
>>
>> Because if you lose the race for getting the request, you'll have some
>> arbitrary delay before trying again, potentially. Compared to the direct
> 
> But the restart still works, one request is completed, then the queue
> is return immediately because we use mod_delayed_work_on(0), so looks
> no such issue.

There are no pending requests for this case, nothing to restart the
queue. When you fail that blk_get_request(), you are idle, nothing
is pending.

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> On 1/19/18 9:05 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> >> On 1/19/18 8:40 AM, Ming Lei wrote:
> >> Where does the dm STS_RESOURCE error usually come from - what's exact
> >> resource are we running out of?
> >
> > It is from blk_get_request(underlying queue), see
> > multipath_clone_and_map().
> 
>  That's what I thought. So for a low queue depth underlying queue, it's
>  quite possible that this situation can happen. Two potential solutions
>  I see:
> 
>  1) As described earlier in this thread, having a mechanism for being
> notified when the scarce resource becomes available. It would not
> be hard to tap into the existing sbitmap wait queue for that.
> 
>  2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> allocation. I haven't read the dm code to know if this is a
> possibility or not.
> 
>  I'd probably prefer #1. It's a classic case of trying to get the
>  request, and if it fails, add ourselves to the sbitmap tag wait
>  queue head, retry, and bail if that also fails. Connecting the
>  scarce resource and the consumer is the only way to really fix
>  this, without bogus arbitrary delays.
> >>>
> >>> Right, as I have replied to Bart, using mod_delayed_work_on() with
> >>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> >>> resource should fix this issue.
> >>
> >> It'll fix the forever stall, but it won't really fix it, as we'll slow
> >> down the dm device by some random amount.
> >>
> >> A simple test case would be to have a null_blk device with a queue depth
> >> of one, and dm on top of that. Start a fio job that runs two jobs: one
> >> that does IO to the underlying device, and one that does IO to the dm
> >> device. If the job on the dm device runs substantially slower than the
> >> one to the underlying device, then the problem isn't really fixed.
> > 
> > I remembered that I tried this test on scsi-debug & dm-mpath over 
> > scsi-debug,
> > seems not observed this issue, could you explain a bit why IO over dm-mpath
> > may be slower? Because both two IO contexts call same get_request(), and
> > in theory dm-mpath should be a bit quicker since it uses direct issue for
> > underlying queue, without io scheduler involved.
> 
> Because if you lose the race for getting the request, you'll have some
> arbitrary delay before trying again, potentially. Compared to the direct

But the restart still works: once a request completes, the queue is
rerun immediately because we use mod_delayed_work_on(0), so it looks
like there is no such issue.


-- 
Ming


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 9:13 AM, Mike Snitzer wrote:
> On Fri, Jan 19 2018 at 10:48am -0500,
> Jens Axboe  wrote:
> 
>> On 1/19/18 8:40 AM, Ming Lei wrote:
>> Where does the dm STS_RESOURCE error usually come from - what's exact
>> resource are we running out of?
>
> It is from blk_get_request(underlying queue), see
> multipath_clone_and_map().

 That's what I thought. So for a low queue depth underlying queue, it's
 quite possible that this situation can happen. Two potential solutions
 I see:

 1) As described earlier in this thread, having a mechanism for being
notified when the scarce resource becomes available. It would not
be hard to tap into the existing sbitmap wait queue for that.

 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
allocation. I haven't read the dm code to know if this is a
possibility or not.
> 
> Right, #2 is _not_ the way forward.  Historically request-based DM used
> its own mempool for requests, this was to be able to have some measure
> of control and resiliency in the face of low memory conditions that
> might be affecting the broader system.
> 
> Then Christoph switched over to adding per-request-data; which ushered
> in the use of blk_get_request using ATOMIC allocations.  I like the
> result of that line of development.  But taking the next step of setting
> BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
> dm-mpath.c code is common to old .request_fn and blk-mq, at least the
> call to blk_get_request is).  Ultimately dm-mpath like to avoid blocking
> for a request because for this dm-mpath device we have multiple queues
> to allocate from if need be (provided we have an active-active storage
> network topology).

If you can go to multiple devices, obviously it should not block on a
single device. Blocking is only appropriate for the case where you can
go to just one device; at that point it would probably be fine. Or if
all your paths are busy, then blocking would also be OK.

But it's a much larger change, and would entail changing more than just
the actual call to blk_get_request().

>> A simple test case would be to have a null_blk device with a queue depth
>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>> that does IO to the underlying device, and one that does IO to the dm
>> device. If the job on the dm device runs substantially slower than the
>> one to the underlying device, then the problem isn't really fixed.
> 
> Not sure DM will allow the underlying device to be opened (due to
> master/slave ownership that is part of loading a DM table)?

There are many ways it could be setup - just partition the underlying
device then, and have one partition be part of the dm setup and the
other used directly.

>> That said, I'm fine with ensuring that we make forward progress always
>> first, and then we can come up with a proper solution to the issue. The
>> forward progress guarantee will be needed for the more rare failure
>> cases, like allocation failures. nvme needs that too, for instance, for
>> the discard range struct allocation.
> 
> Yeap, I'd be OK with that too.  We'd be better for revisted this and
> then have some time to develop the ultimate robust fix (#1, callback
> from above).

Yeah, we need the quick and dirty sooner, which just brings us back to
what we had before, essentially.

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 9:05 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>> On 1/19/18 8:40 AM, Ming Lei wrote:
>> Where does the dm STS_RESOURCE error usually come from - what's exact
>> resource are we running out of?
>
> It is from blk_get_request(underlying queue), see
> multipath_clone_and_map().

 That's what I thought. So for a low queue depth underlying queue, it's
 quite possible that this situation can happen. Two potential solutions
 I see:

 1) As described earlier in this thread, having a mechanism for being
notified when the scarce resource becomes available. It would not
be hard to tap into the existing sbitmap wait queue for that.

 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
allocation. I haven't read the dm code to know if this is a
possibility or not.

 I'd probably prefer #1. It's a classic case of trying to get the
 request, and if it fails, add ourselves to the sbitmap tag wait
 queue head, retry, and bail if that also fails. Connecting the
 scarce resource and the consumer is the only way to really fix
 this, without bogus arbitrary delays.
>>>
>>> Right, as I have replied to Bart, using mod_delayed_work_on() with
>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
>>> resource should fix this issue.
>>
>> It'll fix the forever stall, but it won't really fix it, as we'll slow
>> down the dm device by some random amount.
>>
>> A simple test case would be to have a null_blk device with a queue depth
>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>> that does IO to the underlying device, and one that does IO to the dm
>> device. If the job on the dm device runs substantially slower than the
>> one to the underlying device, then the problem isn't really fixed.
> 
> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug,
> seems not observed this issue, could you explain a bit why IO over dm-mpath
> may be slower? Because both two IO contexts call same get_request(), and
> in theory dm-mpath should be a bit quicker since it uses direct issue for
> underlying queue, without io scheduler involved.

Because if you lose the race for getting the request, you'll have some
arbitrary delay before trying again, potentially. Compared to the direct
user of the underlying device, who will simply sleep on the resource and
get woken the instant it's available.
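
A very rough sketch of #1 on the dm-mpath side (hypothetical helpers,
notably tag_wait_head() and a pre-initialized wait entry whose wake
function kicks the dm queue; this is not a real patch):

    clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
    if (IS_ERR(clone)) {
            /* Hook into the underlying queue's tag wait queue, then retry. */
            add_wait_queue(tag_wait_head(q), &pgpath->tag_wait);
            clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
            if (IS_ERR(clone))
                    return DM_MAPIO_REQUEUE; /* bail; a freed tag wakes us and reruns the queue */
            remove_wait_queue(tag_wait_head(q), &pgpath->tag_wait);
    }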

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 10:48am -0500,
Jens Axboe  wrote:

> On 1/19/18 8:40 AM, Ming Lei wrote:
>  Where does the dm STS_RESOURCE error usually come from - what's exact
>  resource are we running out of?
> >>>
> >>> It is from blk_get_request(underlying queue), see
> >>> multipath_clone_and_map().
> >>
> >> That's what I thought. So for a low queue depth underlying queue, it's
> >> quite possible that this situation can happen. Two potential solutions
> >> I see:
> >>
> >> 1) As described earlier in this thread, having a mechanism for being
> >>notified when the scarce resource becomes available. It would not
> >>be hard to tap into the existing sbitmap wait queue for that.
> >>
> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>allocation. I haven't read the dm code to know if this is a
> >>possibility or not.

Right, #2 is _not_ the way forward.  Historically request-based DM used
its own mempool for requests, this was to be able to have some measure
of control and resiliency in the face of low memory conditions that
might be affecting the broader system.

Then Christoph switched over to adding per-request data, which ushered
in the use of blk_get_request() with ATOMIC allocations.  I like the
result of that line of development.  But taking the next step of setting
BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
dm-mpath.c code is common to the old .request_fn and blk-mq, at least the
call to blk_get_request() is).  Ultimately dm-mpath would like to avoid
blocking for a request, because for this dm-mpath device we have multiple
queues to allocate from if need be (provided we have an active-active
storage network topology).

> >> I'd probably prefer #1. It's a classic case of trying to get the
> >> request, and if it fails, add ourselves to the sbitmap tag wait
> >> queue head, retry, and bail if that also fails. Connecting the
> >> scarce resource and the consumer is the only way to really fix
> >> this, without bogus arbitrary delays.
> > 
> > Right, as I have replied to Bart, using mod_delayed_work_on() with
> > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> > resource should fix this issue.
> 
> It'll fix the forever stall, but it won't really fix it, as we'll slow
> down the dm device by some random amount.

Agreed.

> A simple test case would be to have a null_blk device with a queue depth
> of one, and dm on top of that. Start a fio job that runs two jobs: one
> that does IO to the underlying device, and one that does IO to the dm
> device. If the job on the dm device runs substantially slower than the
> one to the underlying device, then the problem isn't really fixed.

Not sure DM will allow the underlying device to be opened (due to
master/slave ownership that is part of loading a DM table)?

> That said, I'm fine with ensuring that we make forward progress always
> first, and then we can come up with a proper solution to the issue. The
> forward progress guarantee will be needed for the more rare failure
> cases, like allocation failures. nvme needs that too, for instance, for
> the discard range struct allocation.

Yeap, I'd be OK with that too.  We'd be better off revisiting this
later, once we've had some time to develop the ultimate robust fix
(#1, the callback from above).

Mike


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> On 1/19/18 8:40 AM, Ming Lei wrote:
>  Where does the dm STS_RESOURCE error usually come from - what's exact
>  resource are we running out of?
> >>>
> >>> It is from blk_get_request(underlying queue), see
> >>> multipath_clone_and_map().
> >>
> >> That's what I thought. So for a low queue depth underlying queue, it's
> >> quite possible that this situation can happen. Two potential solutions
> >> I see:
> >>
> >> 1) As described earlier in this thread, having a mechanism for being
> >>notified when the scarce resource becomes available. It would not
> >>be hard to tap into the existing sbitmap wait queue for that.
> >>
> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>allocation. I haven't read the dm code to know if this is a
> >>possibility or not.
> >>
> >> I'd probably prefer #1. It's a classic case of trying to get the
> >> request, and if it fails, add ourselves to the sbitmap tag wait
> >> queue head, retry, and bail if that also fails. Connecting the
> >> scarce resource and the consumer is the only way to really fix
> >> this, without bogus arbitrary delays.
> > 
> > Right, as I have replied to Bart, using mod_delayed_work_on() with
> > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> > resource should fix this issue.
> 
> It'll fix the forever stall, but it won't really fix it, as we'll slow
> down the dm device by some random amount.
> 
> A simple test case would be to have a null_blk device with a queue depth
> of one, and dm on top of that. Start a fio job that runs two jobs: one
> that does IO to the underlying device, and one that does IO to the dm
> device. If the job on the dm device runs substantially slower than the
> one to the underlying device, then the problem isn't really fixed.

I remember trying this test on scsi-debug and on dm-mpath over
scsi-debug, and I didn't observe this issue. Could you explain a bit
why IO over dm-mpath may be slower? Both IO contexts call the same
get_request(), and in theory dm-mpath should be a bit quicker since it
uses direct issue for the underlying queue, without an io scheduler
involved.

-- 
Ming


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Bart Van Assche
On Fri, 2018-01-19 at 23:33 +0800, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 03:20:13PM +, Bart Van Assche wrote:
> > On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
> > > Please see queue_delayed_work_on(), hctx->run_work is shared by all
> > > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > > scheduling can make progress during the 100ms.
> > 
> > How about addressing that as follows:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f7515dd95a36..57f8379a476d 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
> > put_cpu();
> > }
> >  
> > -   kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> > -                                    &hctx->run_work,
> > -                                    msecs_to_jiffies(msecs));
> > +   kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> > +                               &hctx->run_work,
> > +                               msecs_to_jiffies(msecs));
> >  }
> >  
> >  void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
> > 
> > Bart.
> 
> Yes, this one together with Jen's suggestion with returning
> BLK_STS_NO_DEV_RESOURCE should fix this issue.
> 
> Could you cook a fix for this issue? Otherwise I am happy to do
> that.

Hello Ming,

I will look further into this.

Bart.

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 8:40 AM, Ming Lei wrote:
 Where does the dm STS_RESOURCE error usually come from - what's exact
 resource are we running out of?
>>>
>>> It is from blk_get_request(underlying queue), see
>>> multipath_clone_and_map().
>>
>> That's what I thought. So for a low queue depth underlying queue, it's
>> quite possible that this situation can happen. Two potential solutions
>> I see:
>>
>> 1) As described earlier in this thread, having a mechanism for being
>>notified when the scarce resource becomes available. It would not
>>be hard to tap into the existing sbitmap wait queue for that.
>>
>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>allocation. I haven't read the dm code to know if this is a
>>possibility or not.
>>
>> I'd probably prefer #1. It's a classic case of trying to get the
>> request, and if it fails, add ourselves to the sbitmap tag wait
>> queue head, retry, and bail if that also fails. Connecting the
>> scarce resource and the consumer is the only way to really fix
>> this, without bogus arbitrary delays.
> 
> Right, as I have replied to Bart, using mod_delayed_work_on() with
> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> resource should fix this issue.

It'll fix the forever stall, but it won't really fix it, as we'll slow
down the dm device by some random amount.

A simple test case would be to have a null_blk device with a queue depth
of one, and dm on top of that. Start a fio job that runs two jobs: one
that does IO to the underlying device, and one that does IO to the dm
device. If the job on the dm device runs substantially slower than the
one to the underlying device, then the problem isn't really fixed.

That said, I'm fine with ensuring that we make forward progress always
first, and then we can come up with a proper solution to the issue. The
forward progress guarantee will be needed for the more rare failure
cases, like allocation failures. nvme needs that too, for instance, for
the discard range struct allocation.
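
For instance, a simplified sketch of the nvme case (not the verbatim
driver code; 'segments' stands in for the discard segment count): the
range array is a GFP_ATOMIC allocation in the submission path, so it can
fail with nothing in flight to restart the queue later:

    struct nvme_dsm_range *range;

    range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
    if (!range)
            return BLK_STS_RESOURCE; /* no pending IO may ever rerun this queue */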

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 08:24:06AM -0700, Jens Axboe wrote:
> On 1/19/18 12:26 AM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
> >> On 1/18/18 7:32 PM, Ming Lei wrote:
> >>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
>  On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> >
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
> 
>  I agree, it sucks that any change there introduces the regression. I'm
>  fine with doing the delay insert again until a new patch is proven to be
>  better.
> >>>
> >>> That way is still buggy as I explained, since rerun queue before adding
> >>> request to hctx->dispatch_list isn't correct. Who can make sure the 
> >>> request
> >>> is visible when __blk_mq_run_hw_queue() is called?
> >>
> >> That race basically doesn't exist for a 10ms gap.
> >>
> >>> Not mention this way will cause performance regression again.
> >>
> >> How so? It's _exactly_ the same as what you are proposing, except mine
> >> will potentially run the queue when it need not do so. But given that
> >> these are random 10ms queue kicks because we are screwed, it should not
> >> matter. The key point is that it only should be if we have NO better
> >> options. If it's a frequently occurring event that we have to return
> >> BLK_STS_RESOURCE, then we need to get a way to register an event for
> >> when that condition clears. That event will then kick the necessary
> >> queue(s).
> > 
> > Please see queue_delayed_work_on(), hctx->run_work is shared by all
> > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > scheduling can make progress during the 100ms.
> 
> That's a bug, plain and simple. If someone does "run this queue in
> 100ms" and someone else comes in and says "run this queue now", the
> correct outcome is running this queue now.
> 
>  From the original topic of this email, we have conditions that can cause
>  the driver to not be able to submit an IO. A set of those conditions can
>  only happen if IO is in flight, and those cases we have covered just
>  fine. Another set can potentially trigger without IO being in flight.
>  These are cases where a non-device resource is unavailable at the time
>  of submission. This might be iommu running out of space, for instance,
>  or it might be a memory allocation of some sort. For these cases, we
>  don't get any notification when the shortage clears. All we can do is
>  ensure that we restart operations at some point in the future. We're SOL
>  at that point, but we have to ensure that we make forward progress.
> >>>
> >>> Right, it is a generic issue, not DM-specific one, almost all drivers
> >>> call kmalloc(GFP_ATOMIC) in IO path.
> >>
> >> GFP_ATOMIC basically never fails, unless we are out of memory. The
> > 
> > I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure might be
> > possible, and it is mentioned[1] there is such code in mm allocation
> > path, also OOM can happen too.
> > 
> >   if (some randomly generated condition) && (request is atomic)
> >   return NULL;
> > 
> > [1] https://lwn.net/Articles/276731/
> 
> That article is 10 years old. Once you run large scale production, you
> see what the real failures are. Fact is, for zero order allocation, if
> the atomic alloc fails the shit has really hit the fan. In that case, a
> delay of 10ms is not your main issue. It's a total red herring when you
> compare to the frequency of what Bart is seeing. It's noise, and
> irrelevant here. For an atomic zero order allocation failure, doing a
> short random sleep is perfectly fine.
> 
> >> exception is higher order allocations. If a driver has a higher order
> >> atomic allocation in its IO path, the device driver writer needs to be
> >> taken out behind the barn and shot. Simple as that. It will NEVER work
> >> well in a production environment. Witness the disaster that so many NIC
> >> driver writers have learned.
> >>
> >> This is NOT the case we care about here. It's resources that are more
> >> readily depleted because other devices are using them. If it's a high
> >> frequency or generally occurring event, then we simply must have a
> >> callback to restart the queue from that. The condition then becomes
> >> identical to device private starvation, the only difference being from
> >> where we restart the queue.
> >>
> >>> IMO, there is enough time for figuring out a generic solution before
> >>> 4.16 release.
> >>
> >> I would hope so, but the proposed solutions have not filled me with
> >> a lot of confidence in the end result so far.
> >>
>  That last set of conditions better not be a a common occurence, since
>  performance is down 

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Ming Lei
On Fri, Jan 19, 2018 at 03:20:13PM +, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
> > Please see queue_delayed_work_on(), hctx->run_work is shared by all
> > scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > scheduling can make progress during the 100ms.
> 
> How about addressing that as follows:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7515dd95a36..57f8379a476d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
>   put_cpu();
>   }
>  
> - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> -                                  &hctx->run_work,
> -                                  msecs_to_jiffies(msecs));
> + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +                             &hctx->run_work,
> +                             msecs_to_jiffies(msecs));
>  }
>  
>  void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
> 
> Bart.

Yes, this one together with Jens's suggestion of returning
BLK_STS_NO_DEV_RESOURCE should fix this issue.

Could you cook up a fix for it? Otherwise I am happy to do that.

-- 
Ming


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 8:20 AM, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
>> Please see queue_delayed_work_on(), hctx->run_work is shared by all
>> scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new
>> scheduling can make progress during the 100ms.
> 
> How about addressing that as follows:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7515dd95a36..57f8379a476d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
>   put_cpu();
>   }
>  
> - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> -                                  &hctx->run_work,
> -                                  msecs_to_jiffies(msecs));
> + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +                             &hctx->run_work,
> +                             msecs_to_jiffies(msecs));
>  }

Exactly. That's why I said it was just a bug in my previous email; not
honoring a newer run is just stupid. The only other thing you have to
be careful with here is the STOPPED bit.
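
i.e. something along these lines in __blk_mq_delay_run_hw_queue(), keeping
the existing stopped check ahead of the rearm (sketch only):

    if (unlikely(blk_mq_hctx_stopped(hctx)))
            return;

    kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
                                msecs_to_jiffies(msecs));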

-- 
Jens Axboe



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Jens Axboe
On 1/19/18 12:26 AM, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
>> On 1/18/18 7:32 PM, Ming Lei wrote:
>>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
 On 1/18/18 11:47 AM, Bart Van Assche wrote:
>> This is all very tiresome.
>
> Yes, this is tiresome. It is very annoying to me that others keep
> introducing so many regressions in such important parts of the kernel.
> It is also annoying to me that I get blamed if I report a regression
> instead of seeing that the regression gets fixed.

 I agree, it sucks that any change there introduces the regression. I'm
 fine with doing the delay insert again until a new patch is proven to be
 better.
>>>
>>> That way is still buggy as I explained, since rerun queue before adding
>>> request to hctx->dispatch_list isn't correct. Who can make sure the request
>>> is visible when __blk_mq_run_hw_queue() is called?
>>
>> That race basically doesn't exist for a 10ms gap.
>>
>>> Not mention this way will cause performance regression again.
>>
>> How so? It's _exactly_ the same as what you are proposing, except mine
>> will potentially run the queue when it need not do so. But given that
>> these are random 10ms queue kicks because we are screwed, it should not
>> matter. The key point is that it only should be if we have NO better
>> options. If it's a frequently occurring event that we have to return
>> BLK_STS_RESOURCE, then we need to get a way to register an event for
>> when that condition clears. That event will then kick the necessary
>> queue(s).
> 
> Please see queue_delayed_work_on(), hctx->run_work is shared by all
> scheduling, once blk_mq_delay_run_hw_queue(100ms) returns, no new
> scheduling can make progress during the 100ms.

That's a bug, plain and simple. If someone does "run this queue in
100ms" and someone else comes in and says "run this queue now", the
correct outcome is running this queue now.

 From the original topic of this email, we have conditions that can cause
 the driver to not be able to submit an IO. A set of those conditions can
 only happen if IO is in flight, and those cases we have covered just
 fine. Another set can potentially trigger without IO being in flight.
 These are cases where a non-device resource is unavailable at the time
 of submission. This might be iommu running out of space, for instance,
 or it might be a memory allocation of some sort. For these cases, we
 don't get any notification when the shortage clears. All we can do is
 ensure that we restart operations at some point in the future. We're SOL
 at that point, but we have to ensure that we make forward progress.
>>>
>>> Right, it is a generic issue, not DM-specific one, almost all drivers
>>> call kmalloc(GFP_ATOMIC) in IO path.
>>
>> GFP_ATOMIC basically never fails, unless we are out of memory. The
> 
> I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure might be
> possible, and it is mentioned[1] there is such code in mm allocation
> path, also OOM can happen too.
> 
>   if (some randomly generated condition) && (request is atomic)
>   return NULL;
> 
> [1] https://lwn.net/Articles/276731/

That article is 10 years old. Once you run large scale production, you
see what the real failures are. Fact is, for zero order allocation, if
the atomic alloc fails the shit has really hit the fan. In that case, a
delay of 10ms is not your main issue. It's a total red herring when you
compare to the frequency of what Bart is seeing. It's noise, and
irrelevant here. For an atomic zero order allocation failure, doing a
short random sleep is perfectly fine.

>> exception is higher order allocations. If a driver has a higher order
>> atomic allocation in its IO path, the device driver writer needs to be
>> taken out behind the barn and shot. Simple as that. It will NEVER work
>> well in a production environment. Witness the disaster that so many NIC
>> driver writers have learned.
>>
>> This is NOT the case we care about here. It's resources that are more
>> readily depleted because other devices are using them. If it's a high
>> frequency or generally occurring event, then we simply must have a
>> callback to restart the queue from that. The condition then becomes
>> identical to device private starvation, the only difference being from
>> where we restart the queue.
>>
>>> IMO, there is enough time for figuring out a generic solution before
>>> 4.16 release.
>>
>> I would hope so, but the proposed solutions have not filled me with
>> a lot of confidence in the end result so far.
>>
 That last set of conditions better not be a a common occurence, since
 performance is down the toilet at that point. I don't want to introduce
 hot path code to rectify it. Have the driver return if that happens in a
 way that is DIFFERENT from needing a normal restart. The driver knows if
 this is a resource that will become available