Re: [PATCH] blk-mq: don't queue more if we get a busy return

2018-06-28 Thread Ming Lei
On Thu, Jun 28, 2018 at 08:18:04PM -0600, Jens Axboe wrote:
> On 6/28/18 7:59 PM, Ming Lei wrote:
> > On Thu, Jun 28, 2018 at 09:46:50AM -0600, Jens Axboe wrote:
> >> Some devices have different queue limits depending on the type of IO. A
> >> classic case is SATA NCQ, where some commands can queue, but others
> >> cannot. If we have NCQ commands inflight and encounter a non-queueable
> >> command, the driver returns busy. Currently we attempt to dispatch more
> >> from the scheduler, if we were able to queue some commands. But for the
> >> case where we ended up stopping due to BUSY, we should not attempt to
> >> retrieve more from the scheduler. If we do, we can get into a situation
> >> where we attempt to queue a non-queueable command, get BUSY, then
> >> successfully retrieve more commands from that scheduler and queue those.
> >> This can repeat forever, starving the non-queueable command indefinitely.
> >>
> >> Fix this by NOT attempting to pull more commands from the scheduler, if
> >> we get a BUSY return. This should also be more optimal in terms of
> >> letting requests stay in the scheduler for as long as possible, if we
> >> get a BUSY due to the regular out-of-tags condition.
> >>
> >> Signed-off-by: Jens Axboe 
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index b6888ff556cf..d394cdd8d8c6 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -1075,6 +1075,9 @@ static bool blk_mq_mark_tag_wait(struct 
> >> blk_mq_hw_ctx **hctx,
> >>  
> >>  #define BLK_MQ_RESOURCE_DELAY 3   /* ms units */
> >>  
> >> +/*
> >> + * Returns true if we did some work AND can potentially do more.
> >> + */
> >>  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head 
> >> *list,
> >> bool got_budget)
> >>  {
> >> @@ -1205,8 +1208,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue 
> >> *q, struct list_head *list,
> >>blk_mq_run_hw_queue(hctx, true);
> >>else if (needs_restart && (ret == BLK_STS_RESOURCE))
> >>blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> >> +
> >> +  return false;
> >>}
> >>  
> >> +  /*
> >> +   * If the host/device is unable to accept more work, inform the
> >> +   * caller of that.
> >> +   */
> >> +  if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
> >> +  return false;
> > 
> > The above change may not be needed since one invariant is that
> > !list_empty(list) becomes true if either BLK_STS_RESOURCE or 
> > BLK_STS_DEV_RESOURCE
> > is returned from .queue_rq().
> 
> Agree, that's one case, but it's more bullet proof this way. And explicit,
> I'd rather not break this odd case again.

OK, it's just two lines of dead code, not a big deal.

I guess this patch may improve sequential IO performance a bit on SCSI HDDs,
so:

Reviewed-by: Ming Lei 

Thanks,
Ming


Re: [PATCH] block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Jens Axboe
On 6/28/18 5:16 PM, Kent Overstreet wrote:
> On Fri, Jun 29, 2018 at 07:10:47AM +0800, Ming Lei wrote:
>> On Thu, Jun 28, 2018 at 11:21 PM, Bart Van Assche
>>  wrote:
>>> On 06/27/18 17:30, Ming Lei wrote:

 One core idea of immutable bvecs is to use bio->bi_iter and the original
 bvec table to iterate over anywhere in the bio. That is why .bi_io_vec
 needs to be copied, but I don't see any reason why .bi_vcnt does.

 Do you have any use cases for .bi_vcnt on a cloned bio?
>>>
>>>
>>> So far this is only a theoretical concern. There are many functions in the
>>> block layer that use .bi_vcnt, and it is a lot of work to figure out all the
>>> callers of all these functions.
> 
> Back when I implemented immutable biovecs I thoroughly audited all the bi_vcnt
> uses and removed all of them that weren't by the code that owns/submits the 
> bio.
> 
> Grepping around I see one or two suspicious uses.. blk-merge.c in particular
> 
>> No, any function using .bi_vcnt on a cloned bio may be a bug, and we should
>> take a close look.
> 
> not just cloned bios, any code using bi_vcnt on a bio it didn't create is 
> wrong.
> 
> so big nack to this patch (I wasn't ccd on it though and it doesn't seem to 
> have
> hit lkml, so I can't find the original patch...)

Yeah, you are both right, I was smoking crack.

-- 
Jens Axboe



Re: [PATCH] blk-mq: don't queue more if we get a busy return

2018-06-28 Thread Jens Axboe
On 6/28/18 7:59 PM, Ming Lei wrote:
> On Thu, Jun 28, 2018 at 09:46:50AM -0600, Jens Axboe wrote:
>> Some devices have different queue limits depending on the type of IO. A
>> classic case is SATA NCQ, where some commands can queue, but others
>> cannot. If we have NCQ commands inflight and encounter a non-queueable
>> command, the driver returns busy. Currently we attempt to dispatch more
>> from the scheduler, if we were able to queue some commands. But for the
>> case where we ended up stopping due to BUSY, we should not attempt to
>> retrieve more from the scheduler. If we do, we can get into a situation
>> where we attempt to queue a non-queueable command, get BUSY, then
>> successfully retrieve more commands from that scheduler and queue those.
>> This can repeat forever, starving the non-queueable command indefinitely.
>>
>> Fix this by NOT attempting to pull more commands from the scheduler, if
>> we get a BUSY return. This should also be more optimal in terms of
>> letting requests stay in the scheduler for as long as possible, if we
>> get a BUSY due to the regular out-of-tags condition.
>>
>> Signed-off-by: Jens Axboe 
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b6888ff556cf..d394cdd8d8c6 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1075,6 +1075,9 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
>> **hctx,
>>  
>>  #define BLK_MQ_RESOURCE_DELAY   3   /* ms units */
>>  
>> +/*
>> + * Returns true if we did some work AND can potentially do more.
>> + */
>>  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head 
>> *list,
>>   bool got_budget)
>>  {
>> @@ -1205,8 +1208,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
>> struct list_head *list,
>>  blk_mq_run_hw_queue(hctx, true);
>>  else if (needs_restart && (ret == BLK_STS_RESOURCE))
>>  blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>> +
>> +return false;
>>  }
>>  
>> +/*
>> + * If the host/device is unable to accept more work, inform the
>> + * caller of that.
>> + */
>> +if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
>> +return false;
> 
> The above change may not be needed since one invariant is that
> !list_empty(list) becomes true if either BLK_STS_RESOURCE or 
> BLK_STS_DEV_RESOURCE
> is returned from .queue_rq().

Agree, that's one case, but it's more bullet proof this way. And explicit,
I'd rather not break this odd case again.

-- 
Jens Axboe



Re: [PATCH] blk-mq: don't queue more if we get a busy return

2018-06-28 Thread Ming Lei
On Thu, Jun 28, 2018 at 09:46:50AM -0600, Jens Axboe wrote:
> Some devices have different queue limits depending on the type of IO. A
> classic case is SATA NCQ, where some commands can queue, but others
> cannot. If we have NCQ commands inflight and encounter a non-queueable
> command, the driver returns busy. Currently we attempt to dispatch more
> from the scheduler, if we were able to queue some commands. But for the
> case where we ended up stopping due to BUSY, we should not attempt to
> retrieve more from the scheduler. If we do, we can get into a situation
> where we attempt to queue a non-queueable command, get BUSY, then
> successfully retrieve more commands from that scheduler and queue those.
> This can repeat forever, starving the non-queueable command indefinitely.
> 
> Fix this by NOT attempting to pull more commands from the scheduler, if
> we get a BUSY return. This should also be more optimal in terms of
> letting requests stay in the scheduler for as long as possible, if we
> get a BUSY due to the regular out-of-tags condition.
> 
> Signed-off-by: Jens Axboe 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b6888ff556cf..d394cdd8d8c6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1075,6 +1075,9 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
> **hctx,
>  
>  #define BLK_MQ_RESOURCE_DELAY 3   /* ms units */
>  
> +/*
> + * Returns true if we did some work AND can potentially do more.
> + */
>  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>bool got_budget)
>  {
> @@ -1205,8 +1208,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
> struct list_head *list,
>   blk_mq_run_hw_queue(hctx, true);
>   else if (needs_restart && (ret == BLK_STS_RESOURCE))
>   blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> +
> + return false;
>   }
>  
> + /*
> +  * If the host/device is unable to accept more work, inform the
> +  * caller of that.
> +  */
> + if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
> + return false;

The above change may not be needed since one invariant is that
!list_empty(list) becomes true if either BLK_STS_RESOURCE or 
BLK_STS_DEV_RESOURCE
is returned from .queue_rq().

Thanks,
Ming


[PATCH] Blktrace: bail out early if block debugfs is not configured

2018-06-28 Thread Liu Bo
Since @blk_debugfs_root can't be configured dynamically, we can
save a few memory allocations if it's not there.

Signed-off-by: Liu Bo 
---
 kernel/trace/blktrace.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 987d9a9ae283..b951aa1fac61 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -494,6 +494,9 @@ static int do_blk_trace_setup(struct request_queue *q, char 
*name, dev_t dev,
if (!buts->buf_size || !buts->buf_nr)
return -EINVAL;
 
+   if (!blk_debugfs_root)
+   return -ENOENT;
+
strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
 
@@ -518,9 +521,6 @@ static int do_blk_trace_setup(struct request_queue *q, char 
*name, dev_t dev,
 
ret = -ENOENT;
 
-   if (!blk_debugfs_root)
-   goto err;
-
dir = debugfs_lookup(buts->name, blk_debugfs_root);
if (!dir)
bt->dir = dir = debugfs_create_dir(buts->name, 
blk_debugfs_root);
-- 
1.8.3.1



Re: [PATCH] block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Kent Overstreet
On Thu, Jun 28, 2018 at 04:54:44PM -0700, Bart Van Assche wrote:
> On 06/28/18 16:16, Kent Overstreet wrote:
> > On Fri, Jun 29, 2018 at 07:10:47AM +0800, Ming Lei wrote:
> > > On Thu, Jun 28, 2018 at 11:21 PM, Bart Van Assche
> > >  wrote:
> > > > On 06/27/18 17:30, Ming Lei wrote:
> > > > > 
> > > > > One core idea of immutable bvecs is to use bio->bi_iter and the original
> > > > > bvec table to iterate over anywhere in the bio. That is why .bi_io_vec
> > > > > needs to be copied, but I don't see any reason why .bi_vcnt does.
> > > > > 
> > > > > Do you have any use cases for .bi_vcnt on a cloned bio?
> > > > 
> > > > So far this is only a theoretical concern. There are many functions in 
> > > > the
> > > > block layer that use .bi_vcnt, and it is a lot of work to figure out 
> > > > all the
> > > > callers of all these functions.
> > 
> > Back when I implemented immutable biovecs I thoroughly audited all the 
> > bi_vcnt
> > uses and removed all of them that weren't by the code that owns/submits the 
> > bio.
> > 
> > Grepping around I see one or two suspicious uses.. blk-merge.c in particular
> > 
> > > No, any function using .bi_vcnt on a cloned bio may be a bug, and we
> > > should take a close look.
> > 
> > not just cloned bios, any code using bi_vcnt on a bio it didn't create is 
> > wrong.
> > 
> > so big nack to this patch (I wasn't ccd on it though and it doesn't seem to 
> > have
> > hit lkml, so I can't find the original patch...)
> 
> Hello Kent,
> 
> Thanks for chiming in. The linux-block mailing list is archived by multiple
> websites. The entire e-mail thread is available on e.g.
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg23006.html.
> 
> I have a question for you: at least in kernel v4.17 bio_clone_bioset()
> copies bi_vcnt from the source to the destination bio. However,
> __bio_clone_fast() doesn't copy bi_vcnt. Isn't that an inconsistency?

No - when you use bio_clone_bioset() you get a bio that you own and can do
whatever you want with, so it does make sense for it to initialize bi_vcnt.

e.g. you could use bio_clone_bioset() when you're going to be bouncing a bio,
iterating over each bvec and allocating a new page and copying data from the old
page to the new page.
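
For reference, a minimal sketch of the distinction being drawn here, assuming the
v4.17-era bio API: code that does not own a bio (e.g. one cloned with
__bio_clone_fast()) walks it via bi_iter, so bi_vcnt is never consulted. The
helper below is illustrative only, not existing block-layer code.

#include <linux/bio.h>

/*
 * Illustrative: iterate a bio we do not own.  bio_for_each_segment()
 * advances a private copy of bio->bi_iter over the shared bvec table,
 * so neither bi_vcnt nor bi_io_vec is touched directly.
 */
static unsigned int count_payload_bytes(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int bytes = 0;

	bio_for_each_segment(bv, bio, iter)
		bytes += bv.bv_len;

	return bytes;
}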


Re: [PATCH] block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Bart Van Assche

On 06/28/18 16:16, Kent Overstreet wrote:

On Fri, Jun 29, 2018 at 07:10:47AM +0800, Ming Lei wrote:

On Thu, Jun 28, 2018 at 11:21 PM, Bart Van Assche
 wrote:

On 06/27/18 17:30, Ming Lei wrote:


One core idea of immutable bvecs is to use bio->bi_iter and the original
bvec table to iterate over anywhere in the bio. That is why .bi_io_vec
needs to be copied, but I don't see any reason why .bi_vcnt does.

Do you have any use cases for .bi_vcnt on a cloned bio?


So far this is only a theoretical concern. There are many functions in the
block layer that use .bi_vcnt, and it is a lot of work to figure out all the
callers of all these functions.


Back when I implemented immutable biovecs I thoroughly audited all the bi_vcnt
uses and removed all of them that weren't by the code that owns/submits the bio.

Grepping around I see one or two suspicious uses.. blk-merge.c in particular


No, any function using .bi_vcnt on a cloned bio may be a bug, and we should
take a close look.


not just cloned bios, any code using bi_vcnt on a bio it didn't create is wrong.

so big nack to this patch (I wasn't ccd on it though and it doesn't seem to have
hit lkml, so I can't find the original patch...)


Hello Kent,

Thanks for chiming in. The linux-block mailing list is archived by 
multiple websites. The entire e-mail thread is available on e.g. 
https://www.mail-archive.com/linux-block@vger.kernel.org/msg23006.html.


I have a question for you: at least in kernel v4.17 bio_clone_bioset() 
copies bi_vcnt from the source to the destination bio. However,

__bio_clone_fast() doesn't copy bi_vcnt. Isn't that an inconsistency?

Thanks,

Bart.


Re: [PATCH blktests v2 3/3] Add tests for the SRP initiator and target drivers

2018-06-28 Thread Omar Sandoval
On Wed, Jun 27, 2018 at 02:49:08PM -0700, Bart Van Assche wrote:
> This patch adds the following tests:
> 001: Create and remove LUNs
> 002: File I/O on top of multipath concurrently with logout and login (mq)
> 003: File I/O on top of multipath concurrently with logout and login (sq)
> 004: File I/O on top of multipath concurrently with logout and login 
> (sq-on-mq)
> 005: Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M
> 006: Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M
> 007: Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M
> 008: Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M
> 009: Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M
> 010: Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M
> 011: Block I/O on top of multipath concurrently with logout and login
> 012: dm-mpath on top of multiple I/O schedulers
> 013: Direct I/O using a discontiguous buffer
> 
> Other changes in this patch are:
> - Document required kernel config options, user space packages and
>   multipath.conf in README.md.
> - Add file tests/srp/functions with shell functions that are used by
>   multiple SRP tests.
> - Add file tests/srp/rc.
> 
> Signed-off-by: Bart Van Assche 
> ---

Thanks, Bart, happy to see this making progress :)

The tests currently fail like this for me:

srp/001 (Create and remove LUNs) [failed]
runtime  18.238s  ...  17.925s
--- tests/srp/001.out   2018-06-28 15:18:36.533835920 -0700
+++ results/nodev/srp/001.out.bad   2018-06-28 16:21:53.187213618 -0700
@@ -1,6 +1,6 @@
+Unloaded the ib_srp kernel module
 Unloaded the ib_srpt kernel module
 Unloaded the rdma_rxe kernel module
-
 Configured SRP target driver
 Unloaded the ib_srp kernel module
 count_luns(): 3 <> 3
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) 
[failed]
runtime  6.680s  ...  6.640s
--- tests/srp/002.out   2018-06-28 15:18:36.537169282 -0700
+++ results/nodev/srp/002.out.bad   2018-06-28 16:21:59.930603931 -0700
@@ -3,5 +3,4 @@
 Unloaded the rdma_rxe kernel module
 Configured SRP target driver
 Unloaded the ib_srp kernel module
-Unloaded the ib_srpt kernel module
-Unloaded the rdma_rxe kernel module
+/dev/disk/by-id/dm-uuid-mpath-3600140572616d6469736b310: not found

And everything else fails like srp/002.

Some more comments below.

>  README.md   |   71 +++
>  tests/srp/001   |   71 +++
>  tests/srp/001.out   |8 +
>  tests/srp/002   |   49 ++
>  tests/srp/002.out   |7 +
>  tests/srp/003   |   50 ++
>  tests/srp/003.out   |7 +
>  tests/srp/004   |   50 ++
>  tests/srp/004.out   |7 +
>  tests/srp/005   |   40 ++
>  tests/srp/005.out   |7 +
>  tests/srp/006   |   40 ++
>  tests/srp/006.out   |7 +
>  tests/srp/007   |   40 ++
>  tests/srp/007.out   |7 +
>  tests/srp/008   |   39 ++
>  tests/srp/008.out   |7 +
>  tests/srp/009   |   40 ++
>  tests/srp/009.out   |7 +
>  tests/srp/010   |   40 ++
>  tests/srp/010.out   |7 +
>  tests/srp/011   |   44 ++
>  tests/srp/011.out   |7 +
>  tests/srp/012   |   52 ++
>  tests/srp/012.out   |8 +
>  tests/srp/013   |   48 ++
>  tests/srp/013.out   |8 +
>  tests/srp/functions | 1288 +++
>  tests/srp/rc|   54 ++
>  29 files changed, 2110 insertions(+)
>  create mode 100755 tests/srp/001
>  create mode 100644 tests/srp/001.out
>  create mode 100755 tests/srp/002
>  create mode 100644 tests/srp/002.out
>  create mode 100755 tests/srp/003
>  create mode 100644 tests/srp/003.out
>  create mode 100755 tests/srp/004
>  create mode 100644 tests/srp/004.out
>  create mode 100755 tests/srp/005
>  create mode 100644 tests/srp/005.out
>  create mode 100755 tests/srp/006
>  create mode 100644 tests/srp/006.out
>  create mode 100755 tests/srp/007
>  create mode 100644 tests/srp/007.out
>  create mode 100755 tests/srp/008
>  create mode 100644 tests/srp/008.out
>  create mode 100755 tests/srp/009
>  create mode 100644 tests/srp/009.out
>  create mode 100755 tests/srp/010
>  create mode 100644 tests/srp/010.out
>  create mode 100755 tests/srp/011
>  create mode 100644 tests/srp/011.out
>  create mode 100755 tests/srp/012
>  create mode 100644 tests/srp/012.out
>  create mode 100755 tests/srp/013
>  create mode 100644 tests/srp/013.out
>  create mode 100755 tests/srp/functions
>  create mode 100755 tests/srp/rc
> 
> diff --git a/README.md b/README.md
> index 78a41567416e..d8df938700cd 100644
> --- a/README.md
> +++ b/README.md
> @@ -6,6 +6,38 @@ filesystem testing framework.
>  
>  ## Getting Started
>  
> +Make sure that at least the following symbols are set in the kernel config:
> +
> +* CONFIG_BLK_DEV_DM
> +* CONFIG_BLK_DEV_NULL_BLK
> +* CONFIG_BLK_DEV_RAM
> +* CONFIG_BLK_DEV_SD
> +* CONFIG_CHR_DEV_SG
> 

Re: [PATCH] block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Kent Overstreet
On Fri, Jun 29, 2018 at 07:10:47AM +0800, Ming Lei wrote:
> On Thu, Jun 28, 2018 at 11:21 PM, Bart Van Assche
>  wrote:
> > On 06/27/18 17:30, Ming Lei wrote:
> >>
> >> One core idea of immutable bvecs is to use bio->bi_iter and the original
> >> bvec table to iterate over anywhere in the bio. That is why .bi_io_vec
> >> needs to be copied, but I don't see any reason why .bi_vcnt does.
> >>
> >> Do you have any use cases for .bi_vcnt on a cloned bio?
> >
> >
> > So far this is only a theoretical concern. There are many functions in the
> > block layer that use .bi_vcnt, and it is a lot of work to figure out all the
> > callers of all these functions.

Back when I implemented immutable biovecs I thoroughly audited all the bi_vcnt
uses and removed all of them that weren't by the code that owns/submits the bio.

Grepping around I see one or two suspicious uses.. blk-merge.c in particular

> No, any function using .bi_vcnt on a cloned bio may be a bug, and we should
> take a close look.

not just cloned bios, any code using bi_vcnt on a bio it didn't create is wrong.

so big nack to this patch (I wasn't ccd on it though and it doesn't seem to have
hit lkml, so I can't find the original patch...)


Re: [PATCH] block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Ming Lei
On Thu, Jun 28, 2018 at 11:21 PM, Bart Van Assche
 wrote:
> On 06/27/18 17:30, Ming Lei wrote:
>>
>> One core idea of immutable bvecs is to use bio->bi_iter and the original
>> bvec table to iterate over anywhere in the bio. That is why .bi_io_vec
>> needs to be copied, but I don't see any reason why .bi_vcnt does.
>>
>> Do you have any use cases for .bi_vcnt on a cloned bio?
>
>
> So far this is only a theoretical concern. There are many functions in the
> block layer that use .bi_vcnt, and it is a lot of work to figure out all the
> callers of all these functions.

No, any function using .bi_vcnt on a cloned bio may be a bug, and we should
take a close look.

Thanks,
Ming Lei


Re: [PATCH] block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Ming Lei
On Thu, Jun 28, 2018 at 09:53:23AM -0600, Jens Axboe wrote:
> On 6/27/18 2:12 PM, Bart Van Assche wrote:
> > Although __bio_clone_fast() copies bi_io_vec, it does not copy bi_vcnt,
> > the number of elements in bi_io_vec[] that contains data. Copy bi_vcnt
> > such that code that needs this member behaves identically for original
> > and for cloned requests.
> 
> Applied - it's correct for the current base.
> 

Users of a cloned bio shouldn't use its .bi_vcnt directly, since this
counter represents the original bio's actual io vector count and has
nothing to do with the cloned bio.

So I don't understand why we need to copy it.

Thanks,
Ming


Re: [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests

2018-06-28 Thread Jens Axboe
On 6/28/18 1:52 PM, Steven Rostedt wrote:
> On Thu, 28 Jun 2018 09:27:08 +0200
> Johannes Thumshirn  wrote:
> 
>> Hi Ming,
>>
>> On Thu, Jun 28, 2018 at 11:19:16AM +0800, Ming Lei wrote:
>>> +   list_for_each_entry(rq, list, queuelist) {
>>> BUG_ON(rq->mq_ctx != ctx);
> >>> -   list_del_init(&rq->queuelist);
>>> -   __blk_mq_insert_req_list(hctx, rq, false);
>>> +   trace_block_rq_insert(hctx->queue, rq);
>>> }  
>>
>> I wonder if we really need the above loop unconditionally. It does
>> some BUG_ON() sanity checking (which I hate but it was already there
>> so not your problem) and tracing of the request insertion.
>>
>> So can we maybe only run this loop if tracing is enabled? Not sure if
>> this is possible though. Maybe Steven (Cced) can help here.
> 
> Yes:
> 
>   if (trace_block_rq_insert_enabled()) {
>   list_for_each_entry(rq, list, queuelist) {
>   BUG_ON(rq->mq_ctx != ctx);
>   list_del_init(&rq->queuelist);
>   __blk_mq_insert_req_list(hctx, rq, false);
>   trace_block_rq_insert(hctx->queue, rq);
>   }
>   }
> 
> This will only call the loop if the trace event "block_rq_insert" has
> been activated. It also uses the jump label infrastructure, so that if
> statement is a non-conditional branch (static_key_false()).

This works for both ftrace and blktrace?

-- 
Jens Axboe



Re: [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests

2018-06-28 Thread Steven Rostedt
On Thu, 28 Jun 2018 09:27:08 +0200
Johannes Thumshirn  wrote:

> Hi Ming,
> 
> On Thu, Jun 28, 2018 at 11:19:16AM +0800, Ming Lei wrote:
> > +   list_for_each_entry(rq, list, queuelist) {
> > BUG_ON(rq->mq_ctx != ctx);
> > -   list_del_init(&rq->queuelist);
> > -   __blk_mq_insert_req_list(hctx, rq, false);
> > +   trace_block_rq_insert(hctx->queue, rq);
> > }  
> 
> I wonder if we really need the above loop unconditionally. It does
> some BUG_ON() sanity checking (which I hate but it was already there
> so not your problem) and tracing of the request insertion.
> 
> So can we maybe only run this loop if tracing is enabled? Not sure if
> this is possible though. Maybe Steven (Cced) can help here.

Yes:

if (trace_block_rq_insert_enabled()) {
list_for_each_entry(rq, list, queuelist) {
BUG_ON(rq->mq_ctx != ctx);
list_del_init(&rq->queuelist);
__blk_mq_insert_req_list(hctx, rq, false);
trace_block_rq_insert(hctx->queue, rq);
}
}

This will only call the loop if the trace event "block_rq_insert" has
been activated. It also uses the jump label infrastructure, so that if
statement is a non-conditional branch (static_key_false()).

-- Steve



Re: [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests

2018-06-28 Thread Jens Axboe
On 6/28/18 1:27 AM, Johannes Thumshirn wrote:
> Hi Ming,
> 
> On Thu, Jun 28, 2018 at 11:19:16AM +0800, Ming Lei wrote:
>> +list_for_each_entry(rq, list, queuelist) {
>>  BUG_ON(rq->mq_ctx != ctx);
>> -list_del_init(&rq->queuelist);
>> -__blk_mq_insert_req_list(hctx, rq, false);
>> +trace_block_rq_insert(hctx->queue, rq);
>>  }
> 
> I wonder if we really need the above loop unconditionally. It does
> some BUG_ON() sanity checking (which I hate but it was already there
> so not your problem) and tracing of the request insertion.

We can probably kill that now, but jfyi, this was introduced because
of a specific issue.

> So can we maybe only run this loop if tracing is enabled? Not sure if
> this is possible though. Maybe Steven (Cced) can help here.

Don't think it is, since we can be tracing through a variety of
different ways. I think it's best to leave this as-is. If this
were to be changed, it'd have to be a separate patch anyway,
it should not be mixed up with this change.

It does need the list_splice_tail_init() as others found out,
or you'll leave the source list corrupted.

-- 
Jens Axboe
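
For reference, a minimal sketch of the difference Jens is pointing at, assuming
<linux/list.h> semantics; the function name below is illustrative only.

#include <linux/list.h>

/*
 * list_splice_tail() moves the entries of @src onto the tail of @dst but
 * leaves @src's head still pointing at the spliced entries, so a caller
 * that keeps using @src afterwards walks a corrupted list.
 * list_splice_tail_init() performs the same splice and then reinitialises
 * @src to an empty list, which is what blk_mq_insert_requests() needs
 * because its caller reuses the passed-in list.
 */
static void move_all_requests(struct list_head *src, struct list_head *dst)
{
	list_splice_tail_init(src, dst);	/* @src is a valid, empty list again */
}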



Re: [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests

2018-06-28 Thread Johannes Thumshirn
Hi Ming,

On Thu, Jun 28, 2018 at 11:19:16AM +0800, Ming Lei wrote:
> + list_for_each_entry(rq, list, queuelist) {
>   BUG_ON(rq->mq_ctx != ctx);
> - list_del_init(&rq->queuelist);
> - __blk_mq_insert_req_list(hctx, rq, false);
> + trace_block_rq_insert(hctx->queue, rq);
>   }

I wonder if we really need the above loop unconditionally. It does
some BUG_ON() sanity checking (which I hate but it was already there
so not your problem) and tracing of the request insertion.

So can we maybe only run this loop if tracing is enabled? Not sure if
this is possible though. Maybe Steven (Cced) can help here.

Byte,
 Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] blk-mq: don't queue more if we get a busy return

2018-06-28 Thread Omar Sandoval
On Thu, Jun 28, 2018 at 09:46:50AM -0600, Jens Axboe wrote:
> Some devices have different queue limits depending on the type of IO. A
> classic case is SATA NCQ, where some commands can queue, but others
> cannot. If we have NCQ commands inflight and encounter a non-queueable
> command, the driver returns busy. Currently we attempt to dispatch more
> from the scheduler, if we were able to queue some commands. But for the
> case where we ended up stopping due to BUSY, we should not attempt to
> retrieve more from the scheduler. If we do, we can get into a situation
> where we attempt to queue a non-queueable command, get BUSY, then
> successfully retrieve more commands from that scheduler and queue those.
> This can repeat forever, starving the non-queueable command indefinitely.
> 
> Fix this by NOT attempting to pull more commands from the scheduler, if
> we get a BUSY return. This should also be more optimal in terms of
> letting requests stay in the scheduler for as long as possible, if we
> get a BUSY due to the regular out-of-tags condition.

Reviewed-by: Omar Sandoval 
> Signed-off-by: Jens Axboe 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b6888ff556cf..d394cdd8d8c6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1075,6 +1075,9 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
> **hctx,
>  
>  #define BLK_MQ_RESOURCE_DELAY 3   /* ms units */
>  
> +/*
> + * Returns true if we did some work AND can potentially do more.
> + */
>  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>bool got_budget)
>  {
> @@ -1205,8 +1208,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
> struct list_head *list,
>   blk_mq_run_hw_queue(hctx, true);
>   else if (needs_restart && (ret == BLK_STS_RESOURCE))
>   blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> +
> + return false;
>   }
>  
> + /*
> +  * If the host/device is unable to accept more work, inform the
> +  * caller of that.
> +  */
> + if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
> + return false;
> +
>   return (queued + errors) != 0;
>  }
>  
> -- 
> Jens Axboe
>  


Re: [PATCH] block: Fix cloning of requests with a special payload

2018-06-28 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH v2] blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

2018-06-28 Thread Jens Axboe
On 6/27/18 6:00 PM, Bart Van Assche wrote:
> On 06/27/18 16:27, Ming Lei wrote:
>> On Wed, Jun 27, 2018 at 01:02:12PM -0700, Bart Van Assche wrote:
>>> Because the hctx lock is not held around the only
>>> blk_mq_tag_wakeup_all() call in the block layer, the wait queue
>>> entry removal in blk_mq_dispatch_wake() is protected by the wait
>>> queue lock only. Since the hctx->dispatch_wait entry can occur on
>>> any of the SBQ_WAIT_QUEUES, the wait queue presence check, adding
>>> .dispatch_wait to a wait queue and removing the wait queue entry
>>> must all be protected by both the hctx lock and the wait queue
>>> lock.
>>
>> Actually we don't need to use hctx->lock for protecting
>> hctx->dispatch_wait, and one new lock of hctx->dispatch_wait_lock is
>> enough, please see the following patch:
>>
>>  https://marc.info/?l=linux-block=152998658713265=2
>>
>> Then we can avoid to disable irq when acquiring hctx->lock.
> 
> I think it's more a matter of taste than a technical decision to choose 
> which patch goes upstream.

I do think the split lock is cleaner in this case, since it avoids
making hctx->lock irq disabling.

-- 
Jens Axboe



Re: [PATCH] block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Jens Axboe
On 6/27/18 2:12 PM, Bart Van Assche wrote:
> Although __bio_clone_fast() copies bi_io_vec, it does not copy bi_vcnt,
> the number of elements in bi_io_vec[] that contains data. Copy bi_vcnt
> such that code that needs this member behaves identically for original
> and for cloned requests.

Applied - it's correct for the current base.

-- 
Jens Axboe



Re: [PATCH] block: Fix cloning of requests with a special payload

2018-06-28 Thread Jens Axboe
On 6/27/18 1:55 PM, Bart Van Assche wrote:
> This patch avoids that removing a path controlled by the dm-mpath driver
> while mkfs is running triggers the following kernel bug:
> 
> kernel BUG at block/blk-core.c:3347!
> invalid opcode:  [#1] PREEMPT SMP KASAN
> CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
> RIP: 0010:blk_end_request_all+0x68/0x70
> Call Trace:
>  
>  dm_softirq_done+0x326/0x3d0 [dm_mod]
>  blk_done_softirq+0x19b/0x1e0
>  __do_softirq+0x128/0x60d
>  irq_exit+0x100/0x110
>  smp_call_function_single_interrupt+0x90/0x330
>  call_function_single_interrupt+0xf/0x20
>  

Thanks Bart, applied for 4.18.

-- 
Jens Axboe



[PATCH] blk-mq: don't queue more if we get a busy return

2018-06-28 Thread Jens Axboe
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queueable command indefinitely.

Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.

Signed-off-by: Jens Axboe 

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b6888ff556cf..d394cdd8d8c6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1075,6 +1075,9 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
 
 #define BLK_MQ_RESOURCE_DELAY  3   /* ms units */
 
+/*
+ * Returns true if we did some work AND can potentially do more.
+ */
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1205,8 +1208,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
blk_mq_run_hw_queue(hctx, true);
else if (needs_restart && (ret == BLK_STS_RESOURCE))
blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
+
+   return false;
}
 
+   /*
+* If the host/device is unable to accept more work, inform the
+* caller of that.
+*/
+   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
+   return false;
+
return (queued + errors) != 0;
 }
 
-- 
Jens Axboe
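
For context, a rough sketch of how that boolean is consumed by the scheduler
dispatch path (simplified from the blk-mq-sched.c loop of that era; the
request-fetching helper is a placeholder, not a real function): returning false
on a BUSY return stops the loop from pulling further requests out of the I/O
scheduler.

#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/list.h>

/* Placeholder for the I/O scheduler's ->dispatch_request() hook. */
struct request *fetch_next_scheduler_request(struct blk_mq_hw_ctx *hctx);
/* Declared in block/blk-mq.h in the real tree. */
bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
			     bool got_budget);

static void dispatch_from_scheduler(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;
	LIST_HEAD(rq_list);
	struct request *rq;

	do {
		rq = fetch_next_scheduler_request(hctx);
		if (!rq)
			break;
		list_add_tail(&rq->queuelist, &rq_list);
		/* Keep pulling from the scheduler only while the previous
		 * dispatch made progress and did not report a BUSY return. */
	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
}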
 


Re: block: Fix cloning of requests with a special payload

2018-06-28 Thread Mike Snitzer
On Wed, Jun 27 2018 at  3:55pm -0400,
Bart Van Assche  wrote:

> This patch avoids that removing a path controlled by the dm-mpath driver
> while mkfs is running triggers the following kernel bug:
> 
> kernel BUG at block/blk-core.c:3347!
> invalid opcode:  [#1] PREEMPT SMP KASAN
> CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
> RIP: 0010:blk_end_request_all+0x68/0x70
> Call Trace:
>  
>  dm_softirq_done+0x326/0x3d0 [dm_mod]
>  blk_done_softirq+0x19b/0x1e0
>  __do_softirq+0x128/0x60d
>  irq_exit+0x100/0x110
>  smp_call_function_single_interrupt+0x90/0x330
>  call_function_single_interrupt+0xf/0x20
>  
> 
> Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Mike Snitzer 
> Cc: Ming Lei 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> Cc: 

Acked-by: Mike Snitzer 


Re: block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Mike Snitzer
On Thu, Jun 28 2018 at 11:21am -0400,
Bart Van Assche  wrote:

> On 06/27/18 17:30, Ming Lei wrote:
> >One core idea of immutable bvecs is to use bio->bi_iter and the original
> >bvec table to iterate over anywhere in the bio. That is why .bi_io_vec
> >needs to be copied, but I don't see any reason why .bi_vcnt does.
> >
> >Do you have any use cases for .bi_vcnt on a cloned bio?
> 
> So far this is only a theoretical concern. There are many functions
> in the block layer that use .bi_vcnt, and it is a lot of work to
> figure out all the callers of all these functions.

No point wasting time with that.  I don't understand why Ming cares.
Your change is obviously correct.  The state should get transferred over
to reflect reality.

This patch doesn't harm anything, it just prevents some clone-specific
failure in the future.

Acked-by: Mike Snitzer 


Re: [PATCH] block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Bart Van Assche

On 06/27/18 17:30, Ming Lei wrote:

One core idea of immutable bvecs is to use bio->bi_iter and the original
bvec table to iterate over anywhere in the bio. That is why .bi_io_vec
needs to be copied, but I don't see any reason why .bi_vcnt does.

Do you have any use cases for .bi_vcnt on a cloned bio?


So far this is only a theoretical concern. There are many functions in 
the block layer that use .bi_vcnt, and it is a lot of work to figure out 
all the callers of all these functions.


Bart.


Re: [PATCH] block: Fix cloning of requests with a special payload

2018-06-28 Thread Christoph Hellwig
Thanks,

this looks good:

Reviewed-by: Christoph Hellwig 


Re: Subject: [PATCH RFC] block: fix Amiga RDB partition support for disks >= 2 TB

2018-06-28 Thread jdow

On 20180628 00:39, Geert Uytterhoeven wrote:

Hi Martin,

On Thu, Jun 28, 2018 at 9:29 AM Martin Steigerwald  wrote:

Michael Schmitz - 28.06.18, 06:58:
[…]

In the interest of least surprises, we have to fix the 32 bit
overflow (so we can even detect that it would have happened), and
give the user the chance to carefully consider whether to accept
the new behaviour. That means refusing to make available any
partition that would have been affected by such overflow.


That is acceptable for me as I told before. Either mount or refuse
to
mount, but do not overflow and mount nonetheless :)

Mind you, I am not using my Amiga machines either at the moment. And
I repurposed the 2 TB disk years ago.


That's fine - I understand the 'profile' image was a true binary copy
of the RDB, and placing that file at offset 0 in an image file is a
legitimate use?


You actually ask me to remember about what that 'profile' image was? :)

Well, in the attachment note on the bug report I wrote: "should be just
a binary copy", so I did not know exactly back then either. However the
file starts with "RDSK" and then it has "PART" headers and so on. That
looks pretty much like a binary copy of an RDB. I am a bit surprised by
its small size of 2 KiB. But I see three partitions in there. According
to the screenshot I also provided, the disk had three partitions. So
probably Media Toolbox has been intelligent enough to just copy the used
space of the reserved RDB area. Cause I think the reserved space must
have been higher than 2 KiB. However the RDB/disk geometry editing
screen does not display it and off hand I do not know where to look
inside the RDB to see how much space has been reserved. Interestingly
the "Total sectors" value in that Media Toolbox window also overflowed.


The RDB can be anywhere in the first 2 tracks of the disk, and is identified
by the "RDSK" block (with a correct checksum). The remainder (e.g. "PART"
blocks) is in a linked list. So 2 KiB sounds fine for 3 partitions (1 RDSK +
3 * PART = 4 blocks = 4 * 512 bytes).


The RDB can be anywhere in the first 16 blocks on the disk if we are speaking 
officially. That's the entire area that is guaranteed to be inspected for an 
RDSK block. The PART block can, I think, even be located in front of the RDSK if 
you want to be obscene about it. {^_-} That's the kind of thing I checked with 
the HardFrame device driver ROM image. I preferred block 3 to keep it away from 
the space used by other partitioning schemes. It also enabled me to embed it in a 
reserved area within the first actual partition just for the halibut. (Pronounce 
it sideways and it makes more sense.) I used that technique fairly often. If you 
think about it that gives you a wee tiny bit more disk space because you can 
tailor the reserved area to precisely fit the filesystem image plus some extra 
in case of updates. I toyed with using a pointer to FDSK blocks in the Dev 
directory but that got too insane. RDBs are insanely flexible, which may not be 
a good thing.


{^_^}


Re: Subject: [PATCH RFC] block: fix Amiga RDB partition support for disks >= 2 TB

2018-06-28 Thread Geert Uytterhoeven
Hi Joanne,

On Thu, Jun 28, 2018 at 11:20 AM jdow  wrote:
> On 20180627 23:45, Geert Uytterhoeven wrote:
> > On Thu, Jun 28, 2018 at 6:59 AM Michael Schmitz  
> > wrote:
> >> Am 28.06.2018 um 09:20 schrieb Martin Steigerwald:
> > And as stated in my other reply to the patch:
> > partition needs 64 bit disk device support in AmigaOS or AmigaOS
> > like
> > operating systems (NSD64, TD64 or SCSI direct)
> 
>  I'd probably leave it at 'disk needs 64 bit disk device support on
>  native OS', and only print that warning once.
> >>>
> >>> This is fine with me.
> >>
> >> OK, I'll go with that.
> >
> > Do we really need the warning?
> > Once the parsing is fixed doing 64-bit math, it does not matter for Linux
> > anymore.
>
> Dual booting.

(Dual/Triple/...) Booting older AmigaOS variants is an issue anyway, with
or without Linux.

It's like _Linux_ printing warnings that you cannot use your disk, created with
Solaris 2, on an old SunOS 4.1 machine ;-)

IMHO (s)he who creates partitions on large disks should make sure (s)he only
uses them on machines that can handle them.  This is not a Linux issue.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: Subject: [PATCH RFC] block: fix Amiga RDB partition support for disks >= 2 TB

2018-06-28 Thread Geert Uytterhoeven
Hi Martin,

On Thu, Jun 28, 2018 at 9:13 AM Martin Steigerwald  wrote:
> Geert Uytterhoeven - 28.06.18, 08:45:
> > On Thu, Jun 28, 2018 at 6:59 AM Michael Schmitz 
> wrote:
> > > Am 28.06.2018 um 09:20 schrieb Martin Steigerwald:
> > > >>> And as stated in my other reply to the patch:
> > > >>> partition needs 64 bit disk device support in AmigaOS or AmigaOS
> > > >>> like
> > > >>> operating systems (NSD64, TD64 or SCSI direct)
> > > >>
> > > >> I'd probably leave it at 'disk needs 64 bit disk device support
> > > >> on
> > > >> native OS', and only print that warning once.
> > > >
> > > > This is fine with me.
> > >
> > > OK, I'll go with that.
> >
> > Do we really need the warning?
> > Once the parsing is fixed doing 64-bit math, it does not matter for
> > Linux anymore.
>
> Well, irony of this is: In my case the RDB has been created on a machine
> with a native OS. So Linux warns me about something I already did so on
> the native OS without any warning. In this case AmigaOS 4.0.

Exactly.

So moving a disk partitioned under AmigaOS 4.0 to a system running an
older version of AmigaOS can fail miserably. Not a Linux issue.
Linux also doesn't warn about disks with GPT failing to work on old MSDOS.

> > > > I would not name the kernel option "eat_my_rdb", but use a less
> > > > dramatizing name.
> > > >
> > > > Maybe just: "allow_64bit_rdb" or something like that.
> > >
> > > I don't expect to get away with that :-)
> >
> > I still fail to see what's the added value of the kernel option...
> > Either the partition is usable, or not.
>
> Well, I could try to contact some of the current AmigaOS developers
> about that and ask them whether they would like to give me a statement
> about this that I am allowed to post here.
>
> I would not know whether they answer and it may take a time. My offer
> stands, but I would only do this, if you really like to have that
> official feedback.

Let me clarify: what exactly would the kernel option allow? When to use it?

> Again, I am pretty sure that what I did is safe on AmigaOS 4 at least,
> but I bet also on AmigaOS <4 with NSD64 or TD64 (except for the
> filesystem sizes, but AmigaOS < 4 does not have JXFS anyway, and did not
> have SFS2 as well, maybe that is available now, I don´t know).
>
> However Joanne is without doubt an authority on RDBs, but she has not

As a former AmigaOS user, I'm fully aware of that (Thanks Joanne! ;-)

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
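
For context, a minimal sketch of the 64-bit math under discussion when turning an
RDB partition's cylinder range into sectors. The parameter names describe the
geometry conceptually; they are not the exact kernel structure fields.

#include <linux/types.h>

/*
 * RDB PART blocks describe a partition as a cylinder range plus a
 * geometry (heads x sectors per track).  Doing the multiplication in
 * 32 bits overflows for partitions beyond 2 TB; widening to sector_t
 * (64-bit) before multiplying is the fix being discussed.
 */
static void rdb_part_bounds(u32 lo_cyl, u32 hi_cyl,
			    u32 heads, u32 sects_per_track,
			    sector_t *start, sector_t *nr_sects)
{
	sector_t blks_per_cyl = (sector_t)heads * sects_per_track;

	*start    = (sector_t)lo_cyl * blks_per_cyl;
	*nr_sects = (sector_t)(hi_cyl - lo_cyl + 1) * blks_per_cyl;
}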


Re: Subject: [PATCH RFC] block: fix Amiga RDB partition support for disks >= 2 TB

2018-06-28 Thread jdow




On 20180627 23:45, Geert Uytterhoeven wrote:

Hi Michael,

On Thu, Jun 28, 2018 at 6:59 AM Michael Schmitz  wrote:

Am 28.06.2018 um 09:20 schrieb Martin Steigerwald:

And as stated in my other reply to the patch:
partition needs 64 bit disk device support in AmigaOS or AmigaOS
like
operating systems (NSD64, TD64 or SCSI direct)


I'd probably leave it at 'disk needs 64 bit disk device support on
native OS', and only print that warning once.


This is fine with me.


OK, I'll go with that.


Do we really need the warning?
Once the parsing is fixed doing 64-bit math, it does not matter for Linux
anymore.


Dual booting.
{^_^}



Re: [PATCH] lightnvm: pblk: Add read memory barrier when reading from rb

2018-06-28 Thread Javier Gonzalez
> On 28 Jun 2018, at 09.59, Matias Bjørling  wrote:
> 
> On 06/28/2018 01:31 AM, Heiner Litz wrote:
>> There is a control dependency between two disjoint variables (only read data 
>> if flags == WRITTEN). Because x86-TSO allows re-ordering of loads the 
>> control dependency can be violated.
> 
> I'm sorry, I do not see it :)
> 
> Here is my understanding:
> 
> entry->w_ctx.flags is used as a flagging mechanism to wait for data to become 
> available when "flags" has PBLK_WRITTEN_DATA.
> 
> The later access to entry->data is independent of this. It is assumed that 
> entry->w_ctx.flags is the guarding barrier.

This is correct. The motivation is to allow several producers to fill the
buffer simultaneously.

> 
> Here is the titbit from the control dependency section that makes me
> say that the entry->data loads is not possible to be done before
> entry->w_ctx.flags:
> 
> " (*) Control dependencies apply only to the then-clause and else-clause
>  of the if-statement containing the control dependency, including
>  any functions that these two clauses call.  Control dependencies
>  do -not- apply to code following the if-statement containing the
>  control dependency."
> 
> The read of entry->data must come after the READ_ONCE of entry->w_ctx.flags.
> 

I also understood it this way when implementing the barrier, but after
an offline discussion with Heiner, he convinced me that reordering could
occur.

>> BTW: ARM has an even more relaxed memory model and also allows to
>> re-order writes. If we want to support ARM correctly, I think we also
>> need to insert a memory barrier between writing data and writing
>> flags (and there might be a couple other places where we need to
>> check).

Heiner: Can you elaborate on this?

Javier


signature.asc
Description: Message signed with OpenPGP


Re: [PATCH] lightnvm: pblk: Add read memory barrier when reading from rb

2018-06-28 Thread Matias Bjørling

On 06/28/2018 01:31 AM, Heiner Litz wrote:
There is a control dependency between two disjoint variables (only read 
data if flags == WRITTEN). Because x86-TSO allows re-ordering of loads 
the control dependency can be violated.


I'm sorry, I do not see it :)

Here is my understanding:

entry->w_ctx.flags is used as a flagging mechanism to wait for data to 
become available when "flags" has PBLK_WRITTEN_DATA.


The later access to entry->data is independent of this. It is assumed 
that entry->w_ctx.flags is the guarding barrier.


Here is the titbit from the control dependency section that makes me say 
that the entry->data loads is not possible to be done before 
entry->w_ctx.flags:


" (*) Control dependencies apply only to the then-clause and else-clause
  of the if-statement containing the control dependency, including
  any functions that these two clauses call.  Control dependencies
  do -not- apply to code following the if-statement containing the
  control dependency."

The read of entry->data must come after the READ_ONCE of 
entry->w_ctx.flags.




https://www.kernel.org/doc/Documentation/memory-barriers.txt refers to 
the above scenario as control dependency but your description is of 
course also correct. Let me know if we should clarify the comment.


BTW: ARM has an even more relaxed memory model and also allows to 
re-order writes. If we want to support ARM correctly, I think we also 
need to insert a memory barrier between writing data and writing flags 
(and there might be a couple other places where we need to check).


On Fri, Jun 22, 2018, 11:02 AM Matias Bjørling wrote:


On 06/21/2018 12:54 AM, Heiner Litz wrote:
 > READ_ONCE does not imply a read memory barrier in the presence of
control
 > dependencies between two separate memory locations (flags and
data). On x86
 > TSO, reading from the data page might be reordered before the
flags read.
 > See chapter CONTROL DEPENDENCIES in
 > https://www.kernel.org/doc/Documentation/memory-barriers.txt
 >
 > Signed-off-by: Heiner Litz <hl...@ucsc.edu>
 > ---
 >   drivers/lightnvm/pblk-rb.c | 3 +++
 >   1 file changed, 3 insertions(+)
 >
 > diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
 > index a81a97e..5f09983 100644
 > --- a/drivers/lightnvm/pblk-rb.c
 > +++ b/drivers/lightnvm/pblk-rb.c
 > @@ -545,6 +545,9 @@ unsigned int pblk_rb_read_to_bio(struct
pblk_rb *rb, struct nvm_rq *rqd,
 >                       goto try;
 >               }
 >
 > +             /* Observe control dependency between flags and
data read */
 > +             smp_rmb();
 > +
 >               page = virt_to_page(entry->data);
 >               if (!page) {
 >                       pr_err("pblk: could not allocate write bio
page\n");
 >

Hi Heiner,

Can you help explain how it is a control dependency? What case am I
missing?

The way I read the code, there is the case where a read of entry->data
happens before entry->w_ctx.flags, but I see that as a memory
reordering, which the smp_rmb() fixes. Would that be more accurate?

Thank you,
Matias
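
For readers following the barrier discussion, a minimal sketch of the flag/data
publish-and-consume pairing at issue; the structure and names are illustrative,
not pblk's actual fields.

#include <linux/compiler.h>
#include <asm/barrier.h>

struct example_entry {
	void *data;
	int flags;			/* stand-in for w_ctx.flags */
};

/* Producer: make the data visible before publishing the flag. */
static void publish(struct example_entry *e, void *buf)
{
	e->data = buf;
	smp_wmb();			/* order the data store before the flag store */
	WRITE_ONCE(e->flags, 1);	/* 1 ~ "written" */
}

/* Consumer: only dereference data after observing the flag. */
static void *consume(struct example_entry *e)
{
	if (!READ_ONCE(e->flags))
		return NULL;
	smp_rmb();			/* pairs with the producer's smp_wmb() */
	return e->data;
}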





RE: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched

2018-06-28 Thread Kashyap Desai
> Right.
>
> Kashyap, could you test the following patch?
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2f20c9e3efda..7d972b1c3153 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1567,7 +1567,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx
> *hctx, struct blk_mq_ctx *ctx,
>   }
>
>   spin_lock(&ctx->lock);
> - list_splice_tail(list, &ctx->rq_list);
> + list_splice_tail_init(list, &ctx->rq_list);
>   blk_mq_hctx_mark_pending(hctx, ctx);
>   spin_unlock(&ctx->lock);

I tested the above-mentioned change, and the kernel panic posted in this
discussion is resolved.

Kashyap

>  }
>
> Thanks,
> Ming


Re: Subject: [PATCH RFC] block: fix Amiga RDB partition support for disks >= 2 TB

2018-06-28 Thread Geert Uytterhoeven
Hi Martin,

On Thu, Jun 28, 2018 at 9:29 AM Martin Steigerwald  wrote:
> Michael Schmitz - 28.06.18, 06:58:
> […]
> > >> In the interest of least surprises, we have to fix the 32 bit
> > >> overflow (so we can even detect that it would have happened), and
> > >> give the user the chance to carefully consider whether to accept
> > >> the new behaviour. That means refusing to make available any
> > >> partition that would have been affected by such overflow.
> > >
> > > That is acceptable for me as I told before. Either mount or refuse
> > > to
> > > mount, but do not overflow and mount nonetheless :)
> > >
> > > Mind you, I am not using my Amiga machines either at the moment. And
> > > I repurposed the 2 TB disk years ago.
> >
> > That's fine - I understand the 'profile' image was a true binary copy
> > of the RDB, and placing that file at offset 0 in an image file is a
> > legitimate use?
>
> You actually ask me to remember about what that 'profile' image was? :)
>
> Well, in the attachment note on the bug report I wrote: "should be just
> a binary copy", so I did not know exactly back then either. However the
> file starts with "RDSK" and then it has "PART" headers and so on. That
> looks pretty much like a binary copy of an RDB. I am a bit surprised by
> its small size of 2 KiB. But I see three partitions in there. According
> to the screenshot I also provided, the disk had three partitions. So
> probably Media Toolbox has been intelligent enough to just copy the used
> space of the reserved RDB area. Cause I think the reserved space must
> have been higher than 2 KiB. However the RDB/disk geometry editing
> screen does not display it and off hand I do not know where to look
> inside the RDB to see how much space has been reserved. Interestingly
> the "Total sectors" value in that Media Toolbox window also overflowed.

The RDB can be anywhere in the first 2 tracks of the disk, and is identified
by the "RDSK" block (with a correct checksum). The remainder (e.g. "PART"
blocks) is in a linked list. So 2 KiB sounds fine for 3 partitions (1 RDSK +
3 * PART = 4 blocks = 4 * 512 bytes).

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: Subject: [PATCH RFC] block: fix Amiga RDB partition support for disks >= 2 TB

2018-06-28 Thread Martin Steigerwald
Hi Michael.

I was probably right not to submit a patch myself. I'd likely have been
overwhelmed by the discussion and feedback :)

Michael Schmitz - 28.06.18, 06:58:
[…]
> >> In the interest of least surprises, we have to fix the 32 bit
> >> overflow (so we can even detect that it would have happened), and
> >> give the user the chance to carefully consider whether to accept
> >> the new behaviour. That means refusing to make available any
> >> partition that would have been affected by such overflow.
> > 
> > That is acceptable for me as I told before. Either mount or refuse
> > to
> > mount, but do not overflow and mount nonetheless :)
> > 
> > Mind you, I am not using my Amiga machines either at the moment. And
> > I repurposed the 2 TB disk years ago.
> 
> That's fine - I understand the 'profile' image was a true binary copy
> of the RDB, and placing that file at offset 0 in an image file is a
> legitimate use?

You actually ask me to remember about what that 'profile' image was? :)

Well, in the attachment note on the bug report I wrote: "should be just 
a binary copy", so I did not know exactly back then either. However the 
file starts with "RDSK" and then it has "PART" headers and so on. That 
looks pretty much like a binary copy of an RDB. I am a bit surprised by 
its small size of 2 KiB. But I see three partitions in there. According 
to the screenshot I also provided, the disk had three partitions. So 
probably Media Toolbox has been intelligent enough to just copy the used 
space of the reserved RDB area. Cause I think the reserved space must 
have been higher than 2 KiB. However the RDB/disk geometry editing 
screen does not display it and off hand I do not know where to look 
inside the RDB to see how much space has been reserved. Interestingly 
the "Total sectors" value in that Media Toolbox window also overflowed. 
But from my memory this was just a cosmetic issue in Media Toolbox. The 
"*.device" device drivers, the filesystems and the RDB handling code in 
AmigaOS do their own math. That is what NSD64 / TD64 was about.

https://bugzilla.kernel.org/show_bug.cgi?id=43511

> > I would not name the kernel option "eat_my_rdb", but use a less
> > dramatizing name.
> > 
> > Maybe just: "allow_64bit_rdb" or something like that.
> 
> I don't expect to get away with that :-)

Heh. :)

> > How does the user come to know about this kernel option? Will you
> > print its name in kernel log?
> 
> Depends on how easy we want to make it for users. If I put a BUG()
> trap with the check, the resulting log section will point to a
> specific line in block/partitions/amiga.c, from which the override
> option will be obvious. But that might be a little too opaque for
> some...

kernel-parameters.txt or mentioning in the warning would also be an 
option.

Thanks,
-- 
Martin




Re: Subject: [PATCH RFC] block: fix Amiga RDB partition support for disks >= 2 TB

2018-06-28 Thread Martin Steigerwald
Hi Geert.

Geert Uytterhoeven - 28.06.18, 08:45: 
> On Thu, Jun 28, 2018 at 6:59 AM Michael Schmitz  
wrote:
> > Am 28.06.2018 um 09:20 schrieb Martin Steigerwald:
> > >>> And as stated in my other reply to the patch:
> > >>> partition needs 64 bit disk device support in AmigaOS or AmigaOS
> > >>> like
> > >>> operating systems (NSD64, TD64 or SCSI direct)
> > >> 
> > >> I'd probably leave it at 'disk needs 64 bit disk device support
> > >> on
> > >> native OS', and only print that warning once.
> > > 
> > > This is fine with me.
> > 
> > OK, I'll go with that.
> 
> Do we really need the warning?
> Once the parsing is fixed doing 64-bit math, it does not matter for
> Linux anymore.

Well, irony of this is: In my case the RDB has been created on a machine 
with a native OS. So Linux warns me about something I already did so on 
the native OS without any warning. In this case AmigaOS 4.0.
 
> Won't it make more sense to have the warning in the tool that created
> the partition table in the first place?

Well that would be up to the AmigaOS developers to decide.

And the same goes for the amiga-fdisk or parted developers, if they ever 
choose to support this or already do. (I doubt that amiga-fdisk can 
handle this.)

> > > I would not name the kernel option "eat_my_rdb", but use a less
> > > dramatizing name.
> > > 
> > > Maybe just: "allow_64bit_rdb" or something like that.
> > 
> > I don't expect to get away with that :-)
> 
> I still fail to see what's the added value of the kernel option...
> Either the partition is usable, or not.

Well, I could try to contact some of the current AmigaOS developers 
about this and ask them whether they would give me a statement that 
I am allowed to post here.

I do not know whether they would answer, and it may take some time. My 
offer stands, but I would only do this if you really want that official 
feedback.

Again, I am pretty sure that what I did is safe at least on AmigaOS 4, 
and I bet it is also safe on AmigaOS < 4 with NSD64 or TD64 (except for 
the filesystem sizes, but AmigaOS < 4 does not have JXFS anyway and did 
not have SFS2 either; maybe that is available now, I don't know).

Joanne is without doubt an authority on RDBs, but she has not been 
involved with AmigaOS development for quite some time and, correct me 
if this is wrong, Joanne, may not know as much about the recent 
versions as I do, or as the current AmigaOS developers do.

Thanks,
-- 
Martin




Re: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched

2018-06-28 Thread Ming Lei
On Thu, Jun 28, 2018 at 02:03:47PM +0800, jianchao.wang wrote:
> 
> 
> On 06/28/2018 01:42 PM, Kashyap Desai wrote:
> > Ming -
> > 
> > The performance drop is resolved on my setup, but this patch set may be
> > causing some kernel stability issue. I have not tried without the patch
> > set, but if you can figure out whether the crash below is due to this
> > patch set, I can try a potential fix as well.  I am seeing a kernel panic
> > if I use the three settings below in my fio script.
> > 
> > iodepth_batch=8
> > iodepth_batch_complete_min=4
> > iodepth_batch_complete_max=16
> > 
> > 
> > Here is panic detail -
> > 
> > [65237.831029] [ cut here ]
> > [65237.831031] list_add corruption. prev->next should be next
> > (bb1d0a41fdf8), but was 94d0f0c57240. (prev=94d0f0c57240).
> > [65237.831046] WARNING: CPU: 18 PID: 13897 at lib/list_debug.c:28
> > __list_add_valid+0x6a/0x70
> > [65237.831046] Modules linked in: mpt3sas raid_class ses enclosure
> > scsi_transport_sas megaraid_sas(OE) xt_CHECKSUM ipt_MASQUERADE tun
> > ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6
> > xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc
> > ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
> > ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4
> > nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> > iptable_security iptable_raw ebtable_filter ebtables ip6table_filter
> > ip6_tables iptable_filter intel_rapl skx_edac nfit libnvdimm
> > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> > snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel crypto_simd
> > cryptd
> > [65237.831079]  snd_hda_codec snd_hda_core glue_helper snd_hwdep iTCO_wdt
> > snd_seq iTCO_vendor_support snd_seq_device pcspkr snd_pcm sg i2c_i801
> > joydev wmi mei_me ipmi_si mei snd_timer acpi_power_meter ioatdma snd
> > acpi_pad ipmi_devintf ipmi_msghandler soundcore lpc_ich nfsd auth_rpcgss
> > nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ast i2c_algo_bit
> > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe
> > ahci libahci crc32c_intel libata mdio i2c_core dca dm_mirror
> > dm_region_hash dm_log dm_mod
> > [65237.831109] CPU: 18 PID: 13897 Comm: fio Kdump: loaded Tainted: G
> > OE 4.18.0-rc1+ #1
> > [65237.831110] Hardware name: Supermicro Super Server/X11DPG-QT, BIOS 1.0
> > 06/22/2017
> > [65237.831112] RIP: 0010:__list_add_valid+0x6a/0x70
> > [65237.831112] Code: 31 c0 48 c7 c7 90 08 ab a0 e8 22 b4 cd ff 0f 0b 31 c0
> > c3 48 89 d1 48 c7 c7 40 08 ab a0 48 89 f2 48 89 c6 31 c0 e8 06 b4 cd ff
> > <0f> 0b 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57
> > [65237.831130] RSP: 0018:bb1d0a41fdd8 EFLAGS: 00010286
> > [65237.831131] RAX:  RBX: 94d0ebf9 RCX:
> > 0006
> > [65237.831132] RDX:  RSI: 0086 RDI:
> > 94d91ec16970
> > [65237.831132] RBP: db1cff59e980 R08:  R09:
> > 0663
> > [65237.831133] R10: 0004 R11:  R12:
> > 0001
> > [65237.831134] R13:  R14: 94d0f0c57240 R15:
> > 94d0f0f04c40
> > [65237.831135] FS:  7fad10d02740() GS:94d91ec0()
> > knlGS:
> > [65237.831135] CS:  0010 DS:  ES:  CR0: 80050033
> > [65237.831136] CR2: 7fad10c2a004 CR3: 00085600e006 CR4:
> > 007606e0
> > [65237.831137] DR0:  DR1:  DR2:
> > 
> > [65237.831138] DR3:  DR6: fffe0ff0 DR7:
> > 0400
> > [65237.831138] PKRU: 5554
> > [65237.831138] Call Trace:
> > [65237.831145]  blk_mq_flush_plug_list+0x12d/0x260
> > [65237.831150]  blk_flush_plug_list+0xe7/0x240
> > [65237.831152]  blk_finish_plug+0x27/0x40
> > [65237.831156]  __x64_sys_io_submit+0xc9/0x180
> > [65237.831162]  do_syscall_64+0x5b/0x180
> > [65237.831166]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> I guess we need to clean the list after list_splice_tail in the 1/1 patch,
> as follows:
> @@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx 
> *hctx, struct blk_mq_ctx *ctx,
>   struct list_head *list)
>  
>  {
> ...
> +
> +	spin_lock(&ctx->lock);
> +	list_splice_tail(list, &ctx->rq_list);
> +	INIT_LIST_HEAD(list);  // ---> HERE !
> 	blk_mq_hctx_mark_pending(hctx, ctx);
> 	spin_unlock(&ctx->lock);

Right.

Kashyap, could you test the following patch?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2f20c9e3efda..7d972b1c3153 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1567,7 +1567,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	}
 
 	spin_lock(&ctx->lock);
-	list_splice_tail(list, &ctx->rq_list);
+	list_splice_tail_init(list, &ctx->rq_list);
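
For what it's worth, the reason the stale plug list trips __list_add_valid()
can be shown with a tiny userspace sketch. This is a simplified
re-implementation of the splice helper, not the kernel's <linux/list.h>, and
it only exists to show why the source list head must be reinitialised after
splicing:

#include <stdio.h>

/* Minimal circular doubly-linked list, just enough to mimic the kernel API. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

/* Moves all entries of 'list' to the tail of 'head', like the kernel helper. */
static void list_splice_tail(struct list_head *list, struct list_head *head)
{
	struct list_head *first = list->next, *last = list->prev;
	struct list_head *at = head->prev;

	if (first == list)		/* empty source list */
		return;
	first->prev = at;
	at->next = first;
	last->next = head;
	head->prev = last;
	/* Note: 'list' itself is left untouched and now dangles. */
}

int main(void)
{
	struct list_head plug_list, rq_list, rq;

	INIT_LIST_HEAD(&plug_list);
	INIT_LIST_HEAD(&rq_list);

	/* One "request" queued on the caller's plug list. */
	rq.next = rq.prev = &plug_list;
	plug_list.next = plug_list.prev = &rq;

	list_splice_tail(&plug_list, &rq_list);

	/*
	 * plug_list still claims to contain rq, although rq now lives on
	 * rq_list.  The next insertion through plug_list links the two lists
	 * together, which is the kind of inconsistency the list debug check
	 * above complains about.
	 */
	printf("plug_list stale? %s\n", plug_list.next == &rq ? "yes" : "no");

	INIT_LIST_HEAD(&plug_list);	/* the one-line fix */
	printf("plug_list empty? %s\n",
	       plug_list.next == &plug_list ? "yes" : "no");
	return 0;
}

list_splice_tail_init() does the splice and the INIT_LIST_HEAD() in one call,
which is why either variant avoids the corruption.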

Re: Subject: [PATCH RFC] block: fix Amiga RDB partition support for disks >= 2 TB

2018-06-28 Thread Geert Uytterhoeven
Hi Michael,

On Thu, Jun 28, 2018 at 6:59 AM Michael Schmitz  wrote:
> Am 28.06.2018 um 09:20 schrieb Martin Steigerwald:
> >>> And as stated in my other reply to the patch:
> >>> partition needs 64 bit disk device support in AmigaOS or AmigaOS
> >>> like
> >>> operating systems (NSD64, TD64 or SCSI direct)
> >>
> >> I'd probably leave it at 'disk needs 64 bit disk device support on
> >> native OS', and only print that warning once.
> >
> > This is fine with me.
>
> OK, I'll go with that.

Do we really need the warning?
Once the parsing is fixed doing 64-bit math, it does not matter for Linux
anymore.
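
As a rough userspace illustration of what that 64-bit math amounts to (the
geometry values and variable names below are made up for the example, and
this only sketches the arithmetic, not the partition parser itself):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Assumed RDB-style geometry for a partition past the 2 TB mark. */
	uint32_t lo_cyl = 1100000, hi_cyl = 1200000;
	uint32_t surfaces = 16, blocks_per_track = 255, blksize = 1;

	/* 32-bit multiplication wraps once the product exceeds 2^32. */
	uint32_t start32 = lo_cyl * surfaces * blocks_per_track * blksize;

	/* Casting to 64 bits before multiplying keeps the real value. */
	uint64_t start64 = (uint64_t)lo_cyl * surfaces * blocks_per_track * blksize;
	uint64_t nr64 = (uint64_t)(hi_cyl + 1 - lo_cyl) *
			surfaces * blocks_per_track * blksize;

	printf("32-bit start sector: %u (wrapped)\n", start32);
	printf("64-bit start sector: %llu, sectors: %llu\n",
	       (unsigned long long)start64, (unsigned long long)nr64);
	return 0;
}

The RFC patch does the equivalent widening inside block/partitions/amiga.c,
so the computed start sector and partition length no longer wrap.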

Won't it make more sense to have the warning in the tool that created the
partition table in the first place?

> > I would not name the kernel option "eat_my_rdb", but use a less
> > dramatizing name.
> >
> > Maybe just: "allow_64bit_rdb" or something like that.
>
> I don't expect to get away with that :-)

I still fail to see what's the added value of the kernel option...
Either the partition is usable, or not.

> > How does the user come to know about this kernel option? Will you print
> > its name in kernel log?
>
> Depends on how easy we want to make it for users. If I put a BUG() trap
> with the check, the resulting log section will point to a specific line
> in block/partitions/amiga.c, from which the override option will be
> obvious. But that might be a little too opaque for some...

Please don't use BUG(), unless your goal is to attract attention (from
Linus, who dislikes BUG()!).

Using BUG() would be a nice way to DoS someones machine by plugging in
a USB stick with a malformed RDB.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH blktests v2 2/3] Add the discontiguous-io test program

2018-06-28 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH blktests v2 1/3] src/Makefile: Rename $(TARGETS) into $(C_TARGETS)

2018-06-28 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


RE: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched

2018-06-28 Thread Kashyap Desai
>
> I guess we need to clean the list after list_splice_tail in the 1/1 patch,
> as follows:
> @@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct
> blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>   struct list_head *list)
>
>  {
> ...
> +
> +	spin_lock(&ctx->lock);
> +	list_splice_tail(list, &ctx->rq_list);
> +   INIT_LIST_HEAD(list);  // ---> HERE !

I will try replacing "list_splice_tail" with "list_splice_tail_init".

Kashyap

>   blk_mq_hctx_mark_pending(hctx, ctx);
> 	spin_unlock(&ctx->lock);
>
> Thanks
> Jianchao
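
For reference, the one-call variant being discussed would look roughly like
this inside blk_mq_insert_requests() (a sketch of the change under
discussion, not a tested patch):

	spin_lock(&ctx->lock);
	/*
	 * list_splice_tail_init() moves everything from 'list' onto
	 * ctx->rq_list and then reinitialises 'list', so the caller's
	 * plug list head no longer points at requests it has handed over.
	 */
	list_splice_tail_init(list, &ctx->rq_list);
	blk_mq_hctx_mark_pending(hctx, ctx);
	spin_unlock(&ctx->lock);

This is equivalent to the list_splice_tail() + INIT_LIST_HEAD() pair
suggested above.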


Re: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched

2018-06-28 Thread jianchao.wang



On 06/28/2018 01:42 PM, Kashyap Desai wrote:
> Ming -
> 
> The performance drop is resolved on my setup, but this patch set may be
> causing some kernel stability issue. I have not tried without the patch
> set, but if you can figure out whether the crash below is due to this
> patch set, I can try a potential fix as well.  I am seeing a kernel panic
> if I use the three settings below in my fio script.
> 
> iodepth_batch=8
> iodepth_batch_complete_min=4
> iodepth_batch_complete_max=16
> 
> 
> Here is panic detail -
> 
> [65237.831029] [ cut here ]
> [65237.831031] list_add corruption. prev->next should be next
> (bb1d0a41fdf8), but was 94d0f0c57240. (prev=94d0f0c57240).
> [65237.831046] WARNING: CPU: 18 PID: 13897 at lib/list_debug.c:28
> __list_add_valid+0x6a/0x70
> [65237.831046] Modules linked in: mpt3sas raid_class ses enclosure
> scsi_transport_sas megaraid_sas(OE) xt_CHECKSUM ipt_MASQUERADE tun
> ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6
> xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
> ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> iptable_security iptable_raw ebtable_filter ebtables ip6table_filter
> ip6_tables iptable_filter intel_rapl skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel crypto_simd
> cryptd
> [65237.831079]  snd_hda_codec snd_hda_core glue_helper snd_hwdep iTCO_wdt
> snd_seq iTCO_vendor_support snd_seq_device pcspkr snd_pcm sg i2c_i801
> joydev wmi mei_me ipmi_si mei snd_timer acpi_power_meter ioatdma snd
> acpi_pad ipmi_devintf ipmi_msghandler soundcore lpc_ich nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ast i2c_algo_bit
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe
> ahci libahci crc32c_intel libata mdio i2c_core dca dm_mirror
> dm_region_hash dm_log dm_mod
> [65237.831109] CPU: 18 PID: 13897 Comm: fio Kdump: loaded Tainted: G
> OE 4.18.0-rc1+ #1
> [65237.831110] Hardware name: Supermicro Super Server/X11DPG-QT, BIOS 1.0
> 06/22/2017
> [65237.831112] RIP: 0010:__list_add_valid+0x6a/0x70
> [65237.831112] Code: 31 c0 48 c7 c7 90 08 ab a0 e8 22 b4 cd ff 0f 0b 31 c0
> c3 48 89 d1 48 c7 c7 40 08 ab a0 48 89 f2 48 89 c6 31 c0 e8 06 b4 cd ff
> <0f> 0b 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57
> [65237.831130] RSP: 0018:bb1d0a41fdd8 EFLAGS: 00010286
> [65237.831131] RAX:  RBX: 94d0ebf9 RCX:
> 0006
> [65237.831132] RDX:  RSI: 0086 RDI:
> 94d91ec16970
> [65237.831132] RBP: db1cff59e980 R08:  R09:
> 0663
> [65237.831133] R10: 0004 R11:  R12:
> 0001
> [65237.831134] R13:  R14: 94d0f0c57240 R15:
> 94d0f0f04c40
> [65237.831135] FS:  7fad10d02740() GS:94d91ec0()
> knlGS:
> [65237.831135] CS:  0010 DS:  ES:  CR0: 80050033
> [65237.831136] CR2: 7fad10c2a004 CR3: 00085600e006 CR4:
> 007606e0
> [65237.831137] DR0:  DR1:  DR2:
> 
> [65237.831138] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [65237.831138] PKRU: 5554
> [65237.831138] Call Trace:
> [65237.831145]  blk_mq_flush_plug_list+0x12d/0x260
> [65237.831150]  blk_flush_plug_list+0xe7/0x240
> [65237.831152]  blk_finish_plug+0x27/0x40
> [65237.831156]  __x64_sys_io_submit+0xc9/0x180
> [65237.831162]  do_syscall_64+0x5b/0x180
> [65237.831166]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

I guess we need to clean the list after list_splice_tail in the 1/1 patch,
as follows:
@@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, 
struct blk_mq_ctx *ctx,
struct list_head *list)
 
 {
...
+
+	spin_lock(&ctx->lock);
+	list_splice_tail(list, &ctx->rq_list);
+	INIT_LIST_HEAD(list);  // ---> HERE !
	blk_mq_hctx_mark_pending(hctx, ctx);
	spin_unlock(&ctx->lock);

Thanks
Jianchao