Re: [PATCH 01/10] Initial multipath implementation.

2017-09-12 Thread Anish Jhaveri
On Tue, Sep 12, 2017 at 12:00:44PM -0400, Keith Busch wrote:
> 
> I find this patch series confusing to review. You declare these failover
> functions in patch 1, use them in patch 2, but they're not defined until
> patch 7.
Sorry for the late reply.

The idea was to keep the header file changes in a separate patch.
I will move the function declarations to patch 7 and
rearrange the patch series. If you or anyone else has
suggestions that would make the changes easier to browse, I
will try to incorporate them in the next patchset.


Re: [PATCH 02/10] RDMA timeout triggers failover.

2017-09-12 Thread Anish Jhaveri

On Tue, Sep 12, 2017 at 08:48:58AM -0700, James Smart wrote:
> I don't know this is a good idea - just because there's a controller reset
> we need to failover a path ? Also putting failover smarts in the transport
> doesn't seem like a great idea. Also, there's more than just an rdma
> transport
> 
> -- james
>
Sorry for the late reply; configuring my email client took a while.

The failover trigger in the controller reset path can be removed,
since a keep-alive timer failure would trigger failover anyway.

However, consider the case where the target controller resets and
comes back up within one keep-alive period: all I/O requests are
aborted after multiple retries, and those retries complete before
the next keep-alive is serviced. In that case the keep-alive never
fails, so failover is never triggered.



Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread jianchao.wang


On 09/13/2017 11:54 AM, Jens Axboe wrote:
> On 09/12/2017 09:39 PM, jianchao.wang wrote:
>>> Exactly, and especially the readability is the key element here. It's
>>> just not worth it to try and be too clever, especially not for
>>> something like this. When you read the above, you immediately know
>>> what the code does without needing a comment. That's not true for the
>>> other construct.  You both have to read other parts of the function
>>> to figure out what it does, AND read the entire function to ensure it
>>> always does the right thing. Fragile.
>>
>> Thanks for your comments , jens and ming. I'm really appreciative of
>> that.  About the fragility, do you mean the possibility that may
>> release the tag of the next rq which has a driver tag itself (maybe a
>> flush) ?
> 
> I mean that if you do:
> 
> if (!list_empty(list))
>   blk_mq_put_driver_tag(nxt);
> 
> it's fragile code in the sense that changes elsewhere in the function
> are harder to validate and/or can impact the functionality of that
> simple if and tag put.
> 
> The actual release must always be safe, of course.
> 

Got it, thanks a lot.


Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread Jens Axboe
On 09/12/2017 09:39 PM, jianchao.wang wrote:
>> Exactly, and especially the readability is the key element here. It's
>> just not worth it to try and be too clever, especially not for
>> something like this. When you read the above, you immediately know
>> what the code does without needing a comment. That's not true for the
>> other construct.  You both have to read other parts of the function
>> to figure out what it does, AND read the entire function to ensure it
>> always does the right thing. Fragile.
> 
> Thanks for your comments , jens and ming. I'm really appreciative of
> that.  About the fragility, do you mean the possibility that may
> release the tag of the next rq which has a driver tag itself (maybe a
> flush) ?

I mean that if you do:

if (!list_empty(list))
blk_mq_put_driver_tag(nxt);

it's fragile code in the sense that changes elsewhere in the function
are harder to validate and/or can impact the functionality of that
simple if and tag put.

The actual release must always be safe, of course.

-- 
Jens Axboe



Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread Ming Lei
On Tue, Sep 12, 2017 at 08:45:19PM -0600, Jens Axboe wrote:
> On 09/12/2017 08:42 PM, jianchao.wang wrote:
> > 
> > 
> > On 09/13/2017 10:23 AM, Jens Axboe wrote:
> >> On 09/12/2017 07:39 PM, jianchao.wang wrote:
> >>>
> >>>
> >>> On 09/13/2017 09:24 AM, Ming Lei wrote:
>  On Wed, Sep 13, 2017 at 09:01:25AM +0800, jianchao.wang wrote:
> > Hi ming
> >
> > On 09/12/2017 06:23 PM, Ming Lei wrote:
> >>> @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct 
> >>> request_queue *q, struct list_head *list)
> >>>   if (list_empty(list))
> >>>   bd.last = true;
> >>>   else {
> >>> - struct request *nxt;
> >>> -
> >>>   nxt = list_first_entry(list, struct request, 
> >>> queuelist);
> >>>   bd.last = !blk_mq_get_driver_tag(nxt, NULL, 
> >>> false);
> >>>   }
> >>>  
> >>>   ret = q->mq_ops->queue_rq(hctx, &bd);
> >>>   if (ret == BLK_STS_RESOURCE) {
> >>> + /*
> >>> +  * If an I/O scheduler has been configured and 
> >>> we got a
> >>> +  * driver tag for the next request already, 
> >>> free it again.
> >>> +  */
> >>> + if (!list_empty(list)) {
> >>> + nxt = list_first_entry(list, struct 
> >>> request, queuelist);
> >>> + blk_mq_put_driver_tag(nxt);
> >>> + }
> >> The following way might be more simple and clean:
> >>
> >>if (nxt)
> >>blk_mq_put_driver_tag(nxt);
> >>
> >> meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
> >> before .queue_rq().
> >
> > I had ever thought about that, but to avoid add extra command in the 
> > fast path, I made the patch above.
> 
>  Got it, so how about changing to the following way simply:
> 
>   if (nxt && !list_empty(list))
>   blk_mq_put_driver_tag(nxt);
> 
> >>> It seems that we even could change it as following:
> >>> if (!list_empty(list))
> >>>   blk_mq_put_driver_tag(nxt);
> >>
> >> This is starting to get too clever for its own good, I generally don't
> >> like to sacrifice readability for performance. In reality, the compiler
> >> probably figures it out anyway...
> >>
> >> So either make it explicit, or add a nice comment as to why it is the
> >> way that it is.
> >>
> > yes, it indeed leads to compiler warning of "may be used uninitialized"
> > maybe the original one could be taken back.
> > if (!list_empty(list)) {
> > nxt = list_first_entry(list, struct request, 
> > queuelist);
> > blk_mq_put_driver_tag(nxt);
> > }
> > It is more readable and could avoid the warning.
> 
> Exactly, and especially the readability is the key element here. It's
> just not worth it to try and be too clever, especially not for something
> like this. When you read the above, you immediately know what the code
> does without needing a comment. That's not true for the other construct.
> You both have to read other parts of the function to figure out what it
> does, AND read the entire function to ensure it always does the right
> thing. Fragile.

OK, fair enough wrt. readability.

Reviewed-by: Ming Lei 

For the original post.

-- 
Ming


Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread jianchao.wang


On 09/13/2017 10:45 AM, Jens Axboe wrote:
 @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct 
 request_queue *q, struct list_head *list)
if (list_empty(list))
bd.last = true;
else {
 -  struct request *nxt;
 -
nxt = list_first_entry(list, struct request, 
 queuelist);
bd.last = !blk_mq_get_driver_tag(nxt, NULL, 
 false);
}
  
	ret = q->mq_ops->queue_rq(hctx, &bd);
if (ret == BLK_STS_RESOURCE) {
 +  /*
 +   * If an I/O scheduler has been configured and 
 we got a
 +   * driver tag for the next request already, 
 free it again.
 +   */
 +  if (!list_empty(list)) {
 +  nxt = list_first_entry(list, struct 
 request, queuelist);
 +  blk_mq_put_driver_tag(nxt);
 +  }
>>> The following way might be more simple and clean:
>>>
>>> if (nxt)
>>> blk_mq_put_driver_tag(nxt);
>>>
>>> meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
>>> before .queue_rq().
>> I had ever thought about that, but to avoid add extra command in the 
>> fast path, I made the patch above.
> Got it, so how about changing to the following way simply:
>
>   if (nxt && !list_empty(list))
>   blk_mq_put_driver_tag(nxt);
>
 It seems that we even could change it as following:
 if (!list_empty(list))
blk_mq_put_driver_tag(nxt);
>>> This is starting to get too clever for its own good, I generally don't
>>> like to sacrifice readability for performance. In reality, the compiler
>>> probably figures it out anyway...
>>>
>>> So either make it explicit, or add a nice comment as to why it is the
>>> way that it is.
>>>
>> yes, it indeed leads to compiler warning of "may be used uninitialized"
>> maybe the original one could be taken back.
>>  if (!list_empty(list)) {
>>  nxt = list_first_entry(list, struct request, 
>> queuelist);
>>  blk_mq_put_driver_tag(nxt);
>>  }
>> It is more readable and could avoid the warning.
> Exactly, and especially the readability is the key element here. It's
> just not worth it to try and be too clever, especially not for something
> like this. When you read the above, you immediately know what the code
> does without needing a comment. That's not true for the other construct.
> You both have to read other parts of the function to figure out what it
> does, AND read the entire function to ensure it always does the right
> thing. Fragile.

Thanks for your comments, Jens and Ming. I really appreciate them.
About the fragility, do you mean the possibility of releasing the tag of
the next rq, which already has a driver tag itself (maybe a flush)?

Thanks
jianchao


Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread Jens Axboe
On 09/12/2017 08:42 PM, jianchao.wang wrote:
> 
> 
> On 09/13/2017 10:23 AM, Jens Axboe wrote:
>> On 09/12/2017 07:39 PM, jianchao.wang wrote:
>>>
>>>
>>> On 09/13/2017 09:24 AM, Ming Lei wrote:
 On Wed, Sep 13, 2017 at 09:01:25AM +0800, jianchao.wang wrote:
> Hi ming
>
> On 09/12/2017 06:23 PM, Ming Lei wrote:
>>> @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct 
>>> request_queue *q, struct list_head *list)
>>> if (list_empty(list))
>>> bd.last = true;
>>> else {
>>> -   struct request *nxt;
>>> -
>>> nxt = list_first_entry(list, struct request, 
>>> queuelist);
>>> bd.last = !blk_mq_get_driver_tag(nxt, NULL, 
>>> false);
>>> }
>>>  
>>> 	ret = q->mq_ops->queue_rq(hctx, &bd);
>>> if (ret == BLK_STS_RESOURCE) {
>>> +   /*
>>> +* If an I/O scheduler has been configured and 
>>> we got a
>>> +* driver tag for the next request already, 
>>> free it again.
>>> +*/
>>> +   if (!list_empty(list)) {
>>> +   nxt = list_first_entry(list, struct 
>>> request, queuelist);
>>> +   blk_mq_put_driver_tag(nxt);
>>> +   }
>> The following way might be more simple and clean:
>>
>>  if (nxt)
>>  blk_mq_put_driver_tag(nxt);
>>
>> meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
>> before .queue_rq().
>
> I had ever thought about that, but to avoid add extra command in the 
> fast path, I made the patch above.

 Got it, so how about changing to the following way simply:

if (nxt && !list_empty(list))
blk_mq_put_driver_tag(nxt);

>>> It seems that we even could change it as following:
>>> if (!list_empty(list))
>>> blk_mq_put_driver_tag(nxt);
>>
>> This is starting to get too clever for its own good, I generally don't
>> like to sacrifice readability for performance. In reality, the compiler
>> probably figures it out anyway...
>>
>> So either make it explicit, or add a nice comment as to why it is the
>> way that it is.
>>
> yes, it indeed leads to compiler warning of "may be used uninitialized"
> maybe the original one could be taken back.
>   if (!list_empty(list)) {
>   nxt = list_first_entry(list, struct request, 
> queuelist);
>   blk_mq_put_driver_tag(nxt);
>   }
> It is more readable and could avoid the warning.

Exactly, and especially the readability is the key element here. It's
just not worth it to try and be too clever, especially not for something
like this. When you read the above, you immediately know what the code
does without needing a comment. That's not true for the other construct.
You both have to read other parts of the function to figure out what it
does, AND read the entire function to ensure it always does the right
thing. Fragile.

-- 
Jens Axboe



Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread jianchao.wang


On 09/13/2017 10:23 AM, Jens Axboe wrote:
> On 09/12/2017 07:39 PM, jianchao.wang wrote:
>>
>>
>> On 09/13/2017 09:24 AM, Ming Lei wrote:
>>> On Wed, Sep 13, 2017 at 09:01:25AM +0800, jianchao.wang wrote:
 Hi ming

 On 09/12/2017 06:23 PM, Ming Lei wrote:
>> @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct 
>> request_queue *q, struct list_head *list)
>>  if (list_empty(list))
>>  bd.last = true;
>>  else {
>> -struct request *nxt;
>> -
>>  nxt = list_first_entry(list, struct request, 
>> queuelist);
>>  bd.last = !blk_mq_get_driver_tag(nxt, NULL, 
>> false);
>>  }
>>  
>>  	ret = q->mq_ops->queue_rq(hctx, &bd);
>>  if (ret == BLK_STS_RESOURCE) {
>> +/*
>> + * If an I/O scheduler has been configured and 
>> we got a
>> + * driver tag for the next request already, 
>> free it again.
>> + */
>> +if (!list_empty(list)) {
>> +nxt = list_first_entry(list, struct 
>> request, queuelist);
>> +blk_mq_put_driver_tag(nxt);
>> +}
> The following way might be more simple and clean:
>
>   if (nxt)
>   blk_mq_put_driver_tag(nxt);
>
> meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
> before .queue_rq().

 I had ever thought about that, but to avoid add extra command in the 
 fast path, I made the patch above.
>>>
>>> Got it, so how about changing to the following way simply:
>>>
>>> if (nxt && !list_empty(list))
>>> blk_mq_put_driver_tag(nxt);
>>>
>> It seems that we even could change it as following:
>> if (!list_empty(list))
>>  blk_mq_put_driver_tag(nxt);
> 
> This is starting to get too clever for its own good, I generally don't
> like to sacrifice readability for performance. In reality, the compiler
> probably figures it out anyway...
> 
> So either make it explicit, or add a nice comment as to why it is the
> way that it is.
> 
Yes, it indeed leads to a compiler warning of "may be used uninitialized";
maybe the original version should be taken back.
if (!list_empty(list)) {
nxt = list_first_entry(list, struct request, 
queuelist);
blk_mq_put_driver_tag(nxt);
}
It is more readable and could avoid the warning.

Thanks
jianchao


Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread Jens Axboe
On 09/12/2017 07:39 PM, jianchao.wang wrote:
> 
> 
> On 09/13/2017 09:24 AM, Ming Lei wrote:
>> On Wed, Sep 13, 2017 at 09:01:25AM +0800, jianchao.wang wrote:
>>> Hi ming
>>>
>>> On 09/12/2017 06:23 PM, Ming Lei wrote:
> @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue 
> *q, struct list_head *list)
>   if (list_empty(list))
>   bd.last = true;
>   else {
> - struct request *nxt;
> -
>   nxt = list_first_entry(list, struct request, queuelist);
>   bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
>   }
>  
>   ret = q->mq_ops->queue_rq(hctx, &bd);
>   if (ret == BLK_STS_RESOURCE) {
> + /*
> +  * If an I/O scheduler has been configured and we got a
> +  * driver tag for the next request already, free it 
> again.
> +  */
> + if (!list_empty(list)) {
> + nxt = list_first_entry(list, struct request, 
> queuelist);
> + blk_mq_put_driver_tag(nxt);
> + }
 The following way might be more simple and clean:

if (nxt)
blk_mq_put_driver_tag(nxt);

 meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
 before .queue_rq().
>>>
>>> I had ever thought about that, but to avoid add extra command in the 
>>> fast path, I made the patch above.
>>
>> Got it, so how about changing to the following way simply:
>>
>>  if (nxt && !list_empty(list))
>>  blk_mq_put_driver_tag(nxt);
>>
> It seems that we even could change it as following:
> if (!list_empty(list))
>   blk_mq_put_driver_tag(nxt);

This is starting to get too clever for its own good, I generally don't
like to sacrifice readability for performance. In reality, the compiler
probably figures it out anyway...

So either make it explicit, or add a nice comment as to why it is the
way that it is.

-- 
Jens Axboe



Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread Ming Lei
On Wed, Sep 13, 2017 at 09:39:44AM +0800, jianchao.wang wrote:
> 
> 
> On 09/13/2017 09:24 AM, Ming Lei wrote:
> > On Wed, Sep 13, 2017 at 09:01:25AM +0800, jianchao.wang wrote:
> >> Hi ming
> >>
> >> On 09/12/2017 06:23 PM, Ming Lei wrote:
>  @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct 
>  request_queue *q, struct list_head *list)
>   if (list_empty(list))
>   bd.last = true;
>   else {
>  -struct request *nxt;
>  -
>   nxt = list_first_entry(list, struct request, 
>  queuelist);
>   bd.last = !blk_mq_get_driver_tag(nxt, NULL, 
>  false);
>   }
>   
>   ret = q->mq_ops->queue_rq(hctx, &bd);
>   if (ret == BLK_STS_RESOURCE) {
>  +/*
>  + * If an I/O scheduler has been configured and 
>  we got a
>  + * driver tag for the next request already, 
>  free it again.
>  + */
>  +if (!list_empty(list)) {
>  +nxt = list_first_entry(list, struct 
>  request, queuelist);
>  +blk_mq_put_driver_tag(nxt);
>  +}
> >>> The following way might be more simple and clean:
> >>>
> >>>   if (nxt)
> >>>   blk_mq_put_driver_tag(nxt);
> >>>
> >>> meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
> >>> before .queue_rq().
> >>
> >> I had ever thought about that, but to avoid add extra command in the 
> >> fast path, I made the patch above.
> > 
> > Got it, so how about changing to the following way simply:
> > 
> > if (nxt && !list_empty(list))
> > blk_mq_put_driver_tag(nxt);
> > 
> It seems that we even could change it as following:
> if (!list_empty(list))
>   blk_mq_put_driver_tag(nxt);

Yeah, that is definitely more clean, :-)

Feel free to add "Reviewed-by: Ming Lei "
in your next post.

-- 
Ming


Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread jianchao.wang


On 09/13/2017 09:24 AM, Ming Lei wrote:
> On Wed, Sep 13, 2017 at 09:01:25AM +0800, jianchao.wang wrote:
>> Hi ming
>>
>> On 09/12/2017 06:23 PM, Ming Lei wrote:
 @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue 
 *q, struct list_head *list)
if (list_empty(list))
bd.last = true;
else {
 -  struct request *nxt;
 -
nxt = list_first_entry(list, struct request, queuelist);
bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
}
  
	ret = q->mq_ops->queue_rq(hctx, &bd);
if (ret == BLK_STS_RESOURCE) {
 +  /*
 +   * If an I/O scheduler has been configured and we got a
 +   * driver tag for the next request already, free it 
 again.
 +   */
 +  if (!list_empty(list)) {
 +  nxt = list_first_entry(list, struct request, 
 queuelist);
 +  blk_mq_put_driver_tag(nxt);
 +  }
>>> The following way might be more simple and clean:
>>>
>>> if (nxt)
>>> blk_mq_put_driver_tag(nxt);
>>>
>>> meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
>>> before .queue_rq().
>>
>> I had ever thought about that, but to avoid add extra command in the 
>> fast path, I made the patch above.
> 
> Got it, so how about changing to the following way simply:
> 
>   if (nxt && !list_empty(list))
>   blk_mq_put_driver_tag(nxt);
> 
It seems that we could even change it as follows:
if (!list_empty(list))
blk_mq_put_driver_tag(nxt);


thanks
jianchao


Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread Ming Lei
On Wed, Sep 13, 2017 at 09:01:25AM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 09/12/2017 06:23 PM, Ming Lei wrote:
> >> @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue 
> >> *q, struct list_head *list)
> >>if (list_empty(list))
> >>bd.last = true;
> >>else {
> >> -  struct request *nxt;
> >> -
> >>nxt = list_first_entry(list, struct request, queuelist);
> >>bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
> >>}
> >>  
> >>	ret = q->mq_ops->queue_rq(hctx, &bd);
> >>if (ret == BLK_STS_RESOURCE) {
> >> +  /*
> >> +   * If an I/O scheduler has been configured and we got a
> >> +   * driver tag for the next request already, free it 
> >> again.
> >> +   */
> >> +  if (!list_empty(list)) {
> >> +  nxt = list_first_entry(list, struct request, 
> >> queuelist);
> >> +  blk_mq_put_driver_tag(nxt);
> >> +  }
> > The following way might be more simple and clean:
> > 
> > if (nxt)
> > blk_mq_put_driver_tag(nxt);
> > 
> > meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
> > before .queue_rq().
> 
> I had ever thought about that, but to avoid add extra command in the 
> fast path, I made the patch above.

Got it, so how about changing to the following way simply:

if (nxt && !list_empty(list))
blk_mq_put_driver_tag(nxt);

-- 
Ming


Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread jianchao.wang
Hi ming

On 09/12/2017 06:23 PM, Ming Lei wrote:
>> @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue 
>> *q, struct list_head *list)
>>  if (list_empty(list))
>>  bd.last = true;
>>  else {
>> -struct request *nxt;
>> -
>>  nxt = list_first_entry(list, struct request, queuelist);
>>  bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
>>  }
>>  
>>  	ret = q->mq_ops->queue_rq(hctx, &bd);
>>  if (ret == BLK_STS_RESOURCE) {
>> +/*
>> + * If an I/O scheduler has been configured and we got a
>> + * driver tag for the next request already, free it 
>> again.
>> + */
>> +if (!list_empty(list)) {
>> +nxt = list_first_entry(list, struct request, 
>> queuelist);
>> +blk_mq_put_driver_tag(nxt);
>> +}
> The following way might be more simple and clean:
> 
>   if (nxt)
>   blk_mq_put_driver_tag(nxt);
> 
> meantime 'nxt' need to be cleared inside the 'if (list_empty(list))'
> before .queue_rq().

I had thought about that, but to avoid adding an extra instruction in the
fast path, I made the patch above.

Thanks
Jianchao


Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq

2017-09-12 Thread Damien Le Moal
Ming,

On 9/12/17 18:26, Ming Lei wrote:
>>> OK, I suggest to document this guarantee of no write reorder for ZBC
>>> somewhere, so that people will keep it in mind when trying to change
>>> the current code.
>>
>> Have you looked at the comments in sd_zbc.c ? That is explained there.
>> Granted, this is a little deep in the stack, but this is after all
>> dependent on the implementation of scsi_request_fn(). I can add comments
>> there too if you prefer.
> 
> Yeah, I looked at that, but seems it is too coarse.

OK. Will improve that in V3.

>> Zoned disks are recent and all of them support FUA. This means that a
>> write with FUA will be processed like any other write request (if I read
>> the code in blk-flush.c correctly). Well, at least for the mq case,
>> which does a blk_mq_sched_insert_request(), while the direct call to
> 
> blk_mq_sched_bypass_insert() can be called for flush requests too,
> since requests in flush sequence share one driver tag(rq->tag != -1),
> then the rq can be added to front of hctx->dispatch directly.

For pure flush commands, any order is OK, since these are not actual
writes. The ZBC/ZAC standards do not modify the behavior of
flush/synchronize cache commands.

But I will look into more detail at the PREFLUSH/POSTFLUSH mechanic and
relation to write request ordering to make sure no corner cases are left
ignored.

Thanks for the comments.

Best regards.

-- 
Damien Le Moal,
Western Digital


Re: [PATCH V4 0/10] block/scsi: safe SCSI quiescing

2017-09-12 Thread Cathy Avery

On 09/11/2017 07:10 AM, Ming Lei wrote:

Hi,

The current SCSI quiesce isn't safe and can easily trigger I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for
RQF_PREEMPT can be dispatched to SCSI successfully, and
scsi_device_quiesce() simply waits for completion of the I/Os
already dispatched to the SCSI stack. That isn't enough at all.

New requests can still come in, but none of the allocated requests
can be dispatched successfully, so the request pool is easily
consumed up.

Then a request with RQF_PREEMPT can't be allocated and waits forever;
meantime scsi_device_resume() waits for completion of the RQF_PREEMPT
request, so the system hangs forever, for example during system
suspend or SCSI domain validation.

Both IO hang inside system suspend[1] or SCSI domain validation
were reported before.

This patchset introduces preempt freeze, and solves the issue
by preempt-freezing the block queue during SCSI quiesce, while still
allowing allocation of RQF_PREEMPT requests when the queue is in this state.

Oleksandr verified that V3 does fix the hang during suspend/resume,
and Cathy verified that revised V3 fixes hang in sending
SCSI domain validation.

Both SCSI and SCSI_MQ have this IO deadlock issue, this patch fixes
them all by introducing/unifying blk_freeze_queue_preempt() and
blk_unfreeze_queue_preempt(), and cleanup is done together.

The patchset can be found in the following gitweb:

https://github.com/ming1/linux/tree/blk_safe_scsi_quiesce_V4

V4:
- reorganize patch order to make it more reasonable
- support nested preempt freeze, as required by SCSI transport spi
- check preempt freezing in the slow path of blk_queue_enter()
- add "SCSI: transport_spi: resume a quiesced device"
- wake up freeze queue in setting dying for both blk-mq and legacy
- rename blk_mq_[freeze|unfreeze]_queue() in one patch
- rename .mq_freeze_wq and .mq_freeze_depth
- improve comment

V3:
- introduce q->preempt_unfreezing to fix one bug of preempt freeze
- call blk_queue_enter_live() only when queue is preempt frozen
- cleanup a bit on the implementation of preempt freeze
- only patch 6 and 7 are changed

V2:
- drop the 1st patch in V1 because percpu_ref_is_dying() is
enough as pointed by Tejun
- introduce preempt version of blk_[freeze|unfreeze]_queue
- sync between preempt freeze and normal freeze
- fix warning from percpu-refcount as reported by Oleksandr


[1] https://marc.info/?t=150340250100013&r=3&w=2


Thanks,
Ming


Ming Lei (10):
   blk-mq: only run hw queues for blk-mq
   block: tracking request allocation with q_usage_counter
   blk-mq: rename blk_mq_[freeze|unfreeze]_queue
   blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
   block: rename .mq_freeze_wq and .mq_freeze_depth
   block: pass flags to blk_queue_enter()
   block: introduce preempt version of blk_[freeze|unfreeze]_queue
   block: allow to allocate req with RQF_PREEMPT when queue is preempt
 frozen
   SCSI: transport_spi: resume a quiesced device
   SCSI: preempt freeze block queue when SCSI device is put into quiesce

  block/bfq-iosched.c   |   2 +-
  block/blk-cgroup.c|   8 +-
  block/blk-core.c  |  95 
  block/blk-mq.c| 180 --
  block/blk-mq.h|   1 -
  block/blk-timeout.c   |   2 +-
  block/blk.h   |  12 +++
  block/elevator.c  |   4 +-
  drivers/block/loop.c  |  24 ++---
  drivers/block/rbd.c   |   2 +-
  drivers/nvme/host/core.c  |   8 +-
  drivers/scsi/scsi_lib.c   |  25 +-
  drivers/scsi/scsi_transport_spi.c |   3 +
  fs/block_dev.c|   4 +-
  include/linux/blk-mq.h|  15 ++--
  include/linux/blkdev.h|  32 +--
  16 files changed, 313 insertions(+), 104 deletions(-)



I've tested this patch set with spi_transport issuing domain validation
under low blk_request conditions.


Tested-by: Cathy Avery 


Re: [PATCH 4/5] block: Make SCSI device suspend work reliably

2017-09-12 Thread Ming Lei
On Tue, Sep 12, 2017 at 03:45:50PM +, Bart Van Assche wrote:
> On Tue, 2017-09-12 at 10:29 +0800, Ming Lei wrote:
> > On Fri, Sep 08, 2017 at 04:52:25PM -0700, Bart Van Assche wrote:
> > > @@ -1350,6 +1368,16 @@ static struct request *get_request(struct 
> > > request_queue *q, unsigned int op,
> > >   lockdep_assert_held(q->queue_lock);
> > >   WARN_ON_ONCE(q->mq_ops);
> > >  
> > > + WARN_ON_ONCE((op & REQ_PM) && blk_pm_suspended(q));
> > > +
> > > + /*
> > > +  * Wait if the request queue is suspended or in the process of
> > > +  * suspending/resuming and the request being allocated will not be
> > > +  * used for power management purposes.
> > > +  */
> > > + if (!(op & REQ_PM) && !blk_wait_until_active(q, !(op & REQ_NOWAIT)))
> > > + return ERR_PTR(-EAGAIN);
> > > +
> > 
> > Firstly it is not enough to prevent new allocation only, because
> > requests pool may have been used up and all the allocated requests
> > just stay in block queue when SCSI device is put into quiesce.
> > So you may cause new I/O hang and wait forever here. That is why
> > I take freeze to do that because freezing queue can prevent new
> > allocation and drain in-queue requests both.
> 
> Sorry but I disagree. If all tags are allocated and it is attempted to
> suspend a queue then this patch not only will prevent allocation of new
> non-PM requests but it will also let these previously submitted non-PM
> requests finish. See also the blk_peek_request() changes in patch 4/5.
> Once a previously submitted request finished allocation of the PM request
> will succeed.

You simply removed blk_pm_peek_request() from blk_peek_request();
how does that work on blk-mq?

Also, that may not work because SCSI quiesce may happen between
request allocation and running the queue, in which case nothing can be
dispatched to the driver.

> 
> > Secondly, all RQF_PREEMPT(not PM) is blocked here too, that may
> > cause regression since we usually need/allow RQF_PREEMPT to run
> > during SCSI quiesce.
> 
> Sorry but I disagree with this comment too. Please keep in mind that my patch
> only affects the SCSI quiesce status if that status has been requested by the
> power management subsystem. After the power management subsystem has
> transitioned a SCSI queue into the quiesce state that queue has reached the
> RPM_SUSPENDED status. No new requests must be submitted against a suspended

No, RPM_SUSPENDED only refers to runtime suspend; you are misreading it
as system suspend.

Actually the reported issue occurs during system suspend/resume, which can
happen without runtime PM, for example when runtime PM is disabled via sysfs.

> queue. It is easy to see in the legacy block layer that only PM requests and
> no other RQF_PREEMPT requests are processed once the queue has reached the
> suspended status:
> 
> /*
>  * Don't process normal requests when queue is suspended
>  * or in the process of suspending/resuming
>  */
> static struct request *blk_pm_peek_request(struct request_queue *q,
>  struct request *rq)
> {
> 	if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
> 	    (q->rpm_status != RPM_ACTIVE && !(rq->cmd_flags & REQ_PM))))

As I explained, rpm_status represents the runtime PM status and has nothing
to do with system suspend/resume.

So RQF_PREEMPT requests can still be dispatched to the driver during system
suspend.

-- 
Ming


Re: [PATCH 10/10] Reserved tag for active ns command

2017-09-12 Thread James Smart
Please commonize the # of reserved tags.  There's more than 1 transport 
and they all do the same things.


-- james


On 9/11/2017 9:33 PM, Anish M Jhaveri wrote:

Use a reserved tag for the NVMe command that sets a standby namespace to active.

Signed-off-by: Anish M Jhaveri 
---
  drivers/nvme/host/rdma.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index cb6a5f8..d094793 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1550,7 +1550,7 @@ static int nvme_rdma_configure_admin_queue(struct 
nvme_rdma_ctrl *ctrl)
	memset(&ctrl->admin_tag_set, 0, sizeof(ctrl->admin_tag_set));
	ctrl->admin_tag_set.ops = &nvme_rdma_admin_mq_ops;
ctrl->admin_tag_set.queue_depth = NVME_RDMA_AQ_BLKMQ_DEPTH;
-   ctrl->admin_tag_set.reserved_tags = 2; /* connect + keep-alive */
+   ctrl->admin_tag_set.reserved_tags = 3; /* connect + keep-alive */
ctrl->admin_tag_set.numa_node = NUMA_NO_NODE;
ctrl->admin_tag_set.cmd_size = sizeof(struct nvme_rdma_request) +
SG_CHUNK_SIZE * sizeof(struct scatterlist);




Re: [PATCH 02/10] RDMA timeout triggers failover.

2017-09-12 Thread James Smart
I don't know this is a good idea - just because there's a controller 
reset we need to failover a path ? Also putting failover smarts in the 
transport doesn't seem like a great idea. Also, there's more than just 
an rdma transport


-- james



On 9/11/2017 9:22 PM, Anish M Jhaveri wrote:

Trigger-failover functionality will be called on any RDMA timeout. Such a
timeout can occur when an IO fails to be returned from the target, for
example because an interface went down, which leads to the failover
functionality being triggered.

Signed-off-by: Anish M Jhaveri 
---
  drivers/nvme/host/rdma.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index a03299d..cb6a5f8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -174,6 +174,7 @@ static void nvme_rdma_free_qe(struct ib_device *ibdev, 
struct nvme_rdma_qe *qe,
  {
ib_dma_unmap_single(ibdev, qe->dma, capsule_size, dir);
kfree(qe->data);
+   qe->data = NULL;
  }
  
  static int nvme_rdma_alloc_qe(struct ib_device *ibdev, struct nvme_rdma_qe *qe,

@@ -766,6 +767,8 @@ static void nvme_rdma_error_recovery_work(struct 
work_struct *work)
  
	nvme_stop_ctrl(&ctrl->ctrl);

+	nvme_trigger_failover(&ctrl->ctrl);
+
	for (i = 0; i < ctrl->ctrl.queue_count; i++)
		clear_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[i].flags);
  




Re: [PATCH 4/5] block: Make SCSI device suspend work reliably

2017-09-12 Thread Bart Van Assche
On Tue, 2017-09-12 at 10:29 +0800, Ming Lei wrote:
> On Fri, Sep 08, 2017 at 04:52:25PM -0700, Bart Van Assche wrote:
> > @@ -1350,6 +1368,16 @@ static struct request *get_request(struct 
> > request_queue *q, unsigned int op,
> > lockdep_assert_held(q->queue_lock);
> > WARN_ON_ONCE(q->mq_ops);
> >  
> > +   WARN_ON_ONCE((op & REQ_PM) && blk_pm_suspended(q));
> > +
> > +   /*
> > +* Wait if the request queue is suspended or in the process of
> > +* suspending/resuming and the request being allocated will not be
> > +* used for power management purposes.
> > +*/
> > +   if (!(op & REQ_PM) && !blk_wait_until_active(q, !(op & REQ_NOWAIT)))
> > +   return ERR_PTR(-EAGAIN);
> > +
> 
> Firstly it is not enough to prevent only new allocations, because the
> request pool may have been used up and all the allocated requests may
> just stay in the block queue when the SCSI device is put into quiesce.
> So you may cause a new I/O hang and wait forever here. That is why
> I use freezing to do that, because freezing the queue both prevents new
> allocation and drains in-queue requests.

Sorry but I disagree. If all tags are allocated and it is attempted to
suspend a queue then this patch not only will prevent allocation of new
non-PM requests but it will also let these previously submitted non-PM
requests finish. See also the blk_peek_request() changes in patch 4/5.
Once a previously submitted request finished allocation of the PM request
will succeed.

> Secondly, all RQF_PREEMPT (non-PM) requests are blocked here too, which may
> cause a regression since we usually need/allow RQF_PREEMPT to run
> during SCSI quiesce.

Sorry but I disagree with this comment too. Please keep in mind that my patch
only affects the SCSI quiesce status if that status has been requested by the
power management subsystem. After the power management subsystem has
transitioned a SCSI queue into the quiesce state that queue has reached the
RPM_SUSPENDED status. No new requests must be submitted against a suspended
queue. It is easy to see in the legacy block layer that only PM requests and
no other RQF_PREEMPT requests are processed once the queue has reached the
suspended status:

/*
 * Don't process normal requests when queue is suspended
 * or in the process of suspending/resuming
 */
static struct request *blk_pm_peek_request(struct request_queue *q,
   struct request *rq)
{
if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
(q->rpm_status != RPM_ACTIVE && !(rq->cmd_flags & REQ_PM))))
return NULL;
else
return rq;
}

Bart.

Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-12 Thread weiping zhang
On Wed, Sep 06, 2017 at 01:00:44PM +, Bart Van Assche wrote:
> On Wed, 2017-09-06 at 15:34 +0800, weiping zhang wrote:
> > On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote:
> > > On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote:
> > > > if blk-mq use "none" io scheduler, nr_request get a wrong value when
> > > > input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get
> > > > the smaller one min(nr, set->queue_depth), and then q->nr_request get a
> > > > wrong value.
> > > > 
> > > > Reproduce:
> > > > 
> > > > echo none > /sys/block/nvme0n1/queue/ioscheduler
> > > > echo 100 > /sys/block/nvme0n1/queue/nr_requests
> > > > cat /sys/block/nvme0n1/queue/nr_requests
> > > > 100
> > > > 
> > > > Signed-off-by: weiping zhang 
> > > > ---
> > > >  block/blk-mq.c | 7 +--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index f84d145..8303e5e 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct 
> > > > request_queue *q, unsigned int nr)
> > > >  * queue depth. This is similar to what the old code 
> > > > would do.
> > > >  */
> > > > if (!hctx->sched_tags) {
> > > > -   ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> > > > -   min(nr, 
> > > > set->queue_depth),
> > > > +   if (nr > set->queue_depth) {
> > > > +   nr = set->queue_depth;
> > > > +   pr_warn("reduce nr_request to %u\n", 
> > > > nr);
> > > > +   }
> > > > +   ret = blk_mq_tag_update_depth(hctx, 
> > > > &hctx->tags, nr,
> > > > false);
> > > > } else {
> > > > ret = blk_mq_tag_update_depth(hctx, 
> > > > &hctx->sched_tags,
> > > 
> > > Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large? That 
> > > will help to
> > > keep user space code simple that updates the queue depth.
> > 
> > Hi Bart,
> > 
> > The reason why I did not return -EINVAL is to keep in line with the
> > minimum checking in queue_requests_store.
> > If you insist on returning -EINVAL/-ERANGE, the minimum checking should
> > keep the same behavior. Both returning an error message and quietly
> > clamping are okay for me. Which way do you prefer ?
> > 
> > static ssize_t
> > queue_requests_store(struct request_queue *q, const char *page, size_t 
> > count)
> > {
> > unsigned long nr;
> > int ret, err;
> > 
> > if (!q->request_fn && !q->mq_ops)
> > return -EINVAL;
> > 
> > ret = queue_var_store(&nr, page, count);
> > if (ret < 0)
> > return ret;
> > 
> > if (nr < BLKDEV_MIN_RQ)
> > nr = BLKDEV_MIN_RQ;
> 
> Hello Jens,
> 
> Do you perhaps have a preference for one of the approaches that have been 
> discussed
> in this e-mail thread?
> 
> Thanks,
> 
> Bart.

Hello Jens,

Would you please give some comments about this patch,

Thanks
Weiping.


Re: [PATCH V2] lightnvm: prevent bd removal if busy

2017-09-12 Thread Javier González
> On 10 Sep 2017, at 21.07, Rakesh Pandit  wrote:
> 
> When a virtual block device is formatted and mounted after creating
> with "nvme lnvm create... -t pblk", a removal from "nvm lnvm remove"
> would result in this:
> 
> [446416.309757] bdi-block not registered
> [446416.309773] [ cut here ]
> [446416.309780] WARNING: CPU: 3 PID: 4319 at fs/fs-writeback.c:2159 
> __mark_inode_dirty+0x268/0x340
> 
> Ideally removal should return -EBUSY as the block device is mounted after
> formatting.  This patch addresses that by checking whether the whole device
> or any partition of it is mounted before removal.
> 
> The whole device is checked using the "bd_super" member of the block
> device.  This member is always set once the block device has been mounted
> using a filesystem.  Another member, "bd_part_count", takes care of
> checking whether any partitions are in use.  "bd_part_count" is only
> updated under locks when partitions are opened or closed (first open and
> last release).  This at least takes care of sending -EBUSY if removal is
> attempted while the whole block device or any partition is mounted.
> 
> Signed-off-by: Rakesh Pandit 
> ---
> 
> V2: Take a different approach. Instead of checking bd_openers use
> bd_super and bd_part_count.  This should address the removal of bdevs
> which are mounted from removal.
> 
> drivers/lightnvm/core.c | 14 ++
> 1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
> index c39f87d..9f9a137 100644
> --- a/drivers/lightnvm/core.c
> +++ b/drivers/lightnvm/core.c
> @@ -373,6 +373,7 @@ static void __nvm_remove_target(struct nvm_target *t)
> static int nvm_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove 
> *remove)
> {
>   struct nvm_target *t;
> + struct block_device *bdev;
> 
>   mutex_lock(&dev->mlock);
>   t = nvm_find_target(dev, remove->tgtname);
> @@ -380,6 +381,19 @@ static int nvm_remove_tgt(struct nvm_dev *dev, struct 
> nvm_ioctl_remove *remove)
>   mutex_unlock(&dev->mlock);
>   return 1;
>   }
> + bdev = bdget_disk(t->disk, 0);
> + if (!bdev) {
> + pr_err("nvm: removal failed, allocating bd failed\n");
> + mutex_unlock(&dev->mlock);
> + return -ENOMEM;
> + }
> + if (bdev->bd_super || bdev->bd_part_count) {
> + pr_err("nvm: removal failed, block device busy\n");
> + bdput(bdev);
> + mutex_unlock(&dev->mlock);
> + return -EBUSY;
> + }
> + bdput(bdev);
>   __nvm_remove_target(t);
>   mutex_unlock(&dev->mlock);
> 
> --
> 2.7.4

Looks good.

Reviewed-by: Javier González 





Re: [PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread Ming Lei
On Wed, Sep 13, 2017 at 01:14:35AM +0800, Jianchao Wang wrote:
> When freeing the driver tag of the next rq with an I/O scheduler
> configured, we take the first entry of the list. However, at that
> moment the failed rq has been requeued at the head of the list,
> so the rq we get is the failed rq, not the next rq.
> Free the driver tag of the next rq before the failed one is requeued
> in the failure branch of the queue_rq callback; it is only needed
> there.

Looks a good catch.

> 
> Signed-off-by: Jianchao Wang 
> ---
>  block/blk-mq.c | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4603b11..19f848e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -983,7 +983,7 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx 
> *hctx)
>  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
>  {
>   struct blk_mq_hw_ctx *hctx;
> - struct request *rq;
> + struct request *rq, *nxt;
>   int errors, queued;
>  
>   if (list_empty(list))
> @@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
> struct list_head *list)
>   if (list_empty(list))
>   bd.last = true;
>   else {
> - struct request *nxt;
> -
>   nxt = list_first_entry(list, struct request, queuelist);
>   bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
>   }
>  
>   ret = q->mq_ops->queue_rq(hctx, &bd);
>   if (ret == BLK_STS_RESOURCE) {
> + /*
> +  * If an I/O scheduler has been configured and we got a
> +  * driver tag for the next request already, free it 
> again.
> +  */
> + if (!list_empty(list)) {
> + nxt = list_first_entry(list, struct request, 
> queuelist);
> + blk_mq_put_driver_tag(nxt);
> + }

The following way might be simpler and cleaner:

if (nxt)
blk_mq_put_driver_tag(nxt);

meantime 'nxt' needs to be cleared inside the 'if (list_empty(list))'
branch before .queue_rq().

-- 
Ming


Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq

2017-09-12 Thread Ming Lei
On Tue, Sep 12, 2017 at 05:24:02PM +0900, Damien Le Moal wrote:
> Ming,
> 
> On 9/10/17 14:10, Ming Lei wrote:
> > On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote:
> >> Ming,
> >>
> >> On 9/8/17 05:43, Ming Lei wrote:
> >>> Hi Damien,
> >>>
> >>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
>  In the case of a ZBC disk used with scsi-mq, zone write locking does
>  not prevent write reordering in sequential zones. Unlike the legacy
>  case, zone locking can only be done after the command request is
>  removed from the scheduler dispatch queue. That is, at the time of
>  zone locking, the write command may already be out of order.
> >>>
> >>> Per my understanding, for legacy case, it can be quite tricky to let
> >>> the existed I/O scheduler guarantee the write order for ZBC disk.
> >>> I guess requeue still might cause write reorder even in legacy path,
> >>> since requeue can happen in both scsi_request_fn() and 
> >>> scsi_io_completion()
> >>> with q->queue_lock released, meantime new rq belonging to the same
> >>> zone can come and be inserted to queue.
> >>
> >> Yes, the write ordering will always depend on the scheduler doing the
> >> right thing. But both cfq, deadline and even noop do the right thing
> >> there, even considering the aging case. The next write for a zone will
> >> always be the oldest in the queue for that zone, if it is not, it means
> >> that the application did not write sequentially. Extensive testing in
> >> the legacy case never showed a problem due to the scheduler itself.
> > 
> > OK, I suggest to document this guarantee of no write reorder for ZBC
> > somewhere, so that people will keep it in mind when trying to change
> > the current code.
> 
> Have you looked at the comments in sd_zbc.c ? That is explained there.
> Granted, this is a little deep in the stack, but this is after all
> dependent on the implementation of scsi_request_fn(). I can add comments
> there too if you prefer.

Yeah, I looked at that, but it seems too coarse.

> 
> >> scsi_requeue_command() does the unprep (zone unlock) and requeue while
> >> holding the queue lock. So this is atomic with new write command
> >> insertion. Requeued commands are added to the dispatch queue head, and
> >> since a zone will only have a single write in-flight, there is no
> >> reordering possible. The next write command for a zone to go again is
> >> the last requeued one or the next in lba order. It works.
> > 
> One special case is a write with FLUSH/FUA, which may be added to the
> front of q->queue_head directly. Suppose one write with FUA comes
> just between requeue and run queue; a write reorder may be
> triggered.
> 
> Zoned disks are recent and all of them support FUA. This means that a
> write with FUA will be processed like any other write request (if I read
> the code in blk-flush.c correctly). Well, at least for the mq case,
> which does a blk_mq_sched_insert_request(), while the direct call to

blk_mq_sched_bypass_insert() can be called for flush requests too,
since requests in flush sequence share one driver tag(rq->tag != -1),
then the rq can be added to front of hctx->dispatch directly.


-- 
Ming


Re: [PATCH 5/5] blk-mq: Implement power management support

2017-09-12 Thread Ming Lei
On Fri, Sep 08, 2017 at 04:52:26PM -0700, Bart Van Assche wrote:
> Implement the following approach for blk-mq:
> - Either make blk_get_request() wait or make it fail when a
>   request queue is not in status RPM_ACTIVE.
> - While suspending, suspended or resuming, only process power
>   management requests (REQ_PM).
> 
> Reported-by: Oleksandr Natalenko 
> References: "I/O hangs after resuming from suspend-to-ram" 
> (https://marc.info/?l=linux-block&m=150340235201348).

This patch has nothing to do with Oleksandr's report; please
remove the above two lines. For example, runtime PM can
be bypassed via sysfs, and suspend/resume still works
well.

> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> Cc: Rafael J. Wysocki 
> Cc: Ming Lei 
> ---
>  block/blk-core.c | 20 
>  block/blk-mq.c   | 34 ++
>  2 files changed, 50 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index cd2700c763ed..49a4cd5b255e 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -3438,10 +3438,6 @@ EXPORT_SYMBOL(blk_finish_plug);
>   */
>  void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
>  {
> - /* not support for RQF_PM and ->rpm_status in blk-mq yet */
> - if (q->mq_ops)
> - return;
> -
>   q->dev = dev;
>   q->rpm_status = RPM_ACTIVE;
>   init_waitqueue_head(&q->rpm_active_wq);
> @@ -3478,6 +3474,19 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>   if (!q->dev)
>   return ret;
>  
> + if (q->mq_ops) {
> + percpu_ref_switch_to_atomic_nowait(&q->q_usage_counter);
> + if (!percpu_ref_is_zero(&q->q_usage_counter)) {
> + ret = -EBUSY;
> + pm_runtime_mark_last_busy(q->dev);
> + } else {
> + spin_lock_irq(q->queue_lock);
> + q->rpm_status = RPM_SUSPENDING;
> + spin_unlock_irq(q->queue_lock);
> + }
> + return ret;
> + }
> +
>   spin_lock_irq(q->queue_lock);
>   if (q->nr_pending) {
>   ret = -EBUSY;
> @@ -3561,6 +3570,9 @@ void blk_post_runtime_resume(struct request_queue *q, 
> int err)
>   if (!q->dev)
>   return;
>  
> + if (q->mq_ops)
> + percpu_ref_switch_to_percpu(&q->q_usage_counter);
> +
>   spin_lock_irq(q->queue_lock);
>   if (!err) {
>   q->rpm_status = RPM_ACTIVE;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3f18cff80050..cbd680dc194a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -383,6 +383,29 @@ static struct request *blk_mq_get_request(struct 
> request_queue *q,
>   return rq;
>  }
>  
> +#ifdef CONFIG_PM
> +static bool blk_mq_wait_until_active(struct request_queue *q, bool wait)
> +{
> + if (!wait)
> + return false;
> + /*
> +  * Note: the q->rpm_status check below races against the changes of
> +  * that variable by the blk_{pre,post}_runtime_{suspend,resume}()
> +  * functions. The worst possible consequence of these races is that a
> +  * small number of requests gets passed to the block driver associated
> +  * with the request queue after rpm_status has been changed into
> +  * RPM_SUSPENDING and before it is changed into RPM_SUSPENDED.
> +  */
> + wait_event(q->rpm_active_wq, q->rpm_status == RPM_ACTIVE);
> + return true;
> +}
> +#else
> +static bool blk_mq_wait_until_active(struct request_queue *q, bool nowait)
> +{
> + return true;
> +}
> +#endif
> +
>  struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int 
> op,
>   unsigned int flags)
>  {
> @@ -390,6 +413,17 @@ struct request *blk_mq_alloc_request(struct 
> request_queue *q, unsigned int op,
>   struct request *rq;
>   int ret;
>  
> + WARN_ON_ONCE((op & REQ_PM) && blk_pm_suspended(q));
> +
> + /*
> +  * Wait if the request queue is suspended or in the process of
> +  * suspending/resuming and the request being allocated will not be
> +  * used for power management purposes.
> +  */
> + if (!(op & REQ_PM) &&
> + !blk_mq_wait_until_active(q, !(op & REQ_NOWAIT)))
> + return ERR_PTR(-EAGAIN);
> +
>   ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
>   if (ret)
>   return ERR_PTR(ret);
> -- 
> 2.14.1
> 

One issue is that pm_runtime_mark_last_busy() isn't called
accurately, because the code can't check whether the request being freed
is the last active one, and call it if so.


-- 
Ming


[PATCH] blk-mq: put the driver tag of nxt rq before first one is requeued

2017-09-12 Thread Jianchao Wang
When freeing the driver tag of the next rq with an I/O scheduler
configured, we take the first entry of the list. However, at that
moment the failed rq has been requeued at the head of the list,
so the rq we get is the failed rq, not the next rq.
Free the driver tag of the next rq before the failed one is requeued
in the failure branch of the queue_rq callback; it is only needed
there.

Signed-off-by: Jianchao Wang 
---
 block/blk-mq.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b11..19f848e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -983,7 +983,7 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx 
*hctx)
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 {
struct blk_mq_hw_ctx *hctx;
-   struct request *rq;
+   struct request *rq, *nxt;
int errors, queued;
 
if (list_empty(list))
@@ -1029,14 +1029,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
if (list_empty(list))
bd.last = true;
else {
-   struct request *nxt;
-
nxt = list_first_entry(list, struct request, queuelist);
bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
}
 
	ret = q->mq_ops->queue_rq(hctx, &bd);
if (ret == BLK_STS_RESOURCE) {
+   /*
+* If an I/O scheduler has been configured and we got a
+* driver tag for the next request already, free it 
again.
+*/
+   if (!list_empty(list)) {
+   nxt = list_first_entry(list, struct request, 
queuelist);
+   blk_mq_put_driver_tag(nxt);
+   }
blk_mq_put_driver_tag_hctx(hctx, rq);
list_add(&rq->queuelist, list);
__blk_mq_requeue_request(rq);
@@ -1059,13 +1065,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
 * that is where we will continue on next queue run.
 */
if (!list_empty(list)) {
-   /*
-* If an I/O scheduler has been configured and we got a driver
-* tag for the next request already, free it again.
-*/
-   rq = list_first_entry(list, struct request, queuelist);
-   blk_mq_put_driver_tag(rq);
-
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
-- 
2.7.4



Re: [PATCH V2 00/12] scsi-mq support for ZBC disks

2017-09-12 Thread Damien Le Moal
Christoph,

On 9/11/17 21:24, Christoph Hellwig wrote:
> On Fri, Sep 08, 2017 at 09:12:12AM -0700, Damien Le Moal wrote:
>> 1) The zone size and the number of zones of the device (for the bitmaps
>> allocation and offset->zone number conversion).
>> 2) Zone type for the optimization that avoids locking conventional zones.
>>
>> (2) is optional. We can do without, but still really nice to have from a
>> performance perspective as conventional zones tend to be used for
>> storing metadata. So a lot of small random writes is more likely and
>> high queue depth writing would improve performance significantly.
>>
>> For (1), the zone size is known through q->limits.chunk_sectors. But the
>> disk capacity is not known using request_queue only, so the number of
>> zones cannot be calculated... I thought of exporting it through queue
>> limits too, but then stacking of device mappers using ZBC drives becomes
>> a pain as the number of zones needs to be recalculated.
> 
> For 4.14-rc+ you should be able to easily get at the gendisk as the
> whole submission path is now gendisk based, although it might need
> some minor argument reshuffle to pass it instead of the request_queue
> in a few places.  Note that for everything passing the request
> you can get the gendisk from the request as it contains a pointer.

Thank you for the pointers. Ming also mentioned this. I will look into
it for V3. But in any case, this would only allow allocating the zone
write lock bitmap as the device number of zones can now be calculated
using the device capacity (obtained from gendisk) and the device zone
size (obtained from queue limits).

I was thinking of something else very simple, similar to the integrity
code: using a "struct blk_zoned" added to the request queue and
initialized by the low level device driver on disk revalidate. This
structure only needs the number of zones and a zone type bitmap to
indicate sequential zones.

struct blk_zoned {
unsigned int nr_zones;
unsigned long *seq_zones;
};

struct request_queue {
...
#ifdef CONFIG_BLK_DEV_ZONED
struct blk_zoned zoned;
#endif
...
};

This would allow implementing (2) trivially and also makes the
information available to any user of the disk (FS and device mappers).

Would this be acceptable ?

> The only annoying issue is that some of our passthrough character
> device callers don't have a gendisk at all, which causes
> problems all over and should probably be fixed at some point.

pass-through requests are completely ignored for zoned block devices and
processed as they come. Any kind of reordering control is the
responsibility of users.

Best regards

-- 
Damien Le Moal,
Western Digital


Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq

2017-09-12 Thread Damien Le Moal
Ming,

On 9/10/17 14:10, Ming Lei wrote:
> On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote:
>> Ming,
>>
>> On 9/8/17 05:43, Ming Lei wrote:
>>> Hi Damien,
>>>
>>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
 In the case of a ZBC disk used with scsi-mq, zone write locking does
 not prevent write reordering in sequential zones. Unlike the legacy
 case, zone locking can only be done after the command request is
 removed from the scheduler dispatch queue. That is, at the time of
 zone locking, the write command may already be out of order.
>>>
>>> Per my understanding, for legacy case, it can be quite tricky to let
>>> the existed I/O scheduler guarantee the write order for ZBC disk.
>>> I guess requeue still might cause write reorder even in legacy path,
>>> since requeue can happen in both scsi_request_fn() and scsi_io_completion()
>>> with q->queue_lock released, meantime new rq belonging to the same
>>> zone can come and be inserted to queue.
>>
>> Yes, the write ordering will always depend on the scheduler doing the
>> right thing. But both cfq, deadline and even noop do the right thing
>> there, even considering the aging case. The next write for a zone will
>> always be the oldest in the queue for that zone, if it is not, it means
>> that the application did not write sequentially. Extensive testing in
>> the legacy case never showed a problem due to the scheduler itself.
> 
> OK, I suggest to document this guarantee of no write reorder for ZBC
> somewhere, so that people will keep it in mind when trying to change
> the current code.

Have you looked at the comments in sd_zbc.c ? That is explained there.
Granted, this is a little deep in the stack, but this is after all
dependent on the implementation of scsi_request_fn(). I can add comments
there too if you prefer.

>> scsi_requeue_command() does the unprep (zone unlock) and requeue while
>> holding the queue lock. So this is atomic with new write command
>> insertion. Requeued commands are added to the dispatch queue head, and
>> since a zone will only have a single write in-flight, there is no
>> reordering possible. The next write command for a zone to go again is
>> the last requeued one or the next in lba order. It works.
> 
> One special case is a write with FLUSH/FUA, which may be added to the
> front of q->queue_head directly. Suppose one write with FUA comes
> just between requeue and run queue; a write reorder may be
> triggered.

Zoned disks are recent and all of them support FUA. This means that a
write with FUA will be processed like any other write request (if I read
the code in blk-flush.c correctly). Well, at least for the mq case,
which does a blk_mq_sched_insert_request(), while the direct call to
list_add_tail() for the legacy case is suspicious. Why not
blk_insert_request() ? This may be a problem, but all the testing I have
done on the legacy path never hit this one. The reason is I think that
the two in-kernel users of zoned devices (f2fs and dm-zoned) do not use
REQ_FUA, nor issue flush requests with data.

> If this assumption is true, there might be such issue on blk-mq
> too.

For the same above reason, I think blk-mq is OK too. But granted, this
seems all a little brittle. I will look into this case more carefully to
solidify the overall support.

Thank you for your comments.

-- 
Damien Le Moal,
Western Digital


Re: [PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue()

2017-09-12 Thread Damien Le Moal
Ming,

On 9/10/17 14:16, Ming Lei wrote:
> On Fri, Sep 08, 2017 at 01:16:39AM +0900, Damien Le Moal wrote:
>> Using scsi_device_from_queue(), return the scsi_disk structure
>> associated with a request queue if the device is a disk.
>> Export this function to make it available to modules.
>>
> 
> This approach may be a little overkill; actually the gendisk is the
> parent of the request queue in the kobject tree (see blk_register_queue()),
> so that might be an easy way to retrieve the disk attached to the queue.

Yes. It is a little extreme.
I am currently preparing a V3 that will have a softer touch.

Best regards.

-- 
Damien Le Moal,
Western Digital