Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-14 Thread Ming Lei
On Thu, Sep 14, 2017 at 01:37:14PM +, Bart Van Assche wrote:
> On Thu, 2017-09-14 at 09:15 +0800, Ming Lei wrote:
> > On Wed, Sep 13, 2017 at 07:07:53PM +, Bart Van Assche wrote:
> > > On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> > > > No, that patch only changes blk_insert_cloned_request() which is used
> > > > by dm-rq(mpath) only, nothing to do with the reported issue during
> > > > suspend and sending SCSI Domain validation.
> > > 
> > > There may be other ways to fix the SCSI domain validation code.
> > 
> > Again, the issue isn't in domain validation, it is in quiesce,
> > so we need to fix quiesce instead of working around transport_spi.
> > 
> > Also, what is the other way? Why not this patchset?
> 
> Sorry if I had not made this clear enough but I don't like the approach of
> this patch series so please do not expect any "Reviewed-by" tags from me.
> As the discussion about v4 of this patch series made clear the interaction
> between blk_cleanup_queue() and the changes introduced by this patch series
> in blk_get_request() is subtle and hard to analyze. The blk-mq core is

No, it isn't subtle at all. As I explained, the queue can be set as
dying while a request is being allocated in both the legacy and blk-mq
paths, and drivers are required to handle requests after the queue
becomes dying; this behavior has been in place for a long time.

Is that really hard to analyze?

> already complicated. In my view patches that make the blk-mq core simpler
> are much more welcome than patches that make the blk-mq core more
> complicated.

Sorry, I can't agree that this patchset is too complicated; it only
touches the quiesce interface. As for other changes, such as holding
the queue usage counter, it follows blk-mq's approach, and we can reuse
that approach for the legacy path too.

> 
> Since I expect that any fix for the interaction between blk-mq and power
> management will be integrated in kernel v4.15 at earliest there is no reason

Again, it isn't related to PM only; it is actually about SCSI quiesce.

> to rush. My proposal is to wait a few weeks and to see whether anyone comes
> up with a better solution.

I am open to any solution and happy to review one if someone posts it,
but it should cover at least the two kinds of reported issues.

However, I won't wait for that, since people have been troubled by this
issue a lot. In Oleksandr's case, the system is simply dead after one
suspend. And the I/O hang when sending SCSI domain validation was
actually reported from a production system too.


-- 
Ming


Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-14 Thread Bart Van Assche
On Thu, 2017-09-14 at 09:15 +0800, Ming Lei wrote:
> On Wed, Sep 13, 2017 at 07:07:53PM +, Bart Van Assche wrote:
> > On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> > > No, that patch only changes blk_insert_cloned_request() which is used
> > > by dm-rq(mpath) only, nothing to do with the reported issue during
> > > suspend and sending SCSI Domain validation.
> > 
> > There may be other ways to fix the SCSI domain validation code.
> 
> Again, the issue isn't in domain validation, it is in quiesce,
> so we need to fix quiesce instead of working around transport_spi.
> 
> Also, what is the other way? Why not this patchset?

Sorry if I had not made this clear enough but I don't like the approach of
this patch series so please do not expect any "Reviewed-by" tags from me.
As the discussion about v4 of this patch series made clear the interaction
between blk_cleanup_queue() and the changes introduced by this patch series
in blk_get_request() is subtle and hard to analyze. The blk-mq core is
already complicated. In my view patches that make the blk-mq core simpler
are much more welcome than patches that make the blk-mq core more
complicated.

Since I expect that any fix for the interaction between blk-mq and power
management will be integrated in kernel v4.15 at earliest there is no reason
to rush. My proposal is to wait a few weeks and to see whether anyone comes
up with a better solution.

Bart.

Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Ming Lei
On Wed, Sep 13, 2017 at 07:07:53PM +, Bart Van Assche wrote:
> On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> > No, that patch only changes blk_insert_cloned_request() which is used
> > by dm-rq(mpath) only, nothing to do with the reported issue during
> > suspend and sending SCSI Domain validation.
> 
> There may be other ways to fix the SCSI domain validation code.

Again, the issue isn't in domain validation, it is in quiesce,
so we need to fix quiesce instead of working around transport_spi.

Also, what is the other way? Why not this patchset?

> 
> Regarding the hang during resume: I have not been able to reproduce that
> hang with Jens' latest for-next branch. If that hang would turn out not to
> be reproducible with v4.13-rc1, why to change the behavior of the block
> layer with regard to PREEMPT requests?

Previously you objected to this patchset because you said there was a
race; now you want to switch to another reason?

So I suppose the race you worried about doesn't exist.

PREEMPT is special but widely used. Unlike NVMe, traditional SCSI
doesn't have a queue dedicated to administration purposes, so
RQF_PREEMPT requests need to be dispatched to the same queue even when
normal I/O is not allowed.

Because this is a common/generic issue with quiesce, we have to avoid
I/O hangs with quiesce. This approach is generic enough to cover all
issues of this kind, and it has no side effect on the block layer's
normal I/O path.

Can you guarantee the following two points, so that no I/O hang can
happen during suspend/resume?

- no normal I/O is triggered during suspend

- all I/O requests in queue will be drained out before
entering suspend

-- 
Ming


Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Bart Van Assche
On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> No, that patch only changes blk_insert_cloned_request() which is used
> by dm-rq(mpath) only, nothing to do with the reported issue during
> suspend and sending SCSI Domain validation.

There may be other ways to fix the SCSI domain validation code.

Regarding the hang during resume: I have not been able to reproduce that
hang with Jens' latest for-next branch. If that hang would turn out not to
be reproducible with v4.13-rc1, why to change the behavior of the block
layer with regard to PREEMPT requests?

Bart.

Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Ming Lei
On Wed, Sep 13, 2017 at 05:28:24PM +, Bart Van Assche wrote:
> On Thu, 2017-09-14 at 00:48 +0800, Ming Lei wrote:
> > Could you please let me know if your concern about race between
> > preempt freeze and blk_cleanup_queue() is addressed in my last
> > reply?
> 
> Shouldn't we wait until v4.13-rc1 has been released before spending more
> energy on this? Certain patches that have been sent to Linus, e.g. commit

The discussion can happen at any time, right?

I believe I have replied to all of your comments; if you don't agree
with any of my follow-ups, please point it out.

> 157f377beb71 ("block: directly insert blk-mq request from
> blk_insert_cloned_request()"), may affect the test results. I have not

No, that patch only changes blk_insert_cloned_request() which is used
by dm-rq(mpath) only, nothing to do with the reported issue during
suspend and sending SCSI Domain validation.


-- 
Ming


Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Bart Van Assche
On Thu, 2017-09-14 at 00:48 +0800, Ming Lei wrote:
> Could you please let me know if your concern about race between
> preempt freeze and blk_cleanup_queue() is addressed in my last
> reply?

Shouldn't we wait until v4.13-rc1 has been released before spending more
energy on this? Certain patches that have been sent to Linus, e.g. commit
157f377beb71 ("block: directly insert blk-mq request from
blk_insert_cloned_request()"), may affect the test results. I have not
been able to reproduce the hang during resume with Jens' latest for-next
branch.

Bart.

Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Ming Lei
On Tue, Sep 12, 2017 at 11:40:57AM +0800, Ming Lei wrote:
> On Mon, Sep 11, 2017 at 04:03:55PM +, Bart Van Assche wrote:
> > On Mon, 2017-09-11 at 19:10 +0800, Ming Lei wrote:
> > > @@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, 
> > > unsigned flags)
> > > >   if (percpu_ref_tryget_live(&q->q_usage_counter))
> > >   return 0;
> > >  
> > > + /*
> > > +  * If queue is preempt frozen and caller need to allocate
> > > +  * request for RQF_PREEMPT, we grab the .q_usage_counter
> > > +  * unconditionally and return successfully.
> > > +  *
> > > +  * There isn't race with queue cleanup because:
> > > +  *
> > > +  * 1) it is guaranteed that preempt freeze can't be
> > > +  * started after queue is set as dying
> > > +  *
> > > +  * 2) normal freeze runs exclusively with preempt
> > > +  * freeze, so even after queue is set as dying
> > > +  * afterwards, blk_queue_cleanup() won't move on
> > > +  * until preempt freeze is done
> > > +  *
> > > +  * 3) blk_queue_dying() needn't to be checked here
> > > +  *  - for legacy path, it will be checked in
> > > +  *  __get_request()
> > 
> > For the legacy block layer core, what do you think will happen if the
> > "dying" state is set by another thread after __get_request() has passed the
> > blk_queue_dying() check?
> 
> Without this patchset, the block core still needs to handle the above
> situation, so your question isn't related to this patchset.
> 
> Also, q->queue_lock is required both for setting dying and for
> checking dying in __get_request(). But the lock can be released inside
> __get_request(), so it is possible to allocate a request after the
> queue is set as dying, and for legacy the request can be dispatched to
> a dying queue too.
> 
> > 
> > > +  *  - blk-mq depends on driver to handle dying well
> > > +  *  because it is normal for queue to be set as dying
> > > +  *  just between blk_queue_enter() and allocating new
> > > +  *  request.
> > 
> > The above comment is not correct. The block layer core handles the "dying"
> > state. Block drivers other than dm-mpath should not have to query this state
> > directly.
> 
> If blk-mq doesn't query the dying state, how does it know the queue
> is dying and handle it? Also, blk-mq is no different from legacy wrt.
> depending on the driver to handle dying.
> 
> > 
> > > +  */
> > > + if ((flags & BLK_REQ_PREEMPT) &&
> > > + blk_queue_is_preempt_frozen(q)) {
> > > + blk_queue_enter_live(q);
> > > + return 0;
> > > + }
> > > +
> > 
> > Sorry but to me it looks like the above code introduces a race condition
> > between blk_queue_cleanup() and blk_get_request() for at least blk-mq.
> > Consider e.g. the following scenario:
> > * A first thread preempt-freezes a request queue.
> > * A second thread calls blk_get_request() with BLK_REQ_PREEMPT set. That
> >   results in a call of blk_queue_is_preempt_frozen().
> > * A context switch occurs to the first thread.
> > * The first thread preempt-unfreezes the same request queue and calls
> >   blk_queue_cleanup(). That last function changes the request queue state
> >   into DYING and waits until all pending requests have finished.
> > * The second thread continues and calls blk_queue_enter_live(), allocates
> >   a request and submits it.
> 
> OK, that looks like a race I didn't think of, but it can be fixed
> easily by calling blk_queue_enter_live() while holding q->freeze_lock,
> and it won't cause a performance issue either since it is in the slow
> path.
> 
> For example, we can introduce the following code in blk_queue_enter():
> 
>   if ((flags & BLK_REQ_PREEMPT) &&
>   blk_queue_enter_preempt_freeze(q))
>   return 0;
> 
> static inline bool blk_queue_enter_preempt_freeze(struct request_queue *q)
> {
>   bool preempt_frozen;
> 
>   spin_lock(&q->freeze_lock);
>   preempt_frozen = q->preempt_freezing && !q->preempt_unfreezing;
>   if (preempt_frozen)
>   blk_queue_enter_live(q);
>   spin_unlock(&q->freeze_lock);
> 
>   return preempt_frozen;
> }
> 
> > 
> > In other words, a request gets submitted against a dying queue. This must
> > not happen. See also my explanation of queue shutdown from a few days ago
> 
> That is not correct; think about why queue dead, instead of queue
> dying, is checked in __blk_run_queue_uncond(). We still need to submit
> requests to the driver while the queue is dying, and the driver knows
> how to handle that.
> 
> > (https://marc.info/?l=linux-block=150449845831789).
> 
> > from (https://marc.info/?l=linux-block=150449845831789).
> >> Do you understand how request queue cleanup works? The algorithm used for
> >> request queue cleanup is as follows:
> >> * Set the DYING flag. This flag 

Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-11 Thread Ming Lei
On Mon, Sep 11, 2017 at 04:03:55PM +, Bart Van Assche wrote:
> On Mon, 2017-09-11 at 19:10 +0800, Ming Lei wrote:
> > @@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, unsigned 
> > flags)
> > if (percpu_ref_tryget_live(&q->q_usage_counter))
> > return 0;
> >  
> > +   /*
> > +* If queue is preempt frozen and caller need to allocate
> > +* request for RQF_PREEMPT, we grab the .q_usage_counter
> > +* unconditionally and return successfully.
> > +*
> > +* There isn't race with queue cleanup because:
> > +*
> > +* 1) it is guaranteed that preempt freeze can't be
> > +* started after queue is set as dying
> > +*
> > +* 2) normal freeze runs exclusively with preempt
> > +* freeze, so even after queue is set as dying
> > +* afterwards, blk_queue_cleanup() won't move on
> > +* until preempt freeze is done
> > +*
> > +* 3) blk_queue_dying() needn't to be checked here
> > +*  - for legacy path, it will be checked in
> > +*  __get_request()
> 
> For the legacy block layer core, what do you think will happen if the
> "dying" state is set by another thread after __get_request() has passed the
> blk_queue_dying() check?

Without this patchset, the block core still needs to handle the above
situation, so your question isn't related to this patchset.

Also, q->queue_lock is required both for setting dying and for checking
dying in __get_request(). But the lock can be released inside
__get_request(), so it is possible to allocate a request after the
queue is set as dying, and for legacy the request can be dispatched to
a dying queue too.

> 
> > +*  - blk-mq depends on driver to handle dying well
> > +*  because it is normal for queue to be set as dying
> > +*  just between blk_queue_enter() and allocating new
> > +*  request.
> 
> The above comment is not correct. The block layer core handles the "dying"
> state. Block drivers other than dm-mpath should not have to query this state
> directly.

If blk-mq doesn't query the dying state, how does it know the queue is
dying and handle it? Also, blk-mq is no different from legacy wrt.
depending on the driver to handle dying.

> 
> > +*/
> > +   if ((flags & BLK_REQ_PREEMPT) &&
> > +   blk_queue_is_preempt_frozen(q)) {
> > +   blk_queue_enter_live(q);
> > +   return 0;
> > +   }
> > +
> 
> Sorry but to me it looks like the above code introduces a race condition
> between blk_queue_cleanup() and blk_get_request() for at least blk-mq.
> Consider e.g. the following scenario:
> * A first thread preempt-freezes a request queue.
> * A second thread calls blk_get_request() with BLK_REQ_PREEMPT set. That
>   results in a call of blk_queue_is_preempt_frozen().
> * A context switch occurs to the first thread.
> * The first thread preempt-unfreezes the same request queue and calls
>   blk_queue_cleanup(). That last function changes the request queue state
>   into DYING and waits until all pending requests have finished.
> * The second thread continues and calls blk_queue_enter_live(), allocates
>   a request and submits it.

OK, that looks like a race I didn't think of, but it can be fixed easily
by calling blk_queue_enter_live() while holding q->freeze_lock, and it
won't cause a performance issue either since it is in the slow path.

For example, we can introduce the following code in blk_queue_enter():

if ((flags & BLK_REQ_PREEMPT) &&
blk_queue_enter_preempt_freeze(q))
return 0;

static inline bool blk_queue_enter_preempt_freeze(struct request_queue *q)
{
bool preempt_frozen;

	spin_lock(&q->freeze_lock);
	preempt_frozen = q->preempt_freezing && !q->preempt_unfreezing;
	if (preempt_frozen)
		blk_queue_enter_live(q);
	spin_unlock(&q->freeze_lock);

return preempt_frozen;
}

> 
> In other words, a request gets submitted against a dying queue. This must
> not happen. See also my explanation of queue shutdown from a few days ago

That is not correct; think about why queue dead, instead of queue dying,
is checked in __blk_run_queue_uncond(). We still need to submit requests
to the driver while the queue is dying, and the driver knows how to
handle that.

> (https://marc.info/?l=linux-block=150449845831789).

> from (https://marc.info/?l=linux-block=150449845831789).
>> Do you understand how request queue cleanup works? The algorithm used for
>> request queue cleanup is as follows:
>> * Set the DYING flag. This flag makes all later blk_get_request() calls
>>   fail.

Your description isn't true for either legacy or blk-mq:

For legacy, as you can see, q->queue_lock can be released in
__get_request(); at that time, the 

Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-11 Thread Bart Van Assche
On Mon, 2017-09-11 at 19:10 +0800, Ming Lei wrote:
> @@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, unsigned 
> flags)
>   if (percpu_ref_tryget_live(&q->q_usage_counter))
>   return 0;
>  
> + /*
> +  * If queue is preempt frozen and caller need to allocate
> +  * request for RQF_PREEMPT, we grab the .q_usage_counter
> +  * unconditionally and return successfully.
> +  *
> +  * There isn't race with queue cleanup because:
> +  *
> +  * 1) it is guaranteed that preempt freeze can't be
> +  * started after queue is set as dying
> +  *
> +  * 2) normal freeze runs exclusively with preempt
> +  * freeze, so even after queue is set as dying
> +  * afterwards, blk_queue_cleanup() won't move on
> +  * until preempt freeze is done
> +  *
> +  * 3) blk_queue_dying() needn't to be checked here
> +  *  - for legacy path, it will be checked in
> +  *  __get_request()

For the legacy block layer core, what do you think will happen if the
"dying" state is set by another thread after __get_request() has passed the
blk_queue_dying() check?

> +  *  - blk-mq depends on driver to handle dying well
> +  *  because it is normal for queue to be set as dying
> +  *  just between blk_queue_enter() and allocating new
> +  *  request.

The above comment is not correct. The block layer core handles the "dying"
state. Block drivers other than dm-mpath should not have to query this state
directly.

> +  */
> + if ((flags & BLK_REQ_PREEMPT) &&
> + blk_queue_is_preempt_frozen(q)) {
> + blk_queue_enter_live(q);
> + return 0;
> + }
> +

Sorry but to me it looks like the above code introduces a race condition
between blk_queue_cleanup() and blk_get_request() for at least blk-mq.
Consider e.g. the following scenario:
* A first thread preempt-freezes a request queue.
* A second thread calls blk_get_request() with BLK_REQ_PREEMPT set. That
  results in a call of blk_queue_is_preempt_frozen().
* A context switch occurs to the first thread.
* The first thread preempt-unfreezes the same request queue and calls
  blk_queue_cleanup(). That last function changes the request queue state
  into DYING and waits until all pending requests have finished.
* The second thread continues and calls blk_queue_enter_live(), allocates
  a request and submits it.

In other words, a request gets submitted against a dying queue. This must
not happen. See also my explanation of queue shutdown from a few days ago
(https://marc.info/?l=linux-block=150449845831789).

Bart.

[PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-11 Thread Ming Lei
RQF_PREEMPT is a bit special because the request is required
to be dispatched to the LLD even when the SCSI device is quiesced.

So this patch introduces __blk_get_request() to allow the block
layer to allocate a request while the queue is preempt frozen, since
a following patch will preempt-freeze the queue before quiescing the
SCSI device, to support safe SCSI quiescing.

Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 48 ++--
 block/blk-mq.c |  3 +--
 include/linux/blk-mq.h |  7 ---
 include/linux/blkdev.h | 17 ++---
 4 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ade9b5484a6e..1c8e264753f0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, unsigned 
flags)
	if (percpu_ref_tryget_live(&q->q_usage_counter))
return 0;
 
+   /*
+* If queue is preempt frozen and caller need to allocate
+* request for RQF_PREEMPT, we grab the .q_usage_counter
+* unconditionally and return successfully.
+*
+* There isn't race with queue cleanup because:
+*
+* 1) it is guaranteed that preempt freeze can't be
+* started after queue is set as dying
+*
+* 2) normal freeze runs exclusively with preempt
+* freeze, so even after queue is set as dying
+* afterwards, blk_queue_cleanup() won't move on
+* until preempt freeze is done
+*
+* 3) blk_queue_dying() needn't to be checked here
+*  - for legacy path, it will be checked in
+*  __get_request()
+*  - blk-mq depends on driver to handle dying well
+*  because it is normal for queue to be set as dying
+*  just between blk_queue_enter() and allocating new
+*  request.
+*/
+   if ((flags & BLK_REQ_PREEMPT) &&
+   blk_queue_is_preempt_frozen(q)) {
+   blk_queue_enter_live(q);
+   return 0;
+   }
+
if (flags & BLK_REQ_NOWAIT)
return -EBUSY;
 
@@ -1410,7 +1439,8 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 }
 
 static struct request *blk_old_get_request(struct request_queue *q,
-  unsigned int op, gfp_t gfp_mask)
+  unsigned int op, gfp_t gfp_mask,
+  unsigned int flags)
 {
struct request *rq;
int ret = 0;
@@ -1420,8 +1450,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   BLK_REQ_NOWAIT : 0);
+   ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -1439,26 +1468,25 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-   gfp_t gfp_mask)
+struct request *__blk_get_request(struct request_queue *q, unsigned int op,
+ gfp_t gfp_mask, unsigned int flags)
 {
struct request *req;
 
+   flags |= gfp_mask & __GFP_DIRECT_RECLAIM ? 0 : BLK_REQ_NOWAIT;
if (q->mq_ops) {
-   req = blk_mq_alloc_request(q, op,
-   (gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   0 : BLK_MQ_REQ_NOWAIT);
+   req = blk_mq_alloc_request(q, op, flags);
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
-   req = blk_old_get_request(q, op, gfp_mask);
+   req = blk_old_get_request(q, op, gfp_mask, flags);
if (!IS_ERR(req) && q->initialize_rq_fn)
q->initialize_rq_fn(req);
}
 
return req;
 }
-EXPORT_SYMBOL(blk_get_request);
+EXPORT_SYMBOL(__blk_get_request);
 
 /**
  * blk_requeue_request - put a request back on queue
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 096c5f0ea518..720559724f97 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -482,8 +482,7 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
int ret;
 
-   ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
-   BLK_REQ_NOWAIT : 0);
+