Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 10:39:36PM +0200, h...@lst.de wrote:
> On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> > On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > > that so colourfully calls everyone else "dangerous" while advocating
> > > for silently losing requests on purpose.
> > > 
> > > But where's the option that fixes scsi to handle hardware completions
> > > concurrently with arbitrary timeout software? Propping up that house of
> > > cards can't be the only recourse.
> > 
> > The important bit is that we need to fix this issue quickly.  We are
> > past -rc5 so I'm rather concerned about anything too complicated.
> > 
> > I'm not even sure SCSI has a problem with multiple completions happening
> > at the same time, but it certainly has a problem with bypassing
> > blk_mq_complete_request from the EH path.
> > 
> > I think we can solve this properly, but I also think we are way too late
> > in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> > to revert the changes and try again for 4.19 or even 4.20 if we don't
> > act quickly enough.
> 
> So here is a quick attempt at the revert while also trying to keep
> nvme working.  Keith, Bart, Jianchao - does this look reasonable
> as a 4.18 band-aid?
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert

Hm, I'm not really a fan. The vast majority of blk-mq drivers don't even
implement a timeout, and reverting will lose their requests forever
if they complete at the same time as a timeout.

Of the remaining drivers, most of those don't want the reverted behavior
either. It actually looks like scsi is the only mq driver that wants
to block completions. In the short term, scsi can make that happen with
just three lines of code.

---
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..03986af3076c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,10 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
enum blk_eh_timer_return rtn = BLK_EH_DONE;
struct Scsi_Host *host = scmd->device->host;

+   if (req->q->mq_ops && cmpxchg(&req->state, MQ_RQ_IN_FLIGHT,
+   MQ_RQ_COMPLETE) != MQ_RQ_IN_FLIGHT)
+   return rtn;
+
trace_scsi_dispatch_cmd_timeout(scmd);
scsi_log_completion(scmd, TIMEOUT_ERROR);


-- 


Re: [PATCH 2/3] block, scsi: Rework runtime power management

2018-07-18 Thread Ming Lei
On Wed, Jul 18, 2018 at 03:45:15PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 20:16 +0800, Ming Lei wrote:
> > On Wed, Jul 18, 2018 at 7:49 AM, Bart Van Assche wrote:
> > > @@ -3801,8 +3778,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> > > if (!q->dev)
> > > return ret;
> > > 
> > > +   blk_set_preempt_only(q);
> > > +   blk_freeze_queue_start(q);
> > > +
> > > spin_lock_irq(q->queue_lock);
> > > -   if (q->nr_pending) {
> > > +   if (!percpu_ref_is_zero(&q->q_usage_counter)) {
> > 
> > This way can't work reliably because the percpu ref isn't in atomic mode
> > yet after blk_freeze_queue_start() returns, so percpu_ref_is_zero() won't
> > see an accurate value of the counter, and the device may be put down before
> > in-flight requests are completed by the hardware.
> 
> Hello Ming,
> 
> The blk_freeze_queue_start() implementation is as follows:
> 
> void blk_freeze_queue_start(struct request_queue *q)
> {
>   int freeze_depth;
> 
>   freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
>   if (freeze_depth == 1) {
>   percpu_ref_kill(&q->q_usage_counter);
>   if (q->mq_ops)
>   blk_mq_run_hw_queues(q, false);
>   }
> }
> 
> From the documentation header in include/linux/percpu-refcount.h above
> percpu_ref_kill():
> 
>  * Switches @ref into atomic mode before gathering up the percpu counters
>  * and dropping the initial ref.

IMO, the comment isn't accurate enough. Yes, the counter switches to atomic
mode after percpu_ref_kill() returns, but the counter can't be read
accurately before the RCU confirmation is done.

One extra get is done in __percpu_ref_switch_to_atomic(), and the paired put
is run in percpu_ref_call_confirm_rcu(), which is scheduled via
call_rcu_sched().

So once blk_freeze_queue_start() returns, percpu_ref_is_zero() can't return
true until the RCU confirmation is done. That means this approach may fail
to put the device down.

Thanks,
Ming


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 09:30:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> > There are not that many blk-mq drivers, so we can go through them all.
> 
> I'm not sure that's the right approach. I think it is the responsibility of
> the block layer to handle races between completion handling and timeout
> handling and that this is not the responsibility of e.g. a block driver. If
> you look at e.g. the legacy block layer then you will see that it takes care
> of this race and that legacy block drivers do not have to worry about this
> race.

Reverting doesn't handle the race at all. It just ignores completions
and puts the responsibility on the drivers to handle the race because
that's what scsi wants to happen.


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread Bart Van Assche
On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> There are not that many blk-mq drivers, so we can go through them all.

I'm not sure that's the right approach. I think it is the responsibility of
the block layer to handle races between completion handling and timeout
handling and that this is not the responsibility of e.g. a block driver. If
you look at e.g. the legacy block layer then you will see that it takes care
of this race and that legacy block drivers do not have to worry about this
race.

Bart.

Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 08:58:48PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> > If scsi needs this behavior, why not just put that behavior in scsi? It
> > can set the state to complete and then everything can play out as
> > before.
> > [ ... ]
> 
> There may be other drivers that need the same protection the SCSI core
> needs, so I think the patch at the end of your previous e-mail is a step in
> the wrong direction.
> 
> Bart.

And there may be other drivers that don't want their completions
ignored, so breaking them again is also a step in the wrong direction.

There are not that many blk-mq drivers, so we can go through them all.

Most don't even implement .timeout, so they never know that condition
ever happened. Others always return BLK_EH_RESET_TIMER without doing
anything else, so the 'new' behavior would have to be better for those,
too.

The following don't implement .timeout:

  loop, rbd, virtio, xen, dm, ubi, scm

The following always return RESET_TIMER:

  null, skd

The following is safe with the new behavior:

  mtip

And now ones I am not sure about:

  nbd, mmc, dasd

I don't know, reverting looks worse than just fixing the drivers.


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread Bart Van Assche
On Wed, 2018-07-18 at 22:39 +0200, h...@lst.de wrote:
> On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> > The important bit is that we need to fix this issue quickly.  We are
> > past -rc5 so I'm rather concerned about anything too complicated.
> > 
> > I'm not even sure SCSI has a problem with multiple completions happening
> > at the same time, but it certainly has a problem with bypassing
> > blk_mq_complete_request from the EH path.
> > 
> > I think we can solve this properly, but I also think we are way too late
> > in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> > to revert the changes and try again for 4.19 or even 4.20 if we don't
> > act quickly enough.
> 
> So here is a quick attempt at the revert while also trying to keep
> nvme working.  Keith, Bart, Jianchao - does this look reasonable
> as a 4.18 band-aid?
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert

Hello Christoph,

A patch series that first reverts the following patches:
* blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
* block: fix timeout changes for legacy request drivers
* blk-mq: don't time out requests again that are in the timeout handler
* blk-mq: simplify blk_mq_rq_timed_out
* block: remove BLK_EH_HANDLED
* block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE
* blk-mq: Remove generation sequence

and then renames BLK_EH_NOT_HANDLED back to BLK_EH_DONE would probably be
a lot easier to review.

Thanks,

Bart.


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread Bart Van Assche
On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> If scsi needs this behavior, why not just put that behavior in scsi? It
> can set the state to complete and then everything can play out as
> before.
> [ ... ]

There may be other drivers that need the same protection the SCSI core
needs, so I think the patch at the end of your previous e-mail is a step in
the wrong direction.

Bart.


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > that so colourfully calls everyone else "dangerous" while advocating
> > for silently losing requests on purpose.
> > 
> > But where's the option that fixes scsi to handle hardware completions
> > concurrently with arbitrary timeout software? Propping up that house of
> > cards can't be the only recourse.
> 
> The important bit is that we need to fix this issue quickly.  We are
> past -rc5 so I'm rather concerned about anything too complicated.
> 
> I'm not even sure SCSI has a problem with multiple completions happening
> at the same time, but it certainly has a problem with bypassing
> blk_mq_complete_request from the EH path.
> 
> I think we can solve this properly, but I also think we are way too late
> in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> to revert the changes and try again for 4.19 or even 4.20 if we don't
> act quickly enough.

If scsi needs this behavior, why not just put that behavior in scsi? It
can set the state to complete and then everything can play out as
before.

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 22326612a5d3..f50559718b71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,10 +558,8 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   if (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
-   MQ_RQ_IN_FLIGHT)
+   if (blk_mq_mark_complete(rq))
return;
-
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
 
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..a5d05fab24a7 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,14 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
enum blk_eh_timer_return rtn = BLK_EH_DONE;
struct Scsi_Host *host = scmd->device->host;
 
+   /*
+* Mark complete now so lld can't post a completion during error
+* handling. Return immediately if it was already marked complete, as
+* that means the lld posted a completion already.
+*/
+   if (req->q->mq_ops && blk_mq_mark_complete(req))
+   return rtn;
+
trace_scsi_dispatch_cmd_timeout(scmd);
scsi_log_completion(scmd, TIMEOUT_ERROR);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9b0fd11ce89a..0ce587c9c27b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,6 +289,15 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+/*
+ * Returns true if request is not in flight.
+ */
+static inline bool blk_mq_mark_complete(struct request *rq)
+{
+   return (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
+   MQ_RQ_IN_FLIGHT);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
--


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread h...@lst.de
On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > that so colourfully calls everyone else "dangerous" while advocating
> > for silently losing requests on purpose.
> > 
> > But where's the option that fixes scsi to handle hardware completions
> > concurrently with arbitrary timeout software? Propping up that house of
> > cards can't be the only recourse.
> 
> The important bit is that we need to fix this issue quickly.  We are
> past -rc5 so I'm rather concerned about anything too complicated.
> 
> I'm not even sure SCSI has a problem with multiple completions happening
> at the same time, but it certainly has a problem with bypassing
> blk_mq_complete_request from the EH path.
> 
> I think we can solve this properly, but I also think we are way too late
> in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> to revert the changes and try again for 4.19 or even 4.20 if we don't
> act quickly enough.

So here is a quick attempt at the revert while also trying to keep
nvme working.  Keith, Bart, Jianchao - does this look reasonable
as a 4.18 band-aid?

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-07-18 Thread h...@lst.de
On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> Of the two you mentioned, yours is preferable IMO. While I appreciate
> Jianchao's detailed analysis, it's hard to take a proposal seriously
> that so colourfully calls everyone else "dangerous" while advocating
> for silently losing requests on purpose.
> 
> But where's the option that fixes scsi to handle hardware completions
> concurrently with arbitrary timeout software? Propping up that house of
> cards can't be the only recourse.

The important bit is that we need to fix this issue quickly.  We are
past -rc5 so I'm rather concerned about anything too complicated.

I'm not even sure SCSI has a problem with multiple completions happening
at the same time, but it certainly has a problem with bypassing
blk_mq_complete_request from the EH path.

I think we can solve this properly, but I also think we are way too late
in the 4.18 cycle to fix it properly.  For now I fear we'll just have
to revert the changes and try again for 4.19 or even 4.20 if we don't
act quickly enough.


Re: [RFC PATCH] blk-mq: move timeout handling from queue to tagset

2018-07-18 Thread Bart Van Assche
On Wed, 2018-07-18 at 11:45 -0600, Keith Busch wrote:
> On Wed, Jul 18, 2018 at 05:18:45PM +0000, Bart Van Assche wrote:
> > On Wed, 2018-07-18 at 11:00 -0600, Keith Busch wrote:
> > > - cancel_work_sync(&q->timeout_work);
> > > -
> > >   if (q->mq_ops) {
> > >   struct blk_mq_hw_ctx *hctx;
> > >   int i;
> > > @@ -415,6 +412,8 @@ void blk_sync_queue(struct request_queue *q)
> > >   queue_for_each_hw_ctx(q, hctx, i)
> > >   cancel_delayed_work_sync(&hctx->run_work);
> > >   } else {
> > > + del_timer_sync(&q->timeout);
> > > + cancel_work_sync(&q->timeout_work);
> > >   cancel_delayed_work_sync(&q->delay_work);
> > >  }
> > >  }
> > 
> > What is the impact of this change on the md driver, which is the only driver
> > that calls blk_sync_queue() directly? What will happen if timeout processing
> > happens concurrently with or after blk_sync_queue() has returned?
> 
> That's a make_request_fn stacking driver, right? There should be
> no impact in that case, since the change above affects only mq.
> 
> I'm actually a little puzzled why md calls blk_sync_queue. Are the
> queue timers ever used for bio-based drivers?

Hello Keith,

For all md drivers that I verified, calling md_make_request() triggers one or
more generic_make_request() calls for the underlying devices. So it's not
clear to me why the md driver calls blk_sync_queue().

Bart.

Re: [RFC PATCH] blk-mq: move timeout handling from queue to tagset

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 05:18:45PM +0000, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 11:00 -0600, Keith Busch wrote:
> > -   cancel_work_sync(&q->timeout_work);
> > -
> > if (q->mq_ops) {
> > struct blk_mq_hw_ctx *hctx;
> > int i;
> > @@ -415,6 +412,8 @@ void blk_sync_queue(struct request_queue *q)
> > queue_for_each_hw_ctx(q, hctx, i)
> > cancel_delayed_work_sync(&hctx->run_work);
> > } else {
> > +   del_timer_sync(&q->timeout);
> > +   cancel_work_sync(&q->timeout_work);
> > cancel_delayed_work_sync(&q->delay_work);
> > }
> >  }
> 
> What is the impact of this change on the md driver, which is the only driver
> that calls blk_sync_queue() directly? What will happen if timeout processing
> happens concurrently with or after blk_sync_queue() has returned?

That's a make_request_fn stacking driver, right? There should be
no impact in that case, since the change above affects only mq.

I'm actually a little puzzled why md calls blk_sync_queue. Are the
queue timers ever used for bio-based drivers?
 
> > +   list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > /*
> >  * Request timeouts are handled as a forward rolling timer. If
> >  * we end up here it means that no requests are pending and
> > @@ -881,7 +868,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > blk_mq_tag_idle(hctx);
> > }
> > }
> > -   blk_queue_exit(q);
> >  }
> 
> What prevents a request queue from being removed from set->tag_list while the
> above loop examines tag_list? Can blk_cleanup_queue() be called from the
> context of another thread while this loop is examining hardware queues?

Good point. I missed that this needs to hold the tag_list_lock.
  
> > +   timer_setup(&set->timer, blk_mq_timed_out_timer, 0);
> > +   INIT_WORK(&set->timeout_work, blk_mq_timeout_work);
> > [ ... ]
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -86,6 +86,8 @@ struct blk_mq_tag_set {
> >  
> > struct blk_mq_tags  **tags;
> >  
> > +   struct timer_list   timer;
> > +   struct work_struct  timeout_work;
> 
> Can the timer and timeout_work data structures be replaced by a single
> delayed_work instance?

I think so. I wanted to keep blk_add_timer relatively unchanged for this
proposal, so I followed the existing pattern with the timer kicking the
work. I don't see why that extra indirection is necessary, so I think
it's a great idea. Unless anyone knows a reason not to, we can collapse
this into a single delayed work for both mq and legacy as a prep patch
before this one.
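
As a rough sketch of that collapse, assuming purely for illustration that
the timer/work pair in struct blk_mq_tag_set becomes a single struct
delayed_work named timeout_work:

/* Sketch only; the member layout is an assumption, not this patch. */
static void blk_mq_timeout_work(struct work_struct *work)
{
	struct blk_mq_tag_set *set =
		container_of(to_delayed_work(work),
			     struct blk_mq_tag_set, timeout_work);
	unsigned long next = 0;

	blk_mq_tagset_busy_iter(set, blk_mq_check_expired, &next);
	if (next != 0)
		/* 'next' is an absolute jiffies value, as in this patch */
		mod_delayed_work(system_wq, &set->timeout_work,
				 next - jiffies);
}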

Thanks for the feedback!


Re: [RFC PATCH] blk-mq: move timeout handling from queue to tagset

2018-07-18 Thread Bart Van Assche
On Wed, 2018-07-18 at 11:00 -0600, Keith Busch wrote:
>  void blk_sync_queue(struct request_queue *q)
>  {
> - del_timer_sync(&q->timeout);
> - cancel_work_sync(&q->timeout_work);
> -
>   if (q->mq_ops) {
>   struct blk_mq_hw_ctx *hctx;
>   int i;
> @@ -415,6 +412,8 @@ void blk_sync_queue(struct request_queue *q)
>   queue_for_each_hw_ctx(q, hctx, i)
>   cancel_delayed_work_sync(&hctx->run_work);
>   } else {
> + del_timer_sync(&q->timeout);
> + cancel_work_sync(&q->timeout_work);
>   cancel_delayed_work_sync(&q->delay_work);
>   }
>  }

What is the impact of this change on the md driver, which is the only driver
that calls blk_sync_queue() directly? What will happen if timeout processing
happens concurrently with or after blk_sync_queue() has returned?

>  static void blk_mq_timeout_work(struct work_struct *work)
>  {
> - struct request_queue *q =
> - container_of(work, struct request_queue, timeout_work);
> + struct blk_mq_tag_set *set =
> + container_of(work, struct blk_mq_tag_set, timeout_work);
> + struct request_queue *q;
>   unsigned long next = 0;
>   struct blk_mq_hw_ctx *hctx;
>   int i;
>  
> - /* A deadlock might occur if a request is stuck requiring a
> -  * timeout at the same time a queue freeze is waiting
> -  * completion, since the timeout code would not be able to
> -  * acquire the queue reference here.
> -  *
> -  * That's why we don't use blk_queue_enter here; instead, we use
> -  * percpu_ref_tryget directly, because we need to be able to
> -  * obtain a reference even in the short window between the queue
> -  * starting to freeze, by dropping the first reference in
> -  * blk_freeze_queue_start, and the moment the last request is
> -  * consumed, marked by the instant q_usage_counter reaches
> -  * zero.
> -  */
> - if (!percpu_ref_tryget(&q->q_usage_counter))
> - return;
> -
> - blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
> + blk_mq_tagset_busy_iter(set, blk_mq_check_expired, &next);
> 
>   if (next != 0) {
> - mod_timer(&q->timeout, next);
> - } else {
> + mod_timer(&set->timer, next);
> + return;
> + }
> +
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
>   /*
>* Request timeouts are handled as a forward rolling timer. If
>* we end up here it means that no requests are pending and
> @@ -881,7 +868,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
>   blk_mq_tag_idle(hctx);
>   }
>   }
> - blk_queue_exit(q);
>  }

What prevents a request queue from being removed from set->tag_list while the above
loop examines tag_list? Can blk_cleanup_queue() be called from the context
of another thread while this loop is examining hardware queues?
 
> + timer_setup(&set->timer, blk_mq_timed_out_timer, 0);
> + INIT_WORK(&set->timeout_work, blk_mq_timeout_work);
> [ ... ]
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -86,6 +86,8 @@ struct blk_mq_tag_set {
>  
>   struct blk_mq_tags  **tags;
>  
> + struct timer_list   timer;
> + struct work_struct  timeout_work;

Can the timer and timeout_work data structures be replaced by a single
delayed_work instance?

Thanks,

Bart.


[RFC PATCH] blk-mq: move timeout handling from queue to tagset

2018-07-18 Thread Keith Busch
This patch removes the per-request_queue timeout handling and uses the
tagset instead. This simplifies the timeout handling since a shared tag
set can handle all timed out requests in a single work queue rather than
iterating the same set in separate works for each user of that set.

The long term objective of this is to aid blk-mq drivers with shared
tagsets. These drivers typically want their timeout error handling to be
single threaded, and this patch provides that context so they wouldn't
need to sync with other request queues requiring the same error handling.

Timeout handling per-queue was a bit racy with completions anyway:
a tag active for one request_queue while iterating could be freed
and reallocated to a different request_queue, so a queue's timeout
handler may be operating on a request allocated to a different queue.
Though no real harm was done with how the callbacks used those requests,
this patch fixes that race.

And since the timeout code takes a reference on requests, a request
can never exit the queue while the timeout code is considering it. We no
longer need to enter each request_queue in the timeout work, so this
patch removes that unnecessary code.

Signed-off-by: Keith Busch 
---
 block/blk-core.c   |  5 ++---
 block/blk-mq.c | 45 -
 block/blk-timeout.c| 16 ++--
 include/linux/blk-mq.h |  2 ++
 4 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f84a9b7b6f5a..9cf7ff6baba0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -404,9 +404,6 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-   del_timer_sync(&q->timeout);
-   cancel_work_sync(&q->timeout_work);
-
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
int i;
@@ -415,6 +412,8 @@ void blk_sync_queue(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
cancel_delayed_work_sync(&hctx->run_work);
} else {
+   del_timer_sync(&q->timeout);
+   cancel_work_sync(&q->timeout_work);
cancel_delayed_work_sync(&q->delay_work);
}
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 95919268564b..22326612a5d3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -804,8 +804,7 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
return false;
 }
 
-static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
-   struct request *rq, void *priv, bool reserved)
+static void blk_mq_check_expired(struct request *rq, void *priv, bool reserved)
 {
unsigned long *next = priv;
 
@@ -842,33 +841,21 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
-   struct request_queue *q =
-   container_of(work, struct request_queue, timeout_work);
+   struct blk_mq_tag_set *set =
+   container_of(work, struct blk_mq_tag_set, timeout_work);
+   struct request_queue *q;
unsigned long next = 0;
struct blk_mq_hw_ctx *hctx;
int i;
 
-   /* A deadlock might occur if a request is stuck requiring a
-* timeout at the same time a queue freeze is waiting
-* completion, since the timeout code would not be able to
-* acquire the queue reference here.
-*
-* That's why we don't use blk_queue_enter here; instead, we use
-* percpu_ref_tryget directly, because we need to be able to
-* obtain a reference even in the short window between the queue
-* starting to freeze, by dropping the first reference in
-* blk_freeze_queue_start, and the moment the last request is
-* consumed, marked by the instant q_usage_counter reaches
-* zero.
-*/
-   if (!percpu_ref_tryget(&q->q_usage_counter))
-   return;
-
-   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+   blk_mq_tagset_busy_iter(set, blk_mq_check_expired, &next);
 
if (next != 0) {
-   mod_timer(&q->timeout, next);
-   } else {
+   mod_timer(&set->timer, next);
+   return;
+   }
+
+   list_for_each_entry(q, &set->tag_list, tag_set_list) {
/*
 * Request timeouts are handled as a forward rolling timer. If
 * we end up here it means that no requests are pending and
@@ -881,7 +868,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
blk_mq_tag_idle(hctx);
}
}
-   blk_queue_exit(q);
 }
 
 struct flush_busy_ctx_data {
@@ -2548,7 +2534,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
if (!q->nr_hw_queues)
goto err_hctxs;
 
-   INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);

Re: [PATCH 2/3] block, scsi: Rework runtime power management

2018-07-18 Thread Bart Van Assche
On Wed, 2018-07-18 at 20:16 +0800, Ming Lei wrote:
> On Wed, Jul 18, 2018 at 7:49 AM, Bart Van Assche wrote:
> > @@ -3801,8 +3778,11 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> > if (!q->dev)
> > return ret;
> > 
> > +   blk_set_preempt_only(q);
> > +   blk_freeze_queue_start(q);
> > +
> > spin_lock_irq(q->queue_lock);
> > -   if (q->nr_pending) {
> > +   if (!percpu_ref_is_zero(&q->q_usage_counter)) {
> 
> This way can't work reliably because the percpu ref isn't in atomic mode
> yet after blk_freeze_queue_start() returns, so percpu_ref_is_zero() won't
> see an accurate value of the counter, and the device may be put down before
> in-flight requests are completed by the hardware.

Hello Ming,

The blk_freeze_queue_start() implementation is as follows:

void blk_freeze_queue_start(struct request_queue *q)
{
int freeze_depth;

freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
if (q->mq_ops)
blk_mq_run_hw_queues(q, false);
}
}

From the documentation header in include/linux/percpu-refcount.h above
percpu_ref_kill():

 * Switches @ref into atomic mode before gathering up the percpu counters
 * and dropping the initial ref.

In other words, I think that after blk_freeze_queue_start() returns, it is
guaranteed that q->q_usage_counter is in atomic mode. However, we may need to
serialize blk_freeze_queue_start() to guarantee that this is always the case
when multiple threads call it concurrently.
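
A minimal sketch of such serialization; the per-queue lock is an assumption
for illustration, not existing code:

void blk_freeze_queue_start(struct request_queue *q)
{
	/*
	 * Hypothetical lock: a caller returning from here then knows
	 * percpu_ref_kill() has already run, even if another thread won
	 * the atomic_inc_return() race.
	 */
	mutex_lock(&q->mq_freeze_lock);
	if (atomic_inc_return(&q->mq_freeze_depth) == 1) {
		percpu_ref_kill(&q->q_usage_counter);
		if (q->mq_ops)
			blk_mq_run_hw_queues(q, false);
	}
	mutex_unlock(&q->mq_freeze_lock);
}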

Bart.

Re: Silent data corruption in blkdev_direct_IO()

2018-07-18 Thread Jan Kara
On Wed 18-07-18 13:40:07, Jan Kara wrote:
> On Wed 18-07-18 11:20:15, Johannes Thumshirn wrote:
> > On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote:
> > > Please go ahead and take care of it since you have the test cases.
> > 
> > Speaking of which, do we already know how it is triggered and can we
> > cook up a blktests testcase for it? This would be more than helpful
> > for all parties.
> 
> Using multiple iovecs with writev / readv trivially triggers the case of IO
> that is done partly as direct and partly as buffered. Neither Martin nor I
> was able to trigger the data corruption the customer is seeing with KVM,
> though (the generic code tries to maintain data integrity even if the IO is
> mixed). It should be possible to trigger the corruption by having two
> processes write to the same PAGE_SIZE region of a block device, just at
> different offsets: if the first process happens to use direct IO while the
> second ends up doing a read-modify-write cycle through the page cache, the
> first write could end up being lost. I'll check whether something like this
> reproduces the corruption...

OK, when I run the attached test program like:

blkdev-dio-test /dev/loop0 0 &
blkdev-dio-test /dev/loop0 2048 &

one of them reports a lost write almost immediately. On a kernel with my fix
the test program runs for quite a while without problems.

Honza
-- 
Jan Kara 
SUSE Labs, CR
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

#define PAGE_SIZE 4096
#define SECT_SIZE 512
#define BUF_OFF (2*SECT_SIZE)

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDWR | O_DIRECT);
	int ret;
	char *buf;
	loff_t off;
	struct iovec iov[2];
	unsigned int seq;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	off = strtol(argv[2], NULL, 10);

	buf = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

	iov[0].iov_base = buf;
	iov[0].iov_len = SECT_SIZE;
	iov[1].iov_base = buf + BUF_OFF;
	iov[1].iov_len = SECT_SIZE;

	seq = 0;
	memset(buf, 0, PAGE_SIZE);
	while (1) {
		*(unsigned int *)buf = seq;
		*(unsigned int *)(buf + BUF_OFF) = seq;
		ret = pwritev(fd, iov, 2, off);
		if (ret < 0) {
			perror("pwritev");
			return 1;
		}
		if (ret != 2*SECT_SIZE) {
			fprintf(stderr, "Short pwritev: %d\n", ret);
			return 1;
		}
		ret = pread(fd, buf, PAGE_SIZE, off);
		if (ret < 0) {
			perror("pread");
			return 1;
		}
		if (ret != PAGE_SIZE) {
			fprintf(stderr, "Short read: %d\n", ret);
			return 1;
		}
		if (*(unsigned int *)buf != seq ||
		*(unsigned int *)(buf + SECT_SIZE) != seq) {
			printf("Lost write %u: %u %u\n", seq, *(unsigned int *)buf, *(unsigned int *)(buf + SECT_SIZE));
			return 1;
		}
		seq++;
	}

	return 0;
}
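
For reference, the program builds with a plain C compiler; the file name is
assumed:

	gcc -O2 -o blkdev-dio-test blkdev-dio-test.c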


Re: Silent data corruption in blkdev_direct_IO()

2018-07-18 Thread Jan Kara
On Wed 18-07-18 11:20:15, Johannes Thumshirn wrote:
> On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote:
> > Please go ahead and take care of it since you have the test cases.
> 
> Speaking of which, do we already know how it is triggered and can we
> cook up a blktests testcase for it? This would be more than helpful
> for all parties.

Using multiple iovecs with writev / readv trivially triggers the case of IO
that is done partly as direct and partly as buffered. Neither Martin nor I
was able to trigger the data corruption the customer is seeing with KVM,
though (the generic code tries to maintain data integrity even if the IO is
mixed). It should be possible to trigger the corruption by having two
processes write to the same PAGE_SIZE region of a block device, just at
different offsets: if the first process happens to use direct IO while the
second ends up doing a read-modify-write cycle through the page cache, the
first write could end up being lost. I'll check whether something like this
reproduces the corruption...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Silent data corruption in blkdev_direct_IO()

2018-07-18 Thread Johannes Thumshirn
On Wed, Jul 18, 2018 at 03:54:46PM +0800, Ming Lei wrote:
> Please go ahead and take care of it since you have the test cases.

Speaking of which, do we already know how it is triggered and can we
cook up a blktests testcase for it? This would be more than helpful
for all parties.

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: Silent data corruption in blkdev_direct_IO()

2018-07-18 Thread Ming Lei
On Wed, Jul 18, 2018 at 09:32:12AM +0200, Martin Wilck wrote:
> On Wed, 2018-07-18 at 10:48 +0800, Ming Lei wrote:
> > On Wed, Jul 18, 2018 at 02:07:28AM +0200, Martin Wilck wrote:
> > > 
> > > From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00 2001
> > > From: Martin Wilck 
> > > Date: Wed, 18 Jul 2018 01:56:37 +0200
> > > Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last iovec
> > > 
> > > If the last page of the bio is not "full", the length of the last
> > > vector bin needs to be corrected. This bin has the index
> > > (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper
> > > array which is shifted by the value of bio->bi_vcnt at function invocation.
> > > 
> > > Signed-off-by: Martin Wilck 
> > > ---
> > >  block/bio.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/block/bio.c b/block/bio.c
> > > index 53e0f0a..c22e76f 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > >   bv[0].bv_offset += offset;
> > >   bv[0].bv_len -= offset;
> > >   if (diff)
> > > - bv[bio->bi_vcnt - 1].bv_len -= diff;
> > > + bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff;
> > >  
> > >   iov_iter_advance(iter, size);
> > >   return 0;
> > 
> > Right, that is the issue, we need this fix for -stable, but maybe the
> > following fix is more readable:
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index f3536bfc8298..6e37b803755b 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -914,16 +914,16 @@ EXPORT_SYMBOL(bio_add_page);
> >   */
> >  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  {
> > -   unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > +   unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> > struct page **pages = (struct page **)bv;
> > -   size_t offset, diff;
> > +   size_t offset;
> > ssize_t size;
> >  
> > size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > if (unlikely(size <= 0))
> > return size ? size : -EFAULT;
> > -   nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
> > +   idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
> >  
> > /*
> >  * Deep magic below:  We need to walk the pinned pages backwards
> > @@ -936,17 +936,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > bio->bi_iter.bi_size += size;
> > bio->bi_vcnt += nr_pages;
> >  
> > -   diff = (nr_pages * PAGE_SIZE - offset) - size;
> > -   while (nr_pages--) {
> > -   bv[nr_pages].bv_page = pages[nr_pages];
> > -   bv[nr_pages].bv_len = PAGE_SIZE;
> > -   bv[nr_pages].bv_offset = 0;
> > +   while (idx--) {
> > +   bv[idx].bv_page = pages[idx];
> > +   bv[idx].bv_len = PAGE_SIZE;
> > +   bv[idx].bv_offset = 0;
> > }
> >  
> > bv[0].bv_offset += offset;
> > bv[0].bv_len -= offset;
> > -   if (diff)
> > -   bv[bio->bi_vcnt - 1].bv_len -= diff;
> > +   bv[nr_pages - 1].bv_len -= (nr_pages * PAGE_SIZE - offset) - size;
> >  
> > iov_iter_advance(iter, size);
> > return 0;
> > 
> > And for mainline, I suggest merging Christoph's new code, which is
> > easy to prove correct and seems simpler.
> 
> Fine with me. Will you take care of a submission, or should I?
> Btw, this is not the full fix for our data corruption issue yet.
> Another patch is needed which still needs testing.

Please go ahead and take care of it since you have the test cases.

thanks
Ming
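
To make the indexing bug that Martin's one-liner fixes concrete, a worked
example with assumed numbers:

/*
 * Worked example (the numbers are assumptions): suppose bio->bi_vcnt == 2
 * on entry and iov_iter_get_pages() pins 3 more pages:
 *
 *	bv = bio->bi_io_vec + 2;	-> helper array starts at slot 2
 *	bio->bi_vcnt += 3;		-> bi_vcnt is now 5
 *
 * The last newly added vec is bio->bi_io_vec[4], which is bv[2]. The old
 * code trimmed bv[bio->bi_vcnt - 1] == bv[4], i.e. bio->bi_io_vec[6],
 * past the vecs just filled in, so the short last page incorrectly kept
 * a full PAGE_SIZE bv_len.
 */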


Re: Silent data corruption in blkdev_direct_IO()

2018-07-18 Thread Martin Wilck
On Wed, 2018-07-18 at 10:48 +0800, Ming Lei wrote:
> On Wed, Jul 18, 2018 at 02:07:28AM +0200, Martin Wilck wrote:
> > 
> > From b75adc856119346e02126cf8975755300f2d9b7f Mon Sep 17 00:00:00 2001
> > From: Martin Wilck 
> > Date: Wed, 18 Jul 2018 01:56:37 +0200
> > Subject: [PATCH] block: bio_iov_iter_get_pages: fix size of last iovec
> > 
> > If the last page of the bio is not "full", the length of the last
> > vector bin needs to be corrected. This bin has the index
> > (bio->bi_vcnt - 1), but in bio->bi_io_vec, not in the "bv" helper
> > array which is shifted by the value of bio->bi_vcnt at function invocation.
> > 
> > Signed-off-by: Martin Wilck 
> > ---
> >  block/bio.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 53e0f0a..c22e76f 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -913,7 +913,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > bv[0].bv_offset += offset;
> > bv[0].bv_len -= offset;
> > if (diff)
> > -   bv[bio->bi_vcnt - 1].bv_len -= diff;
> > +   bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -= diff;
> >  
> > iov_iter_advance(iter, size);
> > return 0;
> 
> Right, that is the issue, we need this fix for -stable, but maybe the
> following fix is more readable:
> 
> diff --git a/block/bio.c b/block/bio.c
> index f3536bfc8298..6e37b803755b 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -914,16 +914,16 @@ EXPORT_SYMBOL(bio_add_page);
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> - unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> + unsigned short idx, nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>   struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>   struct page **pages = (struct page **)bv;
> - size_t offset, diff;
> + size_t offset;
>   ssize_t size;
>  
>   size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
>   if (unlikely(size <= 0))
>   return size ? size : -EFAULT;
> - nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
> + idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
>  
>   /*
>	 * Deep magic below:  We need to walk the pinned pages backwards
> @@ -936,17 +936,15 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>   bio->bi_iter.bi_size += size;
>   bio->bi_vcnt += nr_pages;
>  
> - diff = (nr_pages * PAGE_SIZE - offset) - size;
> - while (nr_pages--) {
> - bv[nr_pages].bv_page = pages[nr_pages];
> - bv[nr_pages].bv_len = PAGE_SIZE;
> - bv[nr_pages].bv_offset = 0;
> + while (idx--) {
> + bv[idx].bv_page = pages[idx];
> + bv[idx].bv_len = PAGE_SIZE;
> + bv[idx].bv_offset = 0;
>   }
>  
>   bv[0].bv_offset += offset;
>   bv[0].bv_len -= offset;
> - if (diff)
> - bv[bio->bi_vcnt - 1].bv_len -= diff;
> + bv[nr_pages - 1].bv_len -= (nr_pages * PAGE_SIZE - offset) - size;
>  
>   iov_iter_advance(iter, size);
>   return 0;
> 
> And for mainline, I suggest merging Christoph's new code, which is
> easy to prove correct and seems simpler.

Fine with me. Will you take care of a submission, or should I?
Btw, this is not the full fix for our data corruption issue yet.
Another patch is needed which still needs testing.

Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [PATCH 1/3] block: Fix a comment in a header file

2018-07-18 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850