Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-13 Thread t...@kernel.org
Hello, Bart.

On Thu, Apr 12, 2018 at 10:40:52PM +, Bart Van Assche wrote:
> Is something like the patch below perhaps what you had in mind?

Yeah, exactly.  It'd be really great to have the lockdep asserts added
to the right places.

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 06:56:26PM +, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 11:11 -0700, t...@kernel.org wrote:
> > * Right now, blk_queue_enter/exit() doesn't have any annotations.
> >   IOW, there's no way for paths which need enter locked to actually
> >   assert that.  If we're gonna protect more things with queue
> >   enter/exit, it gotta be annotated.
> 
> Hello Tejun,
> 
> One possibility is to check the count for the local CPU of q->q_usage_counter
> at runtime, e.g. using WARN_ON_ONCE(). However, that might have a runtime
> overhead. Another possibility is to follow the example of kernfs and to use
> a lockdep map and lockdep_assert_held(). There might be other alternatives I
> have not thought of.

Oh, I'd just do straight-forward lockdep annotation on
queue_enter/exit functions and provide the necessary assert helpers.
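As a rough userspace sketch of what such annotation plus an assert helper could look like: real lockdep would hang a `struct lockdep_map` off the queue and use acquire/release annotations (as kernfs does); here a thread-local nesting counter stands in so the idea is checkable. All names below are invented for illustration.

```c
#include <assert.h>

/* Userspace model of annotated queue enter/exit.  In the kernel this
 * would be rwsem_acquire_read()/rwsem_release() on a lockdep_map; a
 * thread-local depth counter plays that role here. */
static _Thread_local int queue_enter_depth;

static void model_queue_enter(void)
{
	queue_enter_depth++;		/* lockdep: rwsem_acquire_read(&map, ...) */
}

static void model_queue_exit(void)
{
	assert(queue_enter_depth > 0);
	queue_enter_depth--;		/* lockdep: rwsem_release(&map, ...) */
}

/* The assert helper that paths needing queue enter could call. */
static int model_assert_queue_entered(void)
{
	return queue_enter_depth > 0;	/* lockdep: lockdep_assert_held(&map) */
}
```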

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 04:29:09PM +, Bart Van Assche wrote:
> Any code that submits a bio or request needs blk_queue_enter() /
> blk_queue_exit() anyway. Please have a look at the following commit - you will
> see that that commit reduces the number of blk_queue_enter() / 
> blk_queue_exit()
> calls in the hot path:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=37f9579f4c31a6d698dbf3016d7bf132f9288d30

So, this can work, but it's still pretty fragile.

* Lock switching is fragile and we really should get rid of it.  This
  is very likely to come back and bite us.

* Right now, blk_queue_enter/exit() doesn't have any annotations.
  IOW, there's no way for paths which need enter locked to actually
  assert that.  If we're gonna protect more things with queue
  enter/exit, it gotta be annotated.

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 04:03:52PM +, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 08:37 -0700, Tejun Heo wrote:
> > On Thu, Apr 12, 2018 at 08:09:17AM -0600, Bart Van Assche wrote:
> > > I have retested hotunplugging by rerunning the srp-test software. It
> > > seems like you overlooked that this patch does not remove the
> > > blkcg_exit_queue() call from blk_cleanup_queue()? If a device is
> > > hotunplugged it is up to the block driver to call
> > > blk_cleanup_queue(). And blk_cleanup_queue() will call
> > > blkcg_exit_queue().
> > 
> > Hmm... what'd prevent blg_lookup_and_create() racing against that?
> 
> Hello Tejun,
> 
> Did you perhaps mean blkg_lookup_create()? That function has one caller,

Ah yeah, sorry about the sloppiness.

> namely blkcg_bio_issue_check(). The only caller of that function is
> generic_make_request_checks(). A patch was posted on the linux-block mailing
> list recently that surrounds that call with blk_queue_enter() / 
> blk_queue_exit().
> I think that prevents blkcg_exit_queue() from being called concurrently with
> blkg_lookup_create().

Yeah, that'd solve the problem for that particular path, but what
that's doing is adding another layer of refcnting which is used to
implement the revoke (or sever) semantics.  This is a fragile approach
tho: it isn't clear who's protecting what, there's nothing in blkg's
usage which suggests it'd be protected that way, and we're basically
working around a different underlying problem (lock switching) by
expanding the coverage of a different lock.

I'd much prefer fixing the lock switching problem properly than
working around that shortcoming this way.

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
On Thu, Apr 12, 2018 at 06:58:21AM -0700, t...@kernel.org wrote:
> Cool, can we just factor out the queue lock from those drivers?  It's
> just really messy to be switching locks like we do in the cleanup
> path.

So, looking at a couple drivers, it looks like all we'd need is a
struct which contains a spinlock and a refcnt.  Switching the drivers
to use that is a bit tedious but straight-forward and the block layer
can simply put that struct on queue release.
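A minimal userspace sketch of that struct, using a pthread mutex in place of a kernel spinlock and a C11 atomic for the refcount: the driver and every request_queue sharing the lock would each hold a reference, and the block layer drops its reference on queue release. The names are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Refcounted carrier for a lock shared between a driver and its queues. */
struct shared_queue_lock {
	pthread_mutex_t lock;
	atomic_int refcnt;
};

static struct shared_queue_lock *shared_queue_lock_alloc(void)
{
	struct shared_queue_lock *sql = malloc(sizeof(*sql));

	if (!sql)
		return NULL;
	pthread_mutex_init(&sql->lock, NULL);
	atomic_init(&sql->refcnt, 1);	/* initial ref owned by the driver */
	return sql;
}

static struct shared_queue_lock *shared_queue_lock_get(struct shared_queue_lock *sql)
{
	atomic_fetch_add(&sql->refcnt, 1);
	return sql;
}

/* Returns 1 when the last reference is dropped and the carrier is freed. */
static int shared_queue_lock_put(struct shared_queue_lock *sql)
{
	if (atomic_fetch_sub(&sql->refcnt, 1) == 1) {
		pthread_mutex_destroy(&sql->lock);
		free(sql);
		return 1;
	}
	return 0;
}
```

With this, the lock outlives whichever of the driver or the queue goes away first, which is the point of the proposal: no lock switching in the cleanup path.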

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 03:56:51PM +0200, h...@lst.de wrote:
> On Thu, Apr 12, 2018 at 06:48:12AM -0700, t...@kernel.org wrote:
> > > Which sounds like a very good reason not to use a driver controller
> > > lock for internals like blkcq.
> > > 
> > > In fact splitting the lock used for synchronizing access to queue
> > > fields from the driver controller lock used to synchronize I/O
> > > in the legacy path is long overdue.
> > 
> > It'd be probably a lot easier to make sure the shared lock doesn't go
> > away till all the request_queues using it go away.  The choice is
> > between refcnting something which carries the lock and double locking
> > in hot paths.  Can't think of many downsides of the former approach.
> 
> We've stopped sharing request_queues between different devices a while
> ago.  The problem is just that for legacy devices the driver still controls
> the lock life time, and it might be shorter than the queue lifetime.
> Note that for blk-mq we basically don't use the queue_lock at all,
> and certainly not in the hot path.

Cool, can we just factor out the queue lock from those drivers?  It's
just really messy to be switching locks like we do in the cleanup
path.

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 03:14:40PM +0200, h...@lst.de wrote:
> > At least the SCSI ULP drivers drop the last reference to a disk after
> > the blk_cleanup_queue() call. As explained in the description of commit
> > a063057d7c73, removing a request queue from blkcg must happen before
> > blk_cleanup_queue() finishes because a block driver may free the
> > request queue spinlock immediately after blk_cleanup_queue() returns.
> > So I don't think that we can move the code that removes a request
> > queue from blkcg into put_disk(). Another challenge is that some block
> > drivers (e.g. skd) clear the disk->queue pointer if device_add_disk()
> > has not been called to avoid that put_disk() causes a request queue
> > reference count imbalance.
> 
> Which sounds like a very good reason not to use a driver controller
> lock for internals like blkcq.
> 
> In fact splitting the lock used for synchronizing access to queue
> fields from the driver controller lock used to synchronize I/O
> in the legacy path is long overdue.

It'd be probably a lot easier to make sure the shared lock doesn't go
away till all the request_queues using it go away.  The choice is
between refcnting something which carries the lock and double locking
in hot paths.  Can't think of many downsides of the former approach.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-12 Thread t...@kernel.org
Hello, Israel.

On Thu, Apr 12, 2018 at 11:59:11AM +0300, Israel Rukshin wrote:
> On 4/12/2018 12:31 AM, t...@kernel.org wrote:
> >Hey, again.
> >
> >On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:
> >>Hello, Israel.
> >>
> >>On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> >>>>Just noticed this one, this looks interesting to me as well. Israel,
> >>>>can you run your test with this patch?
> >>>Yes, I just did and it looks good.
> >>Awesome.
> >Just to be sure, you tested the combined patch and saw the XXX debug
> >messages, right?
> 
> I am not sure I understand.
> What is the combined patch?
> What are the debug messages that I need to see?

There are total of three patches and I posted the combined patch + a
debug message in another post.  Can you please test the following
patch?  It'll print out something like the following (possibly many of
them).

 XXX blk_mq_timeout_reset_cleanup(): executed %d missed completions

Thanks.

Index: work/block/blk-mq.c
===
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(&rq->gstate_seq);
preempt_enable();
@@ -818,7 +821,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+   int *nr_resets, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+   (*nr_resets)++;
+   hctx->need_sync_rcu = true;
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
-   hctx->nr_expired++;
+   hctx->need_sync_rcu = true;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
+{
+   struct blk_mq_hw_ctx *hctx;
+   bool has_rcu = false;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (!hctx->need_sync_rcu)
+   continue;
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   has_rcu = true;
+   else
+   synchronize_srcu(hctx->srcu);
+
+   if (clear)
+   hctx->need_sync_rcu = false;
+   }
+   if (has_rcu)
+   synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
@@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq's timer reset has gone through rcu synchronization and is
+* visible now.  Allow normal completions again by resetting
+* ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* mem

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:
> Hello, Israel.
> 
> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> > >Just noticed this one, this looks interesting to me as well. Israel,
> > >can you run your test with this patch?
> > 
> > Yes, I just did and it looks good.
> 
> Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
On Wed, Apr 11, 2018 at 08:00:29PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote:
> > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> > > On 04/11/18 13:00, Alexandru Moise wrote:
> > > > But the root cause of it is in blkcg_init_queue() when blkg_create() 
> > > > returns
> > > > an ERR ptr, because it tries to insert into a populated index into 
> > > > blkcg->blkg_tree,
> > > > the entry that we fail to remove at __blk_release_queue().
> > > 
> > > Hello Alex,
> > > 
> > > Had you considered something like the untested patch below?
> > 
> > But queue init shouldn't fail here, right?
> 
> Hello Tejun,
> 
> Your question is not entirely clear to me. Are you referring to the atomic
> allocations in blkg_create() or are you perhaps referring to something else?

Hmm.. maybe I'm confused but I thought that the fact that
blkcg_init_queue() fails itself is already a bug, which happens
because a previously destroyed queue left behind blkgs.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> On 04/11/18 13:00, Alexandru Moise wrote:
> >But the root cause of it is in blkcg_init_queue() when blkg_create() returns
> >an ERR ptr, because it tries to insert into a populated index into 
> >blkcg->blkg_tree,
> >the entry that we fail to remove at __blk_release_queue().
> 
> Hello Alex,
> 
> Had you considered something like the untested patch below?

But queue init shouldn't fail here, right?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:26:12PM +, Bart Van Assche wrote:
> Please explain what you wrote further. It's not clear to me why you think
> that anything is broken nor what a "sever model" is.

So, sever (or revoke) model is where you actively disconnect an object
and ensuring that there'll be no further references from others.  In
contrast, what we usually do is refcnting, where we don't actively
sever the dying object from its users but let the users drain
themselves over time and destroy the object when we know all the users
are gone (refcnt reaching zero).
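The contrast between the two models can be sketched in a few lines of userspace C. This is a toy: the `obj_tryget()` below is not race-free against a concurrent sever the way the kernel's percpu_ref machinery is, and all names are invented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy object supporting both lifetime models discussed above. */
struct obj {
	atomic_int refcnt;	/* refcnt model: dies when this hits zero */
	atomic_bool dead;	/* sever model: owner disconnects first */
};

/* Sever-aware entry point: fails once the owner has disconnected @o. */
static int obj_tryget(struct obj *o)
{
	if (atomic_load(&o->dead))
		return 0;
	atomic_fetch_add(&o->refcnt, 1);
	return 1;
}

/* Returns 1 when the last reference is gone and @o may be destroyed. */
static int obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refcnt, 1) == 1;
}

/* The sever step: after this, every user path must cope with tryget
 * failing - the "explicit participation from all users" burden. */
static void obj_sever(struct obj *o)
{
	atomic_store(&o->dead, true);
}
```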

> I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
> to avoid race conditions between request queue cleanup and the block cgroup
> controller. Additionally, since it is guaranteed that no new requests will
> be submitted to a queue after it has been marked dead I don't see why it
> would make sense to keep the association between a request queue and the
> blkcg controller until the last reference on a queue is dropped.

Sure, new requests aren't the only source tho.  e.g. there can be
derefs through sysfs / cgroupfs or writeback attempts on dead devices.
If you want to implement sever, you gotta hunt down all those and make
sure they can't make no further derefs.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:06:41PM +, Bart Van Assche wrote:
> A simple and effective solution is to dissociate a request queue from the
> block cgroup controller before blk_cleanup_queue() returns. This is why commit
> a063057d7c73 ("block: Fix a race between request queue removal and the block
> cgroup controller") moved the blkcg_exit_queue() call from 
> __blk_release_queue()
> into blk_cleanup_queue().

which is broken.  We can try to switch the lifetime model to revoking
all live objects but that likely is a lot more involved than blindly
moving blkg shootdown from release to cleanup.  Implementing sever
semantics is usually a lot more involved / fragile because it requires
explicit participation from all users (exactly the same way revoking
->queue_lock is difficult).

I'm not necessarily against switching to sever model, but what the
patch did isn't that.  It just moved some code without actually
understanding or auditing what the implications are.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 04:42:55PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > And looking at the change, it looks like the right thing we should
> > have done is caching @lock on the print_blkg side and when switching
> > locks make sure both locks are held.  IOW, do the following in
> > blk_cleanup_queue()
> > 
> > spin_lock_irq(lock);
> > if (q->queue_lock != >__queue_lock) {
> > spin_lock(>__queue_lock);
> > q->queue_lock = >__queue_lock;
> > spin_unlock(>__queue_lock);
> > }
> > spin_unlock_irq(lock);
> > 
> > Otherwise, there can be two lock holders thinking they have exclusive
> > access to the request_queue.
> 
> I think that's a bad idea. A block driver is allowed to destroy the
> spinlock it associated with the request queue as soon as blk_cleanup_queue()
> has finished. If the block cgroup controller would cache a pointer to the
> block driver spinlock then that could cause the cgroup code to attempt to
> lock a spinlock after it has been destroyed. I don't think we need that kind
> of race conditions.

I see, but that problem is there with or without caching as long as we
have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
that user caches the lock for matching unlocking or not doesn't really
change the situation.

Short of adding protection around queue_lock switching, I can't think
of a solution tho.  Probably the right thing to do is adding queue
lock/unlock helpers which are safe to use beyond cleanup_queue.
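As a userspace model of the switch-with-both-locks-held idea quoted earlier in the thread: the queue carries a pointer to whichever lock currently protects it, and cleanup retargets that pointer to the queue-internal lock while holding both, so no two holders can simultaneously believe they own the queue. pthread mutexes stand in for spinlocks; names are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct model_queue {
	pthread_mutex_t *queue_lock;	/* may point at a driver lock */
	pthread_mutex_t __queue_lock;	/* queue-internal fallback */
};

static void model_queue_init(struct model_queue *q, pthread_mutex_t *driver_lock)
{
	pthread_mutex_init(&q->__queue_lock, NULL);
	q->queue_lock = driver_lock ? driver_lock : &q->__queue_lock;
}

/* Switch to the internal lock with both old and new locks held, so a
 * holder of the old lock and a holder of the new lock can never both
 * think they have exclusive access to @q during the transition. */
static void model_queue_cleanup_switch(struct model_queue *q)
{
	pthread_mutex_t *old = q->queue_lock;

	pthread_mutex_lock(old);
	if (old != &q->__queue_lock) {
		pthread_mutex_lock(&q->__queue_lock);
		q->queue_lock = &q->__queue_lock;
		pthread_mutex_unlock(&q->__queue_lock);
	}
	pthread_mutex_unlock(old);
}
```

Lock/unlock helpers that always dereference `q->queue_lock` at call time would then keep working after the switch, which is the "safe beyond cleanup_queue" property being asked for.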

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:24:49PM +0300, Sagi Grimberg wrote:
> 
> >Ah, yeah, I was moving it out of add_timer but forgot to actually add
> >it to the issue path.  Fixed patch below.
> >
> >BTW, no matter what we do w/ the request handover between normal and
> >timeout paths, we'd need something similar.  Otherwise, while we can
> >reduce the window, we can't get rid of it.
> >
> >Thanks.
> 
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?

The following is the combined patch with one debug message to see
whether missed completions are actually happening.

Thanks.

Index: work/block/blk-mq.c
===
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(&rq->gstate_seq);
preempt_enable();
@@ -818,7 +821,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request 
*req,
+   int *nr_resets, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+   (*nr_resets)++;
+   hctx->need_sync_rcu = true;
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
-   hctx->nr_expired++;
+   hctx->need_sync_rcu = true;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
+{
+   struct blk_mq_hw_ctx *hctx;
+   bool has_rcu = false;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (!hctx->need_sync_rcu)
+   continue;
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   has_rcu = true;
+   else
+   synchronize_srcu(hctx->srcu);
+
+   if (clear)
+   hctx->need_sync_rcu = false;
+   }
+   if (has_rcu)
+   synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
@@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq's timer reset has gone through rcu synchronization and is
+* visible now.  Allow normal completions again by resetting
+* ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
+*/
+   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+   blk_mq_rq_update_aborted_gstate(rq, 0);
+}
+
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   int cnt = 0;
+
+   /*
+* @rq is now fully returned to the normal path.  If normal
+ 

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hello, Bart.

On Wed, Apr 11, 2018 at 12:50:51PM +, Bart Van Assche wrote:
> Thank you for having shared this patch. It looks interesting to me. What I
> know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this patch
>   applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear when
>   the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
>   memory upon timeout"
>   (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
>   and also to a "RIP: scsi_times_out+0x17" crash during boot
>   (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
> 
> So we have the choice between two approaches:
> (1) apply the patch from your previous e-mail and root-cause and fix the
> crashes referred to above.
> (2) apply a patch that makes the crashes reported against v4.16 disappear and
> remove the atomic instructions introduced by such a patch at a later time.
> 
> Since crashes have been reported for kernel v4.16 I think we should follow
> approach (2). That will remove the time pressure from root-causing and fixing
> the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

So, it really bothers me how blind we're going about this.  It isn't
an insurmountable emergency that we have to adopt whatever solution
which passed a couple tests this minute.  We can debug and root cause
this properly and pick the right solution.  We even have two most
likely causes already analysed and patches proposed, one of them
months ago.  If we wanna change the handover model, let's do that
because the new one is better, not because of vague fear.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
On Tue, Apr 10, 2018 at 09:46:23PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 14:33 -0700, t...@kernel.org wrote:
> > +   else
> > +   rq->missed_completion = true;
> 
> In this patch I found code that sets rq->missed_completion but no code that
> clears it. Did I perhaps miss something?

Ah, yeah, I was moving it out of add_timer but forgot to actually add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.

---
 block/blk-mq.c |   45 -
 include/linux/blkdev.h |2 ++
 2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(&rq->gstate_seq);
preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
 
-   hctx->need_sync_rcu = false;
+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
 
/* terminate the ones we won */
blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 * reset racing against recycling.
 */
if (nr_resets) {
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, false);
+   blk_mq_qu

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 08:40:43AM -0700, t...@kernel.org wrote:
> On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> > I agree, the issue should be in driver's irq handler and .timeout in
> > theory.
> > 
> > For example, even though one request has been done by irq handler, .timeout
> > still may return RESET_TIMER.
> 
> blk-mq can use a separate flag to track normal completions during
> timeout and complete the request normally on RESET_TIMER if the flag
> is set while EH was in progress.  With a bit of care, we'd be able to
> plug the race completely.

So, something like the following.  Just compile tested.  The extra
dedicated field is kinda unfortunate but we need something like that
no matter how we do this because the involved completion marking
violates the ownership (we want to record the normal completion while
timeout handling is in progress).  Alternatively, we can add an atomic
flags field.  The addition doesn't change the size of struct request
because it fits in a hole but that hole seems removable, so...

Anyways, it'd be great if someone can test this combined with the
previous two rcu sync patches.

Thanks.
---
 block/blk-mq.c |   44 +++-
 include/linux/blkdev.h |2 ++
 2 files changed, 33 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -881,7 +883,7 @@ static void blk_mq_check_expired(struct
}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +898,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
 
-   hctx->need_sync_rcu = false;
+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +920,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
@@ -976,7 +991,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
 
/* terminate the ones we won */
blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -98

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> I agree, the issue should be in driver's irq handler and .timeout in
> theory.
> 
> For example, even though one request has been done by irq handler, .timeout
> still may return RESET_TIMER.

blk-mq can use a separate flag to track normal completions during
timeout and complete the request normally on RESET_TIMER if the flag
is set while EH was in progress.  With a bit of care, we'd be able to
plug the race completely.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello, Ming.

On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> +   if (time_after_eq(jiffies, deadline) &&
> +   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> +   blk_mq_rq_timed_out(rq, reserved);
> 
> Normal completion still can happen between blk_mq_change_rq_state()
> and blk_mq_rq_timed_out().
> 
> In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> the big window.

I don't think plugging this hole is all that difficult, but this
shouldn't lead to any critical failures.  If so, that'd be a driver
bug.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 02:30:26PM +, Bart Van Assche wrote:
> > Switching to another model might be better but let's please do that
> > with the right rationales.  A good portion of this seems to be built
> > on misunderstandings.
> 
> Which misunderstandings? I'm not aware of any misunderstandings at my side.
> Additionally, tests with two different block drivers (NVMeOF initiator and
> the SRP initiator driver) have shown that the current blk-mq timeout
> implementation with or without your two patches applied result in subtle and
> hard to debug crashes and/or memory corruption. That is not the case for the

I must have missed that part.  Which tests were they?

> patch at the start of this thread. The latest report of a crash I ran into
> myself and that is fixed by the patch at the start of this thread is
> available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.
> 
> Please also keep in mind that if this patch would be accepted that that does
> not prevent this patch to be replaced with an RCU-based solution later on.
> If anyone comes up any time with a reliably working RCU-based solution I
> will be happy to accept a revert of this patch and I will help reviewing that
> RCU-based solution.

Oh, switching is fine but let's get in sync first.  Who has the repro
cases and what was tested?

Thanks.

-- 
tejun


Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code

2018-04-09 Thread t...@kernel.org
Hello, Bart.

On Mon, Apr 09, 2018 at 09:30:27PM +, Bart Van Assche wrote:
> On Mon, 2018-04-09 at 11:56 -0700, t...@kernel.org wrote:
> > On Mon, Apr 09, 2018 at 05:03:05PM +, Bart Van Assche wrote:
> > > exist today in the blk-mq timeout handling code cannot be fixed completely
> > > using RCU only.
> > 
> > I really don't think that is that complicated.  Let's first confirm
> > the race fix and get to narrowing / closing that window.
> 
> Two months ago it was reported for the first time that commit 1d9bd5161ba3
> ("blk-mq: replace timeout synchronization with a RCU and generation based
> scheme") introduces a regression. Since that report nobody has posted a
> patch that fixes all races related to blk-mq timeout handling and that only

The two patches using RCU were posted a long time ago.  It was just
that the repro that only you had at the time didn't work anymore so we
couldn't confirm the fix.  If we now have a different repro, awesome.
Let's see whether the fix works.

> uses RCU. If you want to continue working on this that's fine with me. But
> since my opinion is that it is impossible to fix these races using RCU only
> I will continue working on an alternative approach. See also "[PATCH]
> blk-mq: Fix a race between resetting the timer and completion handling"
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html).

ISTR discussing that patch earlier.  Didn't the RCU based fix get
posted after that discussion?

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 10:09:18PM +, Bart Van Assche wrote:
> Please elaborate what your long-term goal is for the blk-mq timeout handler.

Hmm... I don't really have any plans beyond what's been posted.

> The legacy block layer suspends request state changes while a timeout is
> being processed by holding the request queue lock while requests are being
> processed, while processing a timeout and while calling q->rq_timed_out_fn(rq).
> Do you think it is possible to make the blk-mq core suspend request processing
> while processing a timeout without introducing locking in
> blk_mq_complete_request()? If you do not plan to add locking in
> blk_mq_complete_request(), do you think it is possible to fix all the races we
> discussed in previous e-mails?

I don't know of multiple race conditions.  What am I missing?  AFAIK,
there's one non-critical race condition which has always been there.
We have a larger race window for that case but don't yet know whether
that's problematic or not.  If that actually is problematic, we can
figure out a way to solve that but such effort / added complexity
doesn't seem justified yet.  No?

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 09:56:41PM +, Bart Van Assche wrote:
> This patch increases the time during which .aborted_gstate == .gstate if the
> timeout is reset. Does that increase the chance that a completion will be
> missed if the request timeout is reset?

It sure does, but we're comparing an outright kernel bug vs. an
inherently opportunistic mechanism being a bit more lossy.  I think
the answer is pretty clear.

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 09:31:34PM +, Bart Van Assche wrote:
> > > > +* As nothing prevents from completion happening while
> > > > +* ->aborted_gstate is set, this may lead to ignored completions
> > > > +* and further spurious timeouts.
> > > > +*/
> > > > +   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > > > +   blk_mq_rq_update_aborted_gstate(rq, 0);
...
> I think it can happen that the above code sees that (rq->rq_flags &
> RQF_MQ_TIMEOUT_RESET) != 0, that blk_mq_start_request() executes the
> following code:
> 
>   blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>   blk_add_timer(rq);
> 
> and that subsequently blk_mq_rq_update_aborted_gstate(rq, 0) is called,
> which will cause the next completion to be lost. Is fixing one occurrence
> of a race and reintroducing it in another code path really an improvement?

I'm not following at all.  How would blk_mq_start_request() get called
on the request while it's still owned by the timeout handler?  That
gstate clearing is what transfers the ownership back to the
non-timeout path.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-21 Thread t...@kernel.org
Hello, Bart.

On Wed, Feb 21, 2018 at 06:53:05PM +, Bart Van Assche wrote:
> On Sun, 2018-02-18 at 05:11 -0800, t...@kernel.org wrote:
> > On Wed, Feb 14, 2018 at 04:58:56PM +, Bart Van Assche wrote:
> > > With this patch applied the tests I ran so far pass.
> > 
> > Ah, great to hear.  Thanks a lot for testing.  Can you please verify
> > the following?  It's the same approach but with RCU sync batching.
> 
> Hello Tejun,
> 
> After having merged kernel v4.16-rc2 into my kernel tree and after having
> applied patch "Avoid that ATA error handling hangs"
> (https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg71145.html) I
> have not been able to reproduce the reported crash - neither with the patch
> applied that was posted on February 13th nor without that patch. This makes
> me wonder whether we should drop the discussed patches unless someone comes
> up with a reproducible test case?

It is an actual bug in that we can override the timer setting
of the next request instance.  Given that the race window isn't that
large, it makes sense that the reproducibility is affected by
butterflies.  I think it makes sense to fix the bug.  Any chance you
can test the new patch on top of the reproducible setup?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-18 Thread t...@kernel.org
Hello, Bart.

On Wed, Feb 14, 2018 at 04:58:56PM +, Bart Van Assche wrote:
> With this patch applied the tests I ran so far pass.

Ah, great to hear.  Thanks a lot for testing.  Can you please verify
the following?  It's the same approach but with RCU sync batching.

Thanks.

Index: work/block/blk-mq.c
===
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -816,7 +816,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+   int *nr_resets, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -831,13 +832,10 @@ static void blk_mq_rq_timed_out(struct r
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+   (*nr_resets)++;
+   hctx->need_sync_rcu = true;
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -874,13 +872,34 @@ static void blk_mq_check_expired(struct
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
-   hctx->nr_expired++;
+   hctx->need_sync_rcu = true;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+{
+   struct blk_mq_hw_ctx *hctx;
+   bool has_rcu = false;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (!hctx->need_sync_rcu)
+   continue;
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   has_rcu = true;
+   else
+   synchronize_srcu(hctx->srcu);
+
+   hctx->need_sync_rcu = false;
+   }
+   if (has_rcu)
+   synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
@@ -893,7 +912,25 @@ static void blk_mq_terminate_expired(str
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq's timer reset has gone through rcu synchronization and is
+* visible now.  Allow normal completions again by resetting
+* ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+* there's no memory barrier around ->aborted_gstate.  Let
+* blk_add_timer() clear it later.
+*
+* As nothing prevents from completion happening while
+* ->aborted_gstate is set, this may lead to ignored completions
+* and further spurious timeouts.
+*/
+   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+   blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
@@ -928,7 +965,7 @@ static void blk_mq_timeout_work(struct w
	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
if (data.nr_expired) {
-   bool has_rcu = false;
+   int nr_resets = 0;
 
/*
 * Wait till everyone sees ->aborted_gstate.  The
@@ -936,22 +973,22 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   queue_for_each_hw_ctx(q, hctx, i) {
-   if (!hctx->nr_expired)
-   continue;
-
-   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-   has_rcu = true;
-   else
-   synchronize_srcu(hctx->srcu);
-
-   hctx->nr_expired = 0;
-   }
-   if (has_rcu)
-   synchronize_rcu();
+   blk_mq_timeout_sync_rcu(q);
 
/* terminate the ones we won */
-   blk_mq_queue_tag_busy_iter(q, 

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-13 Thread t...@kernel.org
Hello, Bart.

Sorry about the delay.

On Thu, Feb 08, 2018 at 04:31:43PM +, Bart Van Assche wrote:
> The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The
> instruction at that address tries to dereference scsi_cmnd.device (%rax). The
> register dump shows that that pointer has the value NULL. The only function I
> know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only
> caller of that function in the SCSI core is scsi_initialize_rq(). That function
> has two callers, namely scsi_init_command() and blk_get_request(). However,
> the scsi_cmnd.device pointer is not cleared when a request finishes. This is
> why I think that the above crash report indicates that scsi_times_out() was
> called for a request that was being reinitialized and not by device hotplugging.

Can you please give the following patch a shot?  While the timeout path
synchronizes against the completion path (and the following re-init)
when taking back control of a timed-out request, it wasn't doing that
while giving it back, so the timer registration could race against
completion and re-issue.  I'm still not quite sure how that can lead
to the oops tho.  Anyways, we need something like this one way or the
other.

This isn't the final patch.  We should add batching-up of rcu
synchronize calls similar to the abort path.

Thanks.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..b66aec3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -816,7 +816,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+   bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -836,8 +837,12 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 * ->aborted_gstate is set, this may lead to ignored
 * completions and further spurious timeouts.
 */
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   synchronize_rcu();
+   else
+   synchronize_srcu(hctx->srcu);
+   blk_mq_rq_update_aborted_gstate(req, 0);
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -893,7 +898,7 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, reserved);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)



Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
On Thu, Feb 08, 2018 at 05:48:32PM +, Bart Van Assche wrote:
> On Thu, 2018-02-08 at 09:40 -0800, t...@kernel.org wrote:
> > Heh, sorry about not being clear.  What I'm trying to say is that
> > scmd->device != NULL && device->host == NULL.  Or was this what you
> > were saying all along?
> 
> What I agree with is that the request pointer (req argument) is stored in %rdi
> and that mov 0x1b0(%rdi),%rax loads scmd->device into %rax. Since RIP ==
> scsi_times_out+0x17, since the instruction at that address tries to dereference
> %rax and since the register dump shows that %rax == NULL I think that means that
> scmd->device == NULL.

Ah, I was completely confused about which one had to be NULL.  You're
absolutely right.  Let's continue earlier in the thread.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
On Thu, Feb 08, 2018 at 05:37:46PM +, Bart Van Assche wrote:
> On Thu, 2018-02-08 at 09:19 -0800, t...@kernel.org wrote:
> > Hello, Bart.
> > 
> > On Thu, Feb 08, 2018 at 05:10:45PM +, Bart Van Assche wrote:
> > > I think "dereferencing a pointer" means reading the memory location that
> > > pointer points at? Anyway, I think we both interpret the crash report in
> > > the same way, namely that it means that scmd->device == NULL.
> > 
> > No, what I'm trying to say is that it's the device->host reference
> > not the scmd->device reference.
> 
> Again, I think that means that we agree about the interpretation of the crash
> report. The big question here is what the next step should be to analyze this
> further and/or to avoid that this crash can occur?

Heh, sorry about not being clear.  What I'm trying to say is that
scmd->device != NULL && device->host == NULL.  Or was this what you
were saying all along?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
Hello, Bart.

On Thu, Feb 08, 2018 at 05:10:45PM +, Bart Van Assche wrote:
> I think "dereferencing a pointer" means reading the memory location that
> pointer points at? Anyway, I think we both interpret the crash report in the
> same way, namely that it means that scmd->device == NULL.

No, what I'm trying to say is that it's the device->host reference
not the scmd->device reference.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
Hello, Bart.

On Thu, Feb 08, 2018 at 04:31:43PM +, Bart Van Assche wrote:
> > That sounds more like a scsi hotplug bug than an issue in the timeout
> > code unless we messed up @req pointer to begin with.
> 
> I don't think that this is related to SCSI hotplugging: this crash does not
> occur with the v4.15 block layer core and it does not occur with my timeout
> handler rework patch applied either. I think that means that we cannot
> exclude the block layer core timeout handler rework as a possible cause.
> 
> The disassembler output is as follows:
> 
> (gdb) disas /s scsi_times_out
> Dump of assembler code for function scsi_times_out:
> drivers/scsi/scsi_error.c:
> 282 {
>0x5bd0 <+0>: push   %r13
>0x5bd2 <+2>: push   %r12
>0x5bd4 <+4>: push   %rbp
> ./include/linux/blk-mq.h:
> 300 return rq + 1;
>0x5bd5 <+5>: lea0x178(%rdi),%rbp
> drivers/scsi/scsi_error.c:
> 282 {
>0x5bdc <+12>:push   %rbx
> 283 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
> 284 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
> 285 struct Scsi_Host *host = scmd->device->host;
>0x5bdd <+13>:mov0x1b0(%rdi),%rax
> 282 {
>0x5be4 <+20>:mov%rdi,%rbx
> 283 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
> 284 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
> 285 struct Scsi_Host *host = scmd->device->host;
>0x5be7 <+23>:mov(%rax),%r13
>0x5bea <+26>:nopl   0x0(%rax,%rax,1)
> [ ... ]
> (gdb) print /x sizeof(struct request)
> $2 = 0x178
> (gdb) print &(((struct scsi_cmnd*)0)->device)
> $4 = (struct scsi_device **) 0x38 
> (gdb) print &(((struct scsi_device*)0)->host)   
> $5 = (struct Scsi_Host **) 0x0
> 
> The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The
> instruction at that address tries to dereference scsi_cmnd.device (%rax). The
> register dump shows that that pointer has the value NULL. The only function I
> know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only
> caller of that function in the SCSI core is scsi_initialize_rq(). That function
> has two callers, namely scsi_init_command() and blk_get_request(). However,
> the scsi_cmnd.device pointer is not cleared when a request finishes. This is
> why I think that the above crash report indicates that scsi_times_out() was
> called for a request that was being reinitialized and not by device hotplugging.

I could be misreading it but scsi_cmnd->device dereference should be
the following.

0x5bdd <+13>:mov0x1b0(%rdi),%rax

%rdi is @req, 0x1b0(%rdi) seems to be the combined arithmetic of
blk_mq_rq_to_pdu() and ->device dereference - 0x178 + 0x38.  The
faulting access is (%rax), which is deref'ing host from device.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
On Thu, Feb 08, 2018 at 07:39:40AM -0800, t...@kernel.org wrote:
> That sounds more like a scsi hotplug but than an issue in the timeout
   ^bug
> code unless we messed up @req pointer to begin with.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
On Thu, Feb 08, 2018 at 01:09:57AM +, Bart Van Assche wrote:
> On Wed, 2018-02-07 at 23:48 +, Bart Van Assche wrote:
> > With this patch applied I see requests for which it seems like the timeout handler
> > did not get invoked: [ ... ]
> 
> I just noticed the following in the system log, which is probably the reason
> why some requests got stuck:
> 
> Feb  7 15:16:26 ubuntu-vm kernel: BUG: unable to handle kernel NULL pointer dereference at   (null)
> Feb  7 15:16:26 ubuntu-vm kernel: IP: scsi_times_out+0x17/0x2c0 [scsi_mod]
> Feb  7 15:16:26 ubuntu-vm kernel: PGD 0 P4D 0  
> Feb  7 15:16:26 ubuntu-vm kernel: Oops:  [#1] PREEMPT SMP
> Feb  7 15:16:26 ubuntu-vm kernel: Modules linked in: dm_service_time ib_srp 
> libcrc32c crc32c_generic scsi_transport_srp target_core_pscsi 
> target_core_file ib_srpt target_core_iblock target_core_mod
> rdma_rxe ip6_udp_tunnel udp_tunnel ib_umad ib_uverbs scsi_debug brd 
> mq_deadline kyber_iosched deadline_iosched cfq_iosched bfq crct10dif_pclmul 
> crc32_pclmul af_packet ghash_clmulni_intel pcbc
> aesni_intel aes_x86_64 crypto_simd cryptd glue_helper serio_raw 
> virtio_console virtio_balloon sg button i2c_piix4 dm_multipath dm_mod dax 
> scsi_dh_rdac scsi_dh_emc scsi_dh_alua ib_iser rdma_cm iw_cm
> ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 
> ip_tables x_tables ipv6 autofs4 ext4 crc16 mbcache jbd2 sd_mod virtio_net 
> virtio_blk virtio_scsi sr_mod cdrom ata_generic
> pata_acpi psmouse crc32c_intel i2c_core atkbd
> Feb  7 15:16:26 ubuntu-vm kernel: uhci_hcd ehci_hcd intel_agp ata_piix 
> intel_gtt agpgart libata virtio_pci usbcore virtio_ring virtio scsi_mod 
> usb_common unix
> Feb  7 15:16:26 ubuntu-vm kernel: CPU: 0 PID: 146 Comm: kworker/0:1H Not 
> tainted 4.15.0-dbg+ #1
> Feb  7 15:16:26 ubuntu-vm kernel: Hardware name: QEMU Standard PC (i440FX + 
> PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Feb  7 15:16:26 ubuntu-vm kernel: Workqueue: kblockd blk_mq_timeout_work
> Feb  7 15:16:26 ubuntu-vm kernel: RIP: 0010:scsi_times_out+0x17/0x2c0 
> [scsi_mod]
> Feb  7 15:16:26 ubuntu-vm kernel: RSP: 0018:98f0c02abd58 EFLAGS: 00010246
> Feb  7 15:16:26 ubuntu-vm kernel: RAX:  RBX: 965de2b3a2c0 
> RCX: 
> Feb  7 15:16:26 ubuntu-vm kernel: RDX: c0094740 RSI:  
> RDI: 965de2b3a2c0
> Feb  7 15:16:26 ubuntu-vm kernel: RBP: 965de2b3a438 R08: fffc 
> R09: 0007
> Feb  7 15:16:26 ubuntu-vm kernel: R10:  R11:  
> R12: 0002
> Feb  7 15:16:26 ubuntu-vm kernel: R13:  R14: 965de2a44218 
> R15: 965de2a48728
> Feb  7 15:16:26 ubuntu-vm kernel: FS:  () 
> GS:965dffc0() knlGS:
> Feb  7 15:16:26 ubuntu-vm kernel: CS:  0010 DS:  ES:  CR0: 
> 80050033
> Feb  7 15:16:26 ubuntu-vm kernel: CR2:  CR3: 3ce0f003 
> CR4: 003606f0
> Feb  7 15:16:26 ubuntu-vm kernel: DR0:  DR1:  
> DR2: 
> Feb  7 15:16:26 ubuntu-vm kernel: DR3:  DR6: fffe0ff0 
> DR7: 0400
> Feb  7 15:16:26 ubuntu-vm kernel: Call Trace:
> Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_terminate_expired+0x42/0x80
> Feb  7 15:16:26 ubuntu-vm kernel: bt_iter+0x3d/0x50
> Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_queue_tag_busy_iter+0xe9/0x200
> Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_timeout_work+0x181/0x2e0
> Feb  7 15:16:26 ubuntu-vm kernel: process_one_work+0x21c/0x6d0
> Feb  7 15:16:26 ubuntu-vm kernel: worker_thread+0x35/0x380
> Feb  7 15:16:26 ubuntu-vm kernel: kthread+0x117/0x130
> Feb  7 15:16:26 ubuntu-vm kernel: ret_from_fork+0x24/0x30
> Feb  7 15:16:26 ubuntu-vm kernel: Code: ff ff 0f ff e9 fd fe ff ff 90 66 2e 
> 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 8d af 78 01 00 00 53 48 8b 87 b0 01 
> 00 00 48 89 fb <4c> 8b 28 0f 1f 44 00 00 65
> 8b 05 6a b5 f8 3f 83 f8 0f 0f 87 ed  
> Feb  7 15:16:26 ubuntu-vm kernel: RIP: scsi_times_out+0x17/0x2c0 [scsi_mod] 
> RSP: 98f0c02abd58
> Feb  7 15:16:26 ubuntu-vm kernel: CR2: 
> Feb  7 15:16:26 ubuntu-vm kernel: ---[ end trace ce6c20d95bab450e ]---

So, that's dereferencing %rax which is NULL.  That gotta be the ->host
deref in the following.

  enum blk_eh_timer_return scsi_times_out(struct request *req)
  {
  struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
  enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
  struct Scsi_Host *host = scmd->device->host;
  ...

That sounds more like a scsi hotplug but than an issue in the timeout
code unless we messed up @req pointer to begin with.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello,

On Wed, Feb 07, 2018 at 09:02:21PM +, Bart Van Assche wrote:
> The patch that I used in my test had an smp_wmb() call (see also below). Anyway,
> I will see whether I can extract more state information through debugfs.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ef4f6df0f1df..8eb2105d82b7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -827,13 +827,9 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
> - /*
> -  * As nothing prevents from completion happening while
> -  * ->aborted_gstate is set, this may lead to ignored
> -  * completions and further spurious timeouts.
> -  */
> - blk_mq_rq_update_aborted_gstate(req, 0);
>   blk_add_timer(req);
> + smp_wmb();
> + blk_mq_rq_update_aborted_gstate(req, 0);

Without the matching rmb, just adding rmb won't do much but given the
default strong ordering on x86 and other operations around, what you
were seeing is probably not caused by lack of barriers.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello,

On Wed, Feb 07, 2018 at 07:03:56PM +, Bart Van Assche wrote:
> I tried the above patch but already during the first iteration of the test I
> noticed that the test hung, probably due to the following request that got stuck:
> 
> $ (cd /sys/kernel/debug/block && grep -aH . */*/*/rq_list)
> a98cff60 {.op=SCSI_IN, .cmd_flags=, 
> .rq_flags=MQ_INFLIGHT|PREEMPT|QUIET|IO_STAT|PM,
>  .state=idle, .tag=22, .internal_tag=-1, .cmd=Synchronize Cache(10) 35 00 00 
> 00 00 00, .retries=0,
>  .result = 0x0, .flags=TAGGED, .timeout=60.000, allocated 872.690 s ago}

I'm wondering how this happened.  So, we can lose a completion when it
races against BLK_EH_RESET_TIMER; however, the command should time out
later because the timer is running again now.  Maybe we actually had the
memory barrier race that you pointed out in the other message?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello, Bart.

On Wed, Feb 07, 2018 at 06:14:13PM +, Bart Van Assche wrote:
> When I wrote my comment I was not sure whether or not non-reentrancy is
> guaranteed for work queue items. However, according to what I found in the
> workqueue implementation I think that is guaranteed. So it shouldn't be
> possible that the timer activated by blk_add_timer() gets handled before
> aborted_gstate is reset. But since the timeout handler and completion

Yeah, we're basically single threaded in the timeout path.

> handlers can be executed by different CPUs, shouldn't a memory barrier be
> inserted between the blk_add_timer() call and resetting aborted_gsync to
> guarantee that a completion cannot occur before blk_add_timer() has reset
> RQF_MQ_TIMEOUT_EXPIRED?

Ah, you're right.  u64_stats_sync doesn't imply barriers, so we want
something like the following.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..d6edf3b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -593,7 +593,7 @@ static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 */
local_irq_save(flags);
	u64_stats_update_begin(&rq->aborted_gstate_sync);
-   rq->aborted_gstate = gstate;
+   smp_store_release(&rq->aborted_gstate, gstate);
	u64_stats_update_end(&rq->aborted_gstate_sync);
local_irq_restore(flags);
 }
@@ -605,7 +605,7 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 
do {
	start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-   aborted_gstate = rq->aborted_gstate;
+   aborted_gstate = smp_load_acquire(&rq->aborted_gstate);
	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
 
return aborted_gstate;
@@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 * ->aborted_gstate is set, this may lead to ignored
 * completions and further spurious timeouts.
 */
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   blk_mq_rq_update_aborted_gstate(req, 0);
break;
case BLK_EH_NOT_HANDLED:
break;


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello, Bart.

On Wed, Feb 07, 2018 at 05:27:10PM +, Bart Van Assche wrote:
> Even with the above change I think that there is still a race between the
> code that handles timer resets and the completion handler. Anyway, the test

Can you elaborate the scenario a bit further?  If you're referring to
lost completions, we've always had that and while we can try to close
that hole too, I don't think it's a critical issue.

> with which I triggered these races is as follows:
> - Start from what will become kernel v4.16-rc1 and apply the patch that adds
>   SRP over RoCE support to the ib_srpt driver. See also the "[PATCH v2 00/14]
>   IB/srpt: Add RDMA/CM support" patch series
>   (https://www.spinics.net/lists/linux-rdma/msg59589.html).
> - Apply my patch series that fixes a race between the SCSI error handler and
>   SCSI transport recovery.
> - Apply my patch series that improves the stability of the SCSI target core
>   (LIO).
> - Build and install that kernel.
> - Clone the following repository: https://github.com/bvanassche/srp-test.
> - Run the following test:
>   while true; do srp-test/run_tests -c -t 02-mq; done
> - While the test is running, check whether or not something weird happens.
>   Sometimes I see that scsi_times_out() crashes. Sometimes I see while running
>   this test that a soft lockup is reported inside blk_mq_do_dispatch_ctx().
> 
> If you want I can share the tree on github that I use myself for my tests.

Heh, that's quite a bit.  I'll take up on that git tree later but for
now I'd really appreciate if you can test the patch.

Thank you very much.

-- 
tejun


Re: [PATCH] blk-mq: Fix a race between resetting the timer and completion handling

2018-02-05 Thread t...@kernel.org
Hello, Bart.

On Mon, Feb 05, 2018 at 09:33:03PM +, Bart Van Assche wrote:
> My goal with this patch is to fix the race between resetting the timer and
> the completion path. Hence change (3). Changes (1) and (2) are needed to
> make the changes in blk_mq_rq_timed_out() work.

Ah, I see.  That makes sense.  Can I ask you to elaborate the scenario
you were fixing?

> > > @@ -831,13 +834,12 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> > >   __blk_mq_complete_request(req);
> > >   break;
> > >   case BLK_EH_RESET_TIMER:
> > > - /*
> > > -  * As nothing prevents from completion happening while
> > > -  * ->aborted_gstate is set, this may lead to ignored
> > > -  * completions and further spurious timeouts.
> > > -  */
> > > - blk_mq_rq_update_aborted_gstate(req, 0);
> > > + local_irq_disable();
> > > + write_seqcount_begin(&req->gstate_seq);
> > >   blk_add_timer(req);
> > > + req->aborted_gstate = 0;
> > > + write_seqcount_end(&req->gstate_seq);
> > > + local_irq_enable();
> > >   break;
> > 
> > So, this is #3 and I'm not sure how adding gstate_seq protection gets
> > rid of the race condition mentioned in the comment.  It's still the
> > same that nothing is protecting against racing w/ completion.
> 
> I think you are right. I will see whether I can rework this patch to address
> that race.

That race is harmless and has always been there tho.  It only happens
when the actual completion coincides with timeout expiring, which is
very unlikely, and the only downside is that the completion gets lost
and the request will get timed out down the line.  It'd of course be
better to close the race window but code simplicity likely is an
important trade-off factor here.

Thanks.

-- 
tejun


Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-09 Thread t...@kernel.org
Hello, Bart.

On Tue, Jan 09, 2018 at 04:12:40PM +, Bart Van Assche wrote:
> I'm concerned about the additional CPU cycles needed for the new 
> blk_mq_map_queue()
> call, although I know this call is cheap. Would the timeout code really get 
> that

So, if that is really a concern, let's cache that mapping instead of
changing synchronization rules for that.

> much more complicated if the hctx_lock() and hctx_unlock() calls would be 
> changed
> into rcu_read_lock() and rcu_read_unlock() calls? Would it be sufficient if
> "if (has_rcu) synchronize_rcu();" would be changed into "synchronize_rcu();" 
> in
> blk_mq_timeout_work()?

Code-wise, it won't be too much extra code but I think diverging the
sync methods between issue and completion paths is more fragile and
likely to invite confusions and mistakes in the future.  We have the
normal path (issue) synchronizing against the exception
path (timeout).  I think it's best to keep the sync constructs aligned
with that conceptual picture.

Thanks.

-- 
tejun


Re: [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 10:10:01PM +, Bart Van Assche wrote:
> Other req->deadline writes are protected by preempt_disable(),
> write_seqcount_begin(&rq->gstate_seq), write_seqcount_end(&rq->gstate_seq)
> and preempt_enable(). I think it's fine that the above req->deadline store
> does not have that protection but I also think that that deserves a comment.

Will add.

Thanks.

-- 
tejun


Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 11:29:11PM +, Bart Van Assche wrote:
> Does "gstate" perhaps stand for "generation number and state"? If so, please
> mention this in one of the above comments.

Yeah, will do.

Thanks.

-- 
tejun


Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 09:06:55PM +, Bart Van Assche wrote:
> On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> > +static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
> > +{
> > +   unsigned long flags;
> > +
> > +   local_irq_save(flags);
> > +   u64_stats_update_begin(&rq->aborted_gstate_sync);
> > +   rq->aborted_gstate = gstate;
> > +   u64_stats_update_end(&rq->aborted_gstate_sync);
> > +   local_irq_restore(flags);
> > +}
> 
> Please add a comment that explains the purpose of local_irq_save() and
> local_irq_restore(). Please also explain why you chose to disable interrupts

Will do.

> instead of disabling preemption. I think that disabling preemption should be
> sufficient since this is the only code that updates rq->aborted_gstate and
> since this function is always called from thread context.

blk_mq_complete_request() can read it from the irq context.  If that
happens between update_begin and end, it'll end up looping infinitely.

> > @@ -801,6 +840,12 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > +   /*
> > +* As nothing prevents from completion happening while
> > +* ->aborted_gstate is set, this may lead to ignored
> > +* completions and further spurious timeouts.
> > +*/
> > +   blk_mq_rq_update_aborted_gstate(req, 0);
> > blk_add_timer(req);
> > blk_clear_rq_complete(req);
> > break;
> 
> Is the race that the comment refers to addressed by one of the later patches?

No, but it's not a new race.  It has always been there and I suppose
doesn't lead to practical problems - the race window is pretty small
and the effect isn't critical.  I'm just documenting the existing race
condition.  Will note that in the description.

Thanks.

-- 
tejun


Re: [PATCHSET v2] blk-mq: reimplement timeout handling

2017-12-20 Thread t...@kernel.org
On Wed, Dec 20, 2017 at 11:41:02PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > Currently, blk-mq timeout path synchronizes against the usual
> > issue/completion path using a complex scheme involving atomic
> > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > rules.  Unfortunatley, it contains quite a few holes.
> 
> Hello Tejun,
> 
> An attempt to run SCSI I/O with this patch series applied resulted in
> the following:

Can you please try the v3?  There were a couple bugs that I missed
while testing earlier versions.

 http://lkml.kernel.org/r/20171216120726.517153-1...@kernel.org

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-15 Thread t...@kernel.org
Hello, Peter.

On Thu, Dec 14, 2017 at 09:20:42PM +0100, Peter Zijlstra wrote:
> But now that I look at this again, TJ, why can't the below happen?
> 
>   write_seqlock_begin();
>   blk_mq_rq_update_state(rq, IN_FLIGHT);
>   blk_add_timer(rq);
>   
>   read_seqcount_begin()
>   while (seq & 1)
>   cpurelax();
>   // live-lock
>   
>   write_seqlock_end();

Ah, you're right.  For both gstate_seq and aborted_gstate_sync, we can
push all synchronization to the timeout side - ie. gstate_seq read can
yield, pause or synchronize_rcu and the aborted_gstate_sync can
disable irq around update.

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-15 Thread t...@kernel.org
Hello, Bart.

On Thu, Dec 14, 2017 at 09:13:32PM +, Bart Van Assche wrote:
...
> however is called before every use of a request. Sorry but I'm not
> enthusiast about the code in blk_rq_init() that reinitializes state
> information that should survive request reuse.

If it wasn't clear, me neither.  I think what'd be better is making
those paths use the usual request allocation path instead of custom
one but for flush handling, that's not gonna be trivial, so let's deal
with that later.

Thanks.

-- 
tejun


Re: [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path

2017-12-14 Thread t...@kernel.org
Hello,

On Thu, Dec 14, 2017 at 06:56:55PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> >  void blk_abort_request(struct request *req)
> >  {
> > -   if (blk_mark_rq_complete(req))
> > -   return;
> >  
> > if (req->q->mq_ops) {
> > -   blk_mq_rq_timed_out(req, false);
> > +   req->deadline = jiffies;
> > +   mod_timer(&req->q->timeout, 0);
> > } else {
> > +   if (blk_mark_rq_complete(req))
> > +   return;
> > blk_delete_timer(req);
> > blk_rq_timed_out(req);
> > }
> 
> This patch makes blk_abort_request() asynchronous for blk-mq. Have all callers
> been audited to verify whether this change is safe?

I *think* so.  For all the ata-related parts, I know they are.  For
mtip32xx and dasd_ioctl it seems safe, but I can't tell for sure.
Will cc the respective maintainers on the next posting.

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread t...@kernel.org
Hello,

On Thu, Dec 14, 2017 at 06:51:11PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > rules.  Unfortunatley, it contains quite a few holes.
>   ^
>   Unfortunately?
> 
> > While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
> ^^^
> synchronization?

lol, believe it or not, my english spelling is a lot better than my
korean.  Will fix them.

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> > rq->start_time = jiffies;
> > set_start_time_ns(rq);
> > rq->part = NULL;
> > +   seqcount_init(&rq->gstate_seq);
> > +   u64_stats_init(&rq->aborted_gstate_sync);
> >  }
> >  EXPORT_SYMBOL(blk_rq_init);
> 
> Sorry but the above change looks ugly to me. My understanding is that 
> blk_rq_init() is only used inside the block layer to initialize legacy block
> layer requests while gstate_seq and aborted_gstate_sync are only relevant
> for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> called for blk-mq requests such that the above change can be left out? The
> only callers outside the block layer core of blk_rq_init() I know of are
> ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> code if you want.

This is also used by flush path.  We probably should clean that up,
but let's worry about that later cuz flush handling has enough of its
own complications.

> > +   write_seqcount_begin(&rq->gstate_seq);
> > +   blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > +   blk_add_timer(rq);
> > +   write_seqcount_end(&rq->gstate_seq);
> 
> My understanding is that both write_seqcount_begin() and write_seqcount_end()
> trigger a write memory barrier. Is a seqcount really faster than a spinlock?

Write memory barrier has no cost on x86 and usually pretty low cost
elsewhere too and they're likely in the same cacheline as other rq
fields.

> > @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > +   /*
> > +* As nothing prevents from completion happening while
> > +* ->aborted_gstate is set, this may lead to ignored
> > +* completions and further spurious timeouts.
> > +*/
> > +   u64_stats_update_begin(&req->aborted_gstate_sync);
> > +   req->aborted_gstate = 0;
> > +   u64_stats_update_end(&req->aborted_gstate_sync);
> 
> If a blk-mq request is resubmitted 2**62 times, can that result in the above
> code setting aborted_gstate to the same value as gstate? Isn't that a bug?
> If so, how about setting aborted_gstate in the above code to e.g. gstate ^ 
> (2**63)?

A request gets aborted only if the state is in-flight, 0 isn't that.
Also, how many years would 2^62 times be?

> > +   struct u64_stats_sync aborted_gstate_sync;
> > +   u64 aborted_gstate;
> > +
> > unsigned long deadline;
> > struct list_head timeout_list;
> 
> Why are gstate and aborted_gstate 64-bit variables? What makes you think that
> 32 bits would not be enough?

Because 32 bits puts it in the range where a false hit is still
theoretically possible in a reasonable amount of time.

Thanks.

-- 
tejun


Re: [PATCH 1/6] blk-mq: protect completion path with RCU

2017-12-14 Thread t...@kernel.org
On Thu, Dec 14, 2017 at 05:01:06PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > +   } else {
> > +   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> > +   if (!blk_mark_rq_complete(rq))
> > +   __blk_mq_complete_request(rq);
> > +   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
> 
> Hello Tejun,
> 
> The name queue_rq_srcu was chosen to reflect the original use of that 
> structure,
> namely to protect .queue_rq() calls. Your patch series broadens the use of 
> that

Yeah, will add a patch to rename it.

> srcu structure so I would appreciate it if it would be renamed, e.g. into 
> "srcu".
> See also commit 6a83e74d214a ("blk-mq: Introduce blk_mq_quiesce_queue()").

Ah yeah, it'd be nice to have the [s]rcu synchronize calls factored
out.

Thanks.

-- 
tejun


Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED

2017-12-12 Thread t...@kernel.org
On Tue, Dec 12, 2017 at 10:20:39PM +, Bart Van Assche wrote:
> The above code should show all requests owned by the block driver. Patch
> "blk-mq-debugfs: Also show requests that have not yet been started" (not yet
> in Jens' tree) changes the REQ_ATOM_STARTED test into 
> list_empty(&rq->queuelist).
> Can that change be supported with the existing MQ_RQ_* states or will a new
> state have to be introduced to support this? See also
> https://marc.info/?l=linux-block=151252188411991.

If list_empty() test was correct before, it'd be correct now too.

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-12 Thread t...@kernel.org
Hello, Bart.

On Tue, Dec 12, 2017 at 09:37:11PM +, Bart Van Assche wrote:
> Have you considered the following instead of introducing MQ_RQ_IDLE and
> MQ_RQ_IN_FLIGHT? I think this could help to limit the number of new atomic
> operations introduced in the hot path by this patch series.

But nothing in the hot paths is atomic.

> static inline bool blk_mq_rq_in_flight(struct request *rq)
> {
>   return list_empty(&rq->queuelist);
> }

And the fact that we encode the generation number and state into a
single variable contributes to not needing atomic operations.
Breaking up the state and generation like the above would need more
synchronization, not less.

Thanks.

-- 
tejun


Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()

2017-09-11 Thread t...@kernel.org
Hello,

On Mon, Sep 11, 2017 at 04:09:12PM +, Bart Van Assche wrote:
> On Mon, 2017-09-11 at 06:13 -0700, Tejun Heo wrote:
> > On Fri, Sep 08, 2017 at 04:52:22PM -0700, Bart Van Assche wrote:
> > > The blk-mq core keeps track of the number of request queue users
> > > through q->q_usage_count. Make it possible to switch this counter
> > > to atomic mode from the context of the block layer power management
> > > code by introducing percpu_ref_switch_to_atomic_nowait().
> > 
> > Do you ever have to switch back?  If so, how do you know whether the
> > previous transition finished?
> 
> The way I would like to use this function is to check for completion of the
> transition by calling percpu_ref_is_zero(). That function namely not only
> checks the value of the refcount but also whether it is in atomic mode. See
> also "[PATCH 5/5] blk-mq: Implement power management support"
> (https://www.spinics.net/lists/linux-block/msg17143.html). Do you think this
> will work?

Probably but that sounds really hairy.  I'd much prefer if it could be
done through the usual kill / confirm / release and re-init.  That's
not a possibility?

Thanks.

-- 
tejun