Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-18 Thread Ming Lei
On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote:
> On Sun, 2017-09-17 at 20:40 +0800, Ming Lei wrote:
> > "if no request has completed before the delay has expired" can't be a
> > reason to rerun the queue, because the queue can still be busy.
> 
> That statement of you shows that there are important aspects of the SCSI
> core and dm-mpath driver that you don't understand.

Then can you tell me why blk-mq's SCHED_RESTART can't cover
the rerun when there are in-flight requests? What is the case
in which dm-rq can return BUSY while there aren't any in-flight
requests in the meantime?

Also, you are the author of adding 'blk_mq_delay_run_hw_queue(
hctx, 100/*ms*/)' to dm-rq, and you never explained in commit
6077c2d706097c0 (dm rq: Avoid that request processing stalls
sporadically) what the root cause of your request stall is or
why that patch fixes your issue. You don't even explain why
the delay is 100ms.

So it is a workaround, isn't it?

My concern is that it isn't good to add blk_mq_delay_run_hw_queue(hctx,
100/*ms*/) in the hot path, since that case should already be covered by
SCHED_RESTART when there are in-flight requests.
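To make the point concrete, the SCHED_RESTART mechanism works roughly
like the following (a simplified sketch, not the exact blk-mq code):
the hctx is marked when dispatch runs out of resources, and the
completion of any in-flight request reruns it, so no timer based rerun
should be needed while requests are in flight.

        /* sketch only: mark the hctx when dispatch sees BLK_STS_RESOURCE */
        static void mark_restart(struct blk_mq_hw_ctx *hctx)
        {
                if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                        set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
        }

        /* called from the completion path of an in-flight request */
        static void restart_on_completion(struct blk_mq_hw_ctx *hctx)
        {
                if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                        blk_mq_run_hw_queue(hctx, true);
        }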

> 
> > I suggest to understand the root cause, instead of keeping this
> > ugly random delay because run hw queue after 100ms may be useless
> > in 99.99% times.
> 
> If you are still looking at removing the blk_mq_delay_run_hw_queue() calls
> then I think you are looking in the wrong direction. What kind of problem
> are you trying to solve? Is it perhaps that there can be a delay between

Actually the improvement on dm-rq I/O scheduling (patches 2 - 5) doesn't
need this patch.


-- 
Ming


Re: [PATCH 00/10] nvme multipath support on top of nvme-4.13 branch

2017-09-18 Thread Anish Jhaveri
On Fri, Sep 15, 2017 at 08:07:01PM +0200, Christoph Hellwig wrote:
> Hi Anish,
> 
> I looked over the code a bit, and I'm rather confused by the newly
> added commands.  Which controller supports them?  Also the NVMe
> working group went down a very different way with the ALUA approch,
> which uses different grouping concepts and doesn't require path
> activations - for Linux we'd really like to stick to only standardized
> multipath implementations.
Hi Christoph, 
Thanks for looking over the code. The newly added commands were to
support an Asymmetric NVMe Namespace Subsystem. However, I can remove
those commands and make it an active-active configuration. I noticed
you have sent out a review with changes adding new structures and
functions, which makes it very similar in terms of implementation.
The only reason I proposed the given changes is that they don't
require any change to the kernel.



Re: [PATCH 00/10] nvme multipath support on top of nvme-4.13 branch

2017-09-18 Thread anish . jhaveri
On Wed, Sep 13, 2017 at 08:57:13AM +0200, Hannes Reinecke wrote:
> In general I am _not_ in favour of this approach.
> 
> This is essentially the same level of multipath support we had in the
> old qlogic and lpfc drivers in 2.4/2.6 series, and it took us _years_ to
> get rid of this.
> Main objection here is that it will be really hard (if not impossible)
> to use the same approach for other subsystems (eg SCSI), so we'll end up
> having different multipath implementations depending on which subsystem
> is being used.
> Which really is a maintenance nightmare.
> 
> I'm not averse to having other multipath implementations in the kernel,
> but it really should be abstracted so that other subsystems can _try_ to
> leverage it.

Got it. Thanks! 


Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Ming Lei
On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch  wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote:
>> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
>> > The nvme driver's use of blk_mq_reinit_tagset only happens during
>> > controller initialisation, but I'm seeing lost commands well after that
>> > during normal and stable running.
>> >
>> > The timing is pretty narrow to hit, but I'm pretty sure this is what's
>> > happening. For nvme, this occurs when nvme_timeout() runs concurrently
>> > with nvme_handle_cqe() for the same struct request. For scsi-mq,
>> > the same situation may arise if scsi_mq_done() runs concurrently with
>> > scsi_times_out().
>>
>> Hello Keith,
>>
>> Are you sure that scenario can happen? The blk-mq core calls 
>> test_and_set_bit()
>> for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
>> called. See also blk_mark_rq_complete(). This avoids that the .complete() and
>> .timeout() functions run concurrently.
>
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions

That shouldn't happen, because only one caller of blk_mark_rq_complete()
can win, and it is the winner's responsibility to complete the request,
so no request should be lost. Especially in your case, it really is the
responsibility of the timeout handler to complete the rq.

Also, your patch removes test_and_clear_bit(REQ_ATOM_STARTED) from
__blk_mq_requeue_request(), which will make timeouts easier to trigger,
since the period between starting the requeue and restarting the request
may be long enough that a timeout fires because STARTED isn't cleared.

-- 
Ming Lei


[PATCH V5 09/10] SCSI: transport_spi: resume a quiesced device

2017-09-18 Thread Ming Lei
We have to preempt-freeze the queue in scsi_device_quiesce() and
unfreeze it in scsi_device_resume(), so call scsi_device_resume()
for the device that was quiesced by scsi_device_quiesce().

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_transport_spi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/scsi_transport_spi.c 
b/drivers/scsi/scsi_transport_spi.c
index d0219e36080c..bdecf99b6ab1 100644
--- a/drivers/scsi/scsi_transport_spi.c
+++ b/drivers/scsi/scsi_transport_spi.c
@@ -1040,6 +1040,9 @@ spi_dv_device(struct scsi_device *sdev)
 
scsi_target_resume(starget);
 
+   /* undo what scsi_device_quiesce() did */
+   scsi_device_resume(sdev);
+
spi_initial_dv(starget) = 1;
 
  out_free:
-- 
2.9.5



[PATCH V5 10/10] SCSI: preempt freeze block queue when SCSI device is put into quiesce

2017-09-18 Thread Ming Lei
Simply quiescing the SCSI device and waiting for completion of the I/O
dispatched to the SCSI queue isn't safe: it is easy to use up the
request pool, because none of the previously allocated requests can
be dispatched once the device is put into QUIESCE. Then no request
can be allocated for RQF_PREEMPT, and the system may hang somewhere,
such as when sending sync_cache or start_stop commands during the
system suspend path.

Before quiescing SCSI, this patch freezes the block queue in preempt
mode first, so no new normal request can enter the queue any more,
and all pending requests are drained once blk_freeze_queue_preempt()
returns. Then RQF_PREEMPT requests can be allocated successfully
during the preempt freeze.

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..751a956b7b2b 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,10 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
 
-   req = blk_get_request(sdev->request_queue,
+   req = __blk_get_request(sdev->request_queue,
data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
+   BLK_REQ_PREEMPT);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -2928,12 +2929,28 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
int err;
 
+   /*
+* Simply quiescing the SCSI device isn't safe, as it is easy
+* to use up requests because all these allocated requests
+* can't be dispatched when the device is put in QUIESCE.
+* Then no request can be allocated and we may hang
+* somewhere, such as system suspend/resume.
+*
+* So we freeze the block queue in preempt mode first: no new
+* normal request can enter the queue any more, and all pending
+* requests are drained once blk_freeze_queue_preempt()
+* returns. Only RQF_PREEMPT is allowed in preempt freeze.
+*/
+   blk_freeze_queue_preempt(sdev->request_queue);
+
	mutex_lock(&sdev->state_mutex);
	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
	mutex_unlock(&sdev->state_mutex);
 
-   if (err)
+   if (err) {
+   blk_unfreeze_queue_preempt(sdev->request_queue);
return err;
+   }
 
scsi_run_queue(sdev->request_queue);
	while (atomic_read(&sdev->device_busy)) {
@@ -2964,6 +2981,8 @@ void scsi_device_resume(struct scsi_device *sdev)
scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
scsi_run_queue(sdev->request_queue);
	mutex_unlock(&sdev->state_mutex);
+
+   blk_unfreeze_queue_preempt(sdev->request_queue);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
-- 
2.9.5



[PATCH V5 07/10] block: introduce preempt version of blk_[freeze|unfreeze]_queue

2017-09-18 Thread Ming Lei
The two APIs are required to allow request allocation with
RQF_PREEMPT while the queue is preempt frozen.

We have to guarantee that normal freeze and preempt freeze
run exclusively, because for a normal freeze, once
blk_freeze_queue_wait() returns, no request can enter the
queue any more.

Another issue we have to pay attention to is the race
between preempt freeze and blk_cleanup_queue(); it is avoided
by not allowing preempt freeze after the queue becomes dying,
otherwise the preempt freeze may hang forever.
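
The intended pairing, as used by the SCSI patches later in this series,
looks roughly like this (sketch only):

        void preempt_freeze_example(struct request_queue *q)
        {
                blk_freeze_queue_preempt(q);    /* drains all normal requests */
                /* only RQF_PREEMPT allocation/dispatch is allowed here */
                blk_unfreeze_queue_preempt(q);  /* back to normal operation */
        }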

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   |   2 +
 block/blk-mq.c | 133 +++--
 block/blk.h|  11 
 include/linux/blk-mq.h |   2 +
 include/linux/blkdev.h |   6 +++
 5 files changed, 140 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 16131ea0f3b0..2010f8610203 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -905,6 +905,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
if (blkcg_init_queue(q))
goto fail_ref;
 
+   spin_lock_init(&q->freeze_lock);
+
return q;
 
 fail_ref:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0c472f5a9176..940cba14a3ad 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -118,19 +118,6 @@ void blk_mq_in_flight(struct request_queue *q, struct 
hd_struct *part,
	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
 }
 
-void blk_freeze_queue_start(struct request_queue *q)
-{
-   int freeze_depth;
-
-   freeze_depth = atomic_inc_return(&q->freeze_depth);
-   if (freeze_depth == 1) {
-   percpu_ref_kill(&q->q_usage_counter);
-   if (q->mq_ops)
-   blk_mq_run_hw_queues(q, false);
-   }
-}
-EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-
 void blk_freeze_queue_wait(struct request_queue *q)
 {
if (!q->mq_ops)
@@ -148,6 +135,69 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue 
*q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
 
+static bool queue_freeze_is_over(struct request_queue *q,
+   bool preempt, bool *queue_dying)
+{
+   /*
+* preempt freeze has to be prevented after queue is set as
+* dying, otherwise we may hang forever
+*/
+   if (preempt) {
+   spin_lock_irq(q->queue_lock);
+   *queue_dying = !!blk_queue_dying(q);
+   spin_unlock_irq(q->queue_lock);
+
+   return !q->normal_freezing || *queue_dying;
+   }
+   return !q->preempt_freezing;
+}
+
+static void __blk_freeze_queue_start(struct request_queue *q, bool preempt)
+{
+   int freeze_depth;
+   bool queue_dying;
+
+   /*
+* Make sure normal freeze and preempt freeze are run
+* exclusively, but each kind itself is allowed to be
+* run concurrently, even nested.
+*/
+   spin_lock(&q->freeze_lock);
+   wait_event_cmd(q->freeze_wq,
+  queue_freeze_is_over(q, preempt, &queue_dying),
+  spin_unlock(&q->freeze_lock),
+  spin_lock(&q->freeze_lock));
+
+   if (preempt && queue_dying)
+   goto unlock;
+
+   freeze_depth = atomic_inc_return(&q->freeze_depth);
+   if (freeze_depth == 1) {
+   if (preempt) {
+   q->preempt_freezing = 1;
+   q->preempt_unfreezing = 0;
+   } else
+   q->normal_freezing = 1;
+   spin_unlock(&q->freeze_lock);
+
+   percpu_ref_kill(&q->q_usage_counter);
+   if (q->mq_ops)
+   blk_mq_run_hw_queues(q, false);
+
+   /* have to drain I/O here for preempt quiesce */
+   if (preempt)
+   blk_freeze_queue_wait(q);
+   } else
+ unlock:
+   spin_unlock(&q->freeze_lock);
+}
+
+void blk_freeze_queue_start(struct request_queue *q)
+{
+   __blk_freeze_queue_start(q, false);
+}
+EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
+
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
@@ -166,20 +216,75 @@ void blk_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
-void blk_unfreeze_queue(struct request_queue *q)
+static void blk_start_unfreeze_queue_preempt(struct request_queue *q)
+{
+   /* no new request can be coming after unfreezing */
+   spin_lock(&q->freeze_lock);
+   q->preempt_unfreezing = 1;
+   spin_unlock(&q->freeze_lock);
+
+   blk_freeze_queue_wait(q);
+}
+
+static void __blk_unfreeze_queue(struct request_queue *q, bool preempt)
 {
int freeze_depth;
 
	freeze_depth = atomic_dec_return(&q->freeze_depth);
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
+   if (preempt)
+   

[PATCH V5 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-18 Thread Ming Lei
RQF_PREEMPT is a bit special because the request is required
to be dispatched to the LLD even when the SCSI device is quiesced.

So this patch introduces __blk_get_request() to allow the block
layer to allocate a request while the queue is preempt frozen, since
we will preempt-freeze the queue before quiescing the SCSI device in
the following patch, for supporting safe SCSI quiescing.
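
A caller then asks for a preempt request like this (this mirrors the
scsi_execute() change later in the series; error handling trimmed):

        req = __blk_get_request(sdev->request_queue,
                                data_direction == DMA_TO_DEVICE ?
                                REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN,
                                __GFP_RECLAIM, BLK_REQ_PREEMPT);
        if (IS_ERR(req))
                return PTR_ERR(req);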

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 33 +++--
 block/blk-mq.c |  3 +--
 block/blk.h| 13 +
 include/linux/blk-mq.h |  7 ---
 include/linux/blkdev.h | 17 ++---
 5 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2010f8610203..089ea337ca0b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -787,6 +787,20 @@ int blk_queue_enter(struct request_queue *q, unsigned 
flags)
	if (percpu_ref_tryget_live(&q->q_usage_counter))
return 0;
 
+   /*
+* If queue is preempt frozen and caller need to allocate
+* request for RQF_PREEMPT, we grab the .q_usage_counter
+* unconditionally and return successfully.
+*
+* There isn't race with queue cleanup because
+* both block legacy and blk-mq requires driver
+* to handle requests after queue is set as dying.
+*
+*/
+   if ((flags & BLK_REQ_PREEMPT) &&
+   blk_queue_enter_preempt_freeze(q))
+   return 0;
+
if (flags & BLK_REQ_NOWAIT)
return -EBUSY;
 
@@ -1410,7 +1424,8 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 }
 
 static struct request *blk_old_get_request(struct request_queue *q,
-  unsigned int op, gfp_t gfp_mask)
+  unsigned int op, gfp_t gfp_mask,
+  unsigned int flags)
 {
struct request *rq;
int ret = 0;
@@ -1420,8 +1435,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   BLK_REQ_NOWAIT : 0);
+   ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -1439,26 +1453,25 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-   gfp_t gfp_mask)
+struct request *__blk_get_request(struct request_queue *q, unsigned int op,
+ gfp_t gfp_mask, unsigned int flags)
 {
struct request *req;
 
+   flags |= gfp_mask & __GFP_DIRECT_RECLAIM ? 0 : BLK_REQ_NOWAIT;
if (q->mq_ops) {
-   req = blk_mq_alloc_request(q, op,
-   (gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   0 : BLK_MQ_REQ_NOWAIT);
+   req = blk_mq_alloc_request(q, op, flags);
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
-   req = blk_old_get_request(q, op, gfp_mask);
+   req = blk_old_get_request(q, op, gfp_mask, flags);
if (!IS_ERR(req) && q->initialize_rq_fn)
q->initialize_rq_fn(req);
}
 
return req;
 }
-EXPORT_SYMBOL(blk_get_request);
+EXPORT_SYMBOL(__blk_get_request);
 
 /**
  * blk_requeue_request - put a request back on queue
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 940cba14a3ad..49aa039da469 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -482,8 +482,7 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
int ret;
 
-   ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
-   BLK_REQ_NOWAIT : 0);
+   ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
if (ret)
return ERR_PTR(ret);
 
diff --git a/block/blk.h b/block/blk.h
index 243b2e7e5098..8737ae12318e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -90,6 +90,19 @@ static inline bool blk_queue_is_preempt_frozen(struct 
request_queue *q)
return preempt_frozen;
 }
 
+static inline bool blk_queue_enter_preempt_freeze(struct request_queue *q)
+{
+   bool preempt_frozen;
+
+   spin_lock(&q->freeze_lock);
+   preempt_frozen = q->preempt_freezing && !q->preempt_unfreezing;
+   if (preempt_frozen)
+   

[PATCH V5 05/10] block: rename .mq_freeze_wq and .mq_freeze_depth

2017-09-18 Thread Ming Lei
Both are used for legacy and blk-mq, so rename them
to .freeze_wq and .freeze_depth to avoid confusing
people.

No functional change.

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 12 ++--
 block/blk-mq.c | 12 ++--
 include/linux/blkdev.h |  4 ++--
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 97837a5bf838..8889920567d2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -630,7 +630,7 @@ void blk_set_queue_dying(struct request_queue *q)
 * We need to ensure that processes currently waiting on
 * the queue are notified as well.
 */
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -793,14 +793,14 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
/*
 * read pair of barrier in blk_freeze_queue_start(),
 * we need to order reading __PERCPU_REF_DEAD flag of
-* .q_usage_counter and reading .mq_freeze_depth or
+* .q_usage_counter and reading .freeze_depth or
 * queue dying flag, otherwise the following wait may
 * never return if the two reads are reordered.
 */
smp_rmb();
 
-   ret = wait_event_interruptible(q->mq_freeze_wq,
-   !atomic_read(&q->mq_freeze_depth) ||
+   ret = wait_event_interruptible(q->freeze_wq,
+   !atomic_read(&q->freeze_depth) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
@@ -819,7 +819,7 @@ static void blk_queue_usage_counter_release(struct 
percpu_ref *ref)
struct request_queue *q =
container_of(ref, struct request_queue, q_usage_counter);
 
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->freeze_wq);
 }
 
 static void blk_rq_timed_out_timer(unsigned long data)
@@ -891,7 +891,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
q->bypass_depth = 1;
__set_bit(QUEUE_FLAG_BYPASS, >queue_flags);
 
-   init_waitqueue_head(&q->mq_freeze_wq);
+   init_waitqueue_head(&q->freeze_wq);
 
/*
 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1e9bf2b987a1..988da93d85a4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -122,7 +122,7 @@ void blk_freeze_queue_start(struct request_queue *q)
 {
int freeze_depth;
 
-   freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
+   freeze_depth = atomic_inc_return(&q->freeze_depth);
	if (freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
if (q->mq_ops)
@@ -135,14 +135,14 @@ void blk_freeze_queue_wait(struct request_queue *q)
 {
if (!q->mq_ops)
blk_drain_queue(q);
-   wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+   wait_event(q->freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout)
 {
-   return wait_event_timeout(q->mq_freeze_wq,
+   return wait_event_timeout(q->freeze_wq,
	percpu_ref_is_zero(&q->q_usage_counter),
timeout);
 }
@@ -170,11 +170,11 @@ void blk_unfreeze_queue(struct request_queue *q)
 {
int freeze_depth;
 
-   freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
+   freeze_depth = atomic_dec_return(&q->freeze_depth);
	WARN_ON_ONCE(freeze_depth < 0);
	if (!freeze_depth) {
		percpu_ref_reinit(&q->q_usage_counter);
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->freeze_wq);
}
 }
 EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
@@ -2440,7 +2440,7 @@ void blk_mq_free_queue(struct request_queue *q)
 /* Basically redo blk_mq_init_queue with queue frozen */
 static void blk_mq_queue_reinit(struct request_queue *q)
 {
-   WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
+   WARN_ON_ONCE(!atomic_read(&q->freeze_depth));
 
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 460294bb0fa5..b8053bcc6b5f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -564,7 +564,7 @@ struct request_queue {
	struct mutex		sysfs_lock;

	int			bypass_depth;
-   atomic_t		mq_freeze_depth;
+   atomic_t		freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
bsg_job_fn  

[PATCH V5 06/10] block: pass flags to blk_queue_enter()

2017-09-18 Thread Ming Lei
We need to pass PREEMPT flags to blk_queue_enter()
for allocating requests with RQF_PREEMPT in the
following patch.

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 10 ++
 block/blk-mq.c |  5 +++--
 block/blk-timeout.c|  2 +-
 fs/block_dev.c |  4 ++--
 include/linux/blkdev.h |  7 ++-
 5 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8889920567d2..16131ea0f3b0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -779,7 +779,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+int blk_queue_enter(struct request_queue *q, unsigned flags)
 {
while (true) {
int ret;
@@ -787,7 +787,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
	if (percpu_ref_tryget_live(&q->q_usage_counter))
return 0;
 
-   if (nowait)
+   if (flags & BLK_REQ_NOWAIT)
return -EBUSY;
 
/*
@@ -1418,7 +1418,8 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
+   BLK_REQ_NOWAIT : 0);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -2225,7 +2226,8 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bio->bi_disk->queue;
 
-   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
+   if (likely(blk_queue_enter(q, (bio->bi_opf & REQ_NOWAIT) ?
+   BLK_REQ_NOWAIT : 0) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 988da93d85a4..0c472f5a9176 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -377,7 +377,8 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
int ret;
 
-   ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+   ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
+   BLK_REQ_NOWAIT : 0);
if (ret)
return ERR_PTR(ret);
 
@@ -416,7 +417,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,
if (hctx_idx >= q->nr_hw_queues)
return ERR_PTR(-EIO);
 
-   ret = blk_queue_enter(q, true);
+   ret = blk_queue_enter(q, BLK_REQ_NOWAIT);
if (ret)
return ERR_PTR(ret);
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 17ec83bb0900..e803106a5e5b 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,7 +134,7 @@ void blk_timeout_work(struct work_struct *work)
struct request *rq, *tmp;
int next_set = 0;
 
-   if (blk_queue_enter(q, true))
+   if (blk_queue_enter(q, BLK_REQ_NOWAIT))
return;
spin_lock_irqsave(q->queue_lock, flags);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..98cf2d7ee9d3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -674,7 +674,7 @@ int bdev_read_page(struct block_device *bdev, sector_t 
sector,
if (!ops->rw_page || bdev_get_integrity(bdev))
return result;
 
-   result = blk_queue_enter(bdev->bd_queue, false);
+   result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
@@ -710,7 +710,7 @@ int bdev_write_page(struct block_device *bdev, sector_t 
sector,
 
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
-   result = blk_queue_enter(bdev->bd_queue, false);
+   result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b8053bcc6b5f..54450715915b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -857,6 +857,11 @@ enum {
BLKPREP_INVALID,/* invalid command, kill, return -EREMOTEIO */
 };
 
+/* passed to blk_queue_enter */
+enum {
+   BLK_REQ_NOWAIT = (1 << 0),
+};
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 /*
@@ -962,7 +967,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct 
gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 struct scsi_ioctl_command __user *);
 
-extern int 

[PATCH V5 01/10] blk-mq: only run hw queues for blk-mq

2017-09-18 Thread Ming Lei
This patch just makes that explicit.

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..6fd9f86fc86d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
	if (freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
-   blk_mq_run_hw_queues(q, false);
+   if (q->mq_ops)
+   blk_mq_run_hw_queues(q, false);
}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5



[PATCH V5 04/10] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait

2017-09-18 Thread Ming Lei
The only change on legacy is that blk_drain_queue() is run
from blk_freeze_queue(), which is called in blk_cleanup_queue().

So this patch removes the explicit call of __blk_drain_queue() in
blk_cleanup_queue().

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-core.c | 17 +++--
 block/blk-mq.c   |  8 +---
 block/blk.h  |  1 +
 drivers/nvme/host/core.c |  2 +-
 include/linux/blk-mq.h   |  2 +-
 5 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index abfba798ee03..97837a5bf838 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -530,6 +530,21 @@ static void __blk_drain_queue(struct request_queue *q, 
bool drain_all)
 }
 
 /**
+ * blk_drain_queue - drain requests from request_queue
+ * @q: queue to drain
+ *
+ * Drain requests from @q.  All pending requests are drained.
+ * The caller is responsible for ensuring that no new requests
+ * which need to be drained are queued.
+ */
+void blk_drain_queue(struct request_queue *q)
+{
+   spin_lock_irq(q->queue_lock);
+   __blk_drain_queue(q, true);
+   spin_unlock_irq(q->queue_lock);
+}
+
+/**
  * blk_queue_bypass_start - enter queue bypass mode
  * @q: queue of interest
  *
@@ -659,8 +674,6 @@ void blk_cleanup_queue(struct request_queue *q)
 */
blk_freeze_queue(q);
spin_lock_irq(lock);
-   if (!q->mq_ops)
-   __blk_drain_queue(q, true);
queue_flag_set(QUEUE_FLAG_DEAD, q);
spin_unlock_irq(lock);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1acebbd1fbd4..1e9bf2b987a1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -131,11 +131,13 @@ void blk_freeze_queue_start(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
-void blk_mq_freeze_queue_wait(struct request_queue *q)
+void blk_freeze_queue_wait(struct request_queue *q)
 {
+   if (!q->mq_ops)
+   blk_drain_queue(q);
	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
+EXPORT_SYMBOL_GPL(blk_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout)
@@ -160,7 +162,7 @@ void blk_freeze_queue(struct request_queue *q)
 * exported to drivers as the only user for unfreeze is blk_mq.
 */
blk_freeze_queue_start(q);
-   blk_mq_freeze_queue_wait(q);
+   blk_freeze_queue_wait(q);
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
diff --git a/block/blk.h b/block/blk.h
index fcb9775b997d..21eed59d96db 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -64,6 +64,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request 
*rq,
struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
+void blk_drain_queue(struct request_queue *q);
 void __blk_queue_free_tags(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a0075d40cfef..b2dc14882ee8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2907,7 +2907,7 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 
	mutex_lock(&ctrl->namespaces_mutex);
	list_for_each_entry(ns, &ctrl->namespaces, list)
-   blk_mq_freeze_queue_wait(ns->queue);
+   blk_freeze_queue_wait(ns->queue);
	mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 355d74507656..62c3d1f7d12a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -256,7 +256,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
-void blk_mq_freeze_queue_wait(struct request_queue *q);
+void blk_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
-- 
2.9.5



[PATCH V5 0/10] block/scsi: safe SCSI quiescing

2017-09-18 Thread Ming Lei
Hi,

The current SCSI quiesce isn't safe: it is easy to trigger an I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for
RQF_PREEMPT can be dispatched to SCSI successfully, and
scsi_device_quiesce() simply waits for completion of the I/Os already
dispatched to the SCSI stack. That isn't enough at all.

New requests can still be coming in, but none of the allocated
requests can be dispatched successfully, so the request pool can be
consumed up easily.

Then a request with RQF_PREEMPT can't be allocated and waits forever,
while scsi_device_resume() waits for completion of the RQF_PREEMPT
request, so the system hangs forever, for example during system suspend
or while sending SCSI domain validation.

Both IO hang inside system suspend[1] or SCSI domain validation
were reported before.

This patchset introduces preempt freeze and solves the issue
by preempt-freezing the block queue during SCSI quiesce, allowing
allocation of RQF_PREEMPT requests while the queue is in this state.
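
In short, scsi_device_quiesce()/scsi_device_resume() end up doing
something like the following (heavily simplified, see patch 10 for the
real code):

        blk_freeze_queue_preempt(sdev->request_queue);  /* drain normal I/O */
        scsi_device_set_state(sdev, SDEV_QUIESCE);
        /* suspend/resume or SPI DV work runs, using RQF_PREEMPT requests */
        scsi_device_set_state(sdev, SDEV_RUNNING);
        blk_unfreeze_queue_preempt(sdev->request_queue);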

Oleksandr verified that V4 does fix the hang during suspend/resume,
and Cathy verified that V4 fixes the hang in sending SCSI domain
validation via transport_spi.

Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset fixes
them both by introducing/unifying blk_freeze_queue_preempt() and
blk_unfreeze_queue_preempt(), and cleanup is done along the way.

The patchset can be found in the following gitweb:

https://github.com/ming1/linux/tree/blk_safe_scsi_quiesce_V5

V5:
- fix one tiny race by introducing blk_queue_enter_preempt_freeze()
given this change is small enough compared with V4, I added
tested-by directly

V4:
- reorganize patch order to make it more reasonable
- support nested preempt freeze, as required by SCSI transport spi
- check preempt freezing in the slow path of blk_queue_enter()
- add "SCSI: transport_spi: resume a quiesced device"
- wake up freeze queue in setting dying for both blk-mq and legacy
- rename blk_mq_[freeze|unfreeze]_queue() in one patch
- rename .mq_freeze_wq and .mq_freeze_depth
- improve comment

V3:
- introduce q->preempt_unfreezing to fix one bug of preempt freeze
- call blk_queue_enter_live() only when queue is preempt frozen
- cleanup a bit on the implementation of preempt freeze
- only patch 6 and 7 are changed

V2:
- drop the 1st patch in V1 because percpu_ref_is_dying() is
enough as pointed by Tejun
- introduce preempt version of blk_[freeze|unfreeze]_queue
- sync between preempt freeze and normal freeze
- fix warning from percpu-refcount as reported by Oleksandr


[1] https://marc.info/?t=150340250100013&r=3&w=2


Thanks,
Ming

Ming Lei (10):
  blk-mq: only run hw queues for blk-mq
  block: tracking request allocation with q_usage_counter
  blk-mq: rename blk_mq_[freeze|unfreeze]_queue
  blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
  block: rename .mq_freeze_wq and .mq_freeze_depth
  block: pass flags to blk_queue_enter()
  block: introduce preempt version of blk_[freeze|unfreeze]_queue
  block: allow to allocate req with RQF_PREEMPT when queue is preempt
frozen
  SCSI: transport_spi: resume a quiesced device
  SCSI: preempt freeze block queue when SCSI device is put into quiesce

 block/bfq-iosched.c   |   2 +-
 block/blk-cgroup.c|   8 +-
 block/blk-core.c  |  80 +
 block/blk-mq.c| 180 --
 block/blk-mq.h|   1 -
 block/blk-timeout.c   |   2 +-
 block/blk.h   |  25 ++
 block/elevator.c  |   4 +-
 drivers/block/loop.c  |  24 ++---
 drivers/block/rbd.c   |   2 +-
 drivers/nvme/host/core.c  |   8 +-
 drivers/scsi/scsi_lib.c   |  25 +-
 drivers/scsi/scsi_transport_spi.c |   3 +
 fs/block_dev.c|   4 +-
 include/linux/blk-mq.h|  15 ++--
 include/linux/blkdev.h|  32 +--
 16 files changed, 311 insertions(+), 104 deletions(-)

-- 
2.9.5



[PATCH V5 02/10] block: tracking request allocation with q_usage_counter

2017-09-18 Thread Ming Lei
This usage is basically the same as blk-mq's, so that we can
support freezing the legacy queue easily.

Also 'wake_up_all(&q->mq_freeze_wq)' has to be moved
into blk_set_queue_dying(), since both legacy and blk-mq
may wait on the .mq_freeze_wq wait queue.

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-core.c | 14 ++
 block/blk-mq.c   |  7 ---
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index aebe676225e6..abfba798ee03 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,12 @@ void blk_set_queue_dying(struct request_queue *q)
}
spin_unlock_irq(q->queue_lock);
}
+
+   /*
+* We need to ensure that processes currently waiting on
+* the queue are notified as well.
+*/
+   wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -1392,16 +1398,21 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
   unsigned int op, gfp_t gfp_mask)
 {
struct request *rq;
+   int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
 
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+   if (ret)
+   return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
+   blk_queue_exit(q);
return rq;
}
 
@@ -1573,6 +1584,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
+   blk_queue_exit(q);
}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1854,8 +1866,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 * Grab a free request. This is might sleep but can not fail.
 * Returns with the queue unlocked.
 */
+   blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
if (IS_ERR(req)) {
+   blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fd9f86fc86d..10c1f49f663d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -256,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_wakeup_all(hctx->tags, true);
-
-   /*
-* If we are called because the queue has now been marked as
-* dying, we need to ensure that processes currently waiting on
-* the queue are notified as well.
-*/
-   wake_up_all(&q->mq_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
-- 
2.9.5



[PATCH V5 03/10] blk-mq: rename blk_mq_[freeze|unfreeze]_queue

2017-09-18 Thread Ming Lei
We will support freezing the queue on the legacy block path too.

No functional change.

Tested-by: Cathy Avery 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/bfq-iosched.c  |  2 +-
 block/blk-cgroup.c   |  8 
 block/blk-mq.c   | 27 +--
 block/blk-mq.h   |  1 -
 block/elevator.c |  4 ++--
 drivers/block/loop.c | 24 
 drivers/block/rbd.c  |  2 +-
 drivers/nvme/host/core.c |  6 +++---
 include/linux/blk-mq.h   |  4 ++--
 9 files changed, 34 insertions(+), 44 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a4783da90ba8..a18f36bfbdf0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4758,7 +4758,7 @@ static int bfq_init_queue(struct request_queue *q, struct 
elevator_type *e)
 * The invocation of the next bfq_create_group_hierarchy
 * function is the head of a chain of function calls
 * (bfq_create_group_hierarchy->blkcg_activate_policy->
-* blk_mq_freeze_queue) that may lead to the invocation of the
+* blk_freeze_queue) that may lead to the invocation of the
 * has_work hook function. For this reason,
 * bfq_create_group_hierarchy is invoked only after all
 * scheduler data has been initialized, apart from the fields
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d3f56baee936..ffc984381e4b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1298,7 +1298,7 @@ int blkcg_activate_policy(struct request_queue *q,
return 0;
 
if (q->mq_ops)
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
else
blk_queue_bypass_start(q);
 pd_prealloc:
@@ -1339,7 +1339,7 @@ int blkcg_activate_policy(struct request_queue *q,
spin_unlock_irq(q->queue_lock);
 out_bypass_end:
if (q->mq_ops)
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
else
blk_queue_bypass_end(q);
if (pd_prealloc)
@@ -1365,7 +1365,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
return;
 
if (q->mq_ops)
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
else
blk_queue_bypass_start(q);
 
@@ -1390,7 +1390,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
spin_unlock_irq(q->queue_lock);
 
if (q->mq_ops)
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
else
blk_queue_bypass_end(q);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 10c1f49f663d..1acebbd1fbd4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -162,18 +162,9 @@ void blk_freeze_queue(struct request_queue *q)
blk_freeze_queue_start(q);
blk_mq_freeze_queue_wait(q);
 }
+EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
-void blk_mq_freeze_queue(struct request_queue *q)
-{
-   /*
-* ...just an alias to keep freeze and unfreeze actions balanced
-* in the blk_mq_* namespace
-*/
-   blk_freeze_queue(q);
-}
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
-
-void blk_mq_unfreeze_queue(struct request_queue *q)
+void blk_unfreeze_queue(struct request_queue *q)
 {
int freeze_depth;
 
@@ -184,7 +175,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
		wake_up_all(&q->mq_freeze_wq);
}
 }
-EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
@@ -2192,9 +2183,9 @@ static void blk_mq_update_tag_set_depth(struct 
blk_mq_tag_set *set,
	lockdep_assert_held(&set->tag_list_lock);

	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
queue_set_hctx_shared(q, shared);
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
}
 }
 
@@ -2625,7 +2616,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
if (!set)
return -EINVAL;
 
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
 
ret = 0;
queue_for_each_hw_ctx(q, hctx, i) {
@@ -2650,7 +2641,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
if (!ret)
q->nr_requests = nr;
 
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
 
return ret;
 }
@@ -2668,7 +2659,7 @@ static void __blk_mq_update_nr_hw_queues(struct 
blk_mq_tag_set *set,
return;
 
	list_for_each_entry(q, &set->tag_list, tag_set_list)
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
 
set->nr_hw_queues = nr_hw_queues;
blk_mq_update_queue_map(set);
@@ -2678,7 +2669,7 @@ static void 

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 11:14:38PM +, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> > On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote:
> > > Are you sure that scenario can happen? The blk-mq core calls 
> > > test_and_set_bit()
> > > for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
> > > called. See also blk_mark_rq_complete(). This avoids that the .complete() 
> > > and
> > > .timeout() functions run concurrently.
> > 
> > Indeed that prevents .complete from running concurrently with the
> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> > callbacks. These are the LLD functions that call blk_mq_complete_request
> > well before .complete. If the driver calls blk_mq_complete_request on
> > a request that blk-mq is timing out, the request is lost because blk-mq
> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
> > from running at the same time as the timeout handler.
> 
> Can you explain how you define "request is lost"? 

Sure, what I mean by "lost" is when nothing will call
__blk_mq_complete_request, which is required to make forward progress on
that request.

If a driver calls blk_mq_complete_request on a request being checked
by the timeout handler, blk-mq will return immediately instead of
making progress toward completion, since blk-mq already set
REQ_ATOM_COMPLETE while running the timeout handler.

> If a timeout occurs for a
> SCSI request then scsi_times_out() calls scsi_abort_command() (if no
> .eh_timed_out() callback has been defined by the LLD). It is the 
> responsibility
> of the SCSI LLD to call .scsi_done() before its .eh_abort_handler() returns
> SUCCESS. If .eh_abort_handler() returns a value other than SUCCESS then the
> SCSI core will escalate the error further until .scsi_done() has been called 
> for
> the command that timed out. See also scsi_abort_eh_cmnd(). So I think what you
> wrote is not correct for the SCSI core and a properly implemented SCSI LLD. 

Once blk-mq's blk_mq_check_expired calls blk_mark_rq_complete, an
LLD calling blk_mq_complete_request does absolutely nothing because
REQ_ATOM_COMPLETE is already set.

There's nothing stopping scsi_mq_done from running at the same time as
blk-mq's timeout handler.

It doesn't matter what path .scsi_done is called now. That will just
call scsi_mq_done -> blk_mq_complete_request, and since
REQ_ATOM_COMPLETE is already set, the command won't be completed.

The only way to complete that request now is if the timeout
handler returns BLK_EH_HANDLED, but the scsi-mq abort path returns
BLK_EH_NOT_HANDLED on success (very few drivers actually return
BLK_EH_HANDLED).

After the timeout handler runs, it's once again possible for the driver to
complete the request if it returned BLK_EH_RESET_TIMER, but if the driver
already completed the command during the timeout handler, how is the
driver supposed to know it needs to complete the request a second time?
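
For reference, blk_mq_complete_request() boils down to roughly the
following (simplified, the fake-timeout check is omitted), which is why
the LLD's completion is silently dropped once the timeout handler has
marked the request complete:

        void blk_mq_complete_request(struct request *rq)
        {
                /* blk_mark_rq_complete() is test_and_set_bit(REQ_ATOM_COMPLETE) */
                if (!blk_mark_rq_complete(rq))
                        __blk_mq_complete_request(rq);
                /*
                 * If blk_mq_check_expired() already set REQ_ATOM_COMPLETE,
                 * we return without doing anything and nothing else will
                 * ever call __blk_mq_complete_request() for rq.
                 */
        }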


Re: [PATCH] block: move sanity checking ahead of bi_front/back_seg_size updating

2017-09-18 Thread jianchao.wang


On 09/19/2017 07:51 AM, Christoph Hellwig wrote:
> On Sat, Sep 16, 2017 at 07:10:30AM +0800, Jianchao Wang wrote:
>> If the bio_integrity_merge_rq() return false or nr_phys_segments exceeds
>> the max_segments, the merging fails, but the bi_front/back_seg_size may
>> have been modified. To avoid it, move the sanity checking ahead.
>>
>> Signed-off-by: Jianchao Wang 
> 
> This looks fine to me:
> 
> Reviewed-by: Christoph Hellwig 
> 
> But can you elaborate a little more on how this found and if there
> is a way to easily reproduce it, say for a blktests test case?
> 
It was found when I made the patch
'block: consider merge of segments when merge bio into rq', not from an
actual issue or test case.

Thanks
Jianchao


Re: [PATCH v6 1/2] blktrace: Fix potentail deadlock between delete & sysfs ops

2017-09-18 Thread Christoph Hellwig
Taking a look at this it seems like using a lock in struct block_device
isn't the right thing to do anyway - all the action is on fields in
struct blk_trace, so having a lock inside that would make a lot more
sense.

It would also help to document what exactly we're actually protecting.
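
Roughly what I have in mind (untested sketch; the "lock" field name is
made up and the existing blk_trace members are omitted):

        struct blk_trace {
                /* ... existing fields (trace_state, dropped, ...) ... */
                struct mutex lock;      /* serialises setup/teardown vs sysfs ops */
        };

        static ssize_t sysfs_blk_trace_attr_store_sketch(struct blk_trace *bt)
        {
                mutex_lock(&bt->lock);
                /* ... act on bt->trace_state and friends ... */
                mutex_unlock(&bt->lock);
                return 0;
        }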


Re: [PATCH] block: move sanity checking ahead of bi_front/back_seg_size updating

2017-09-18 Thread Christoph Hellwig
On Sat, Sep 16, 2017 at 07:10:30AM +0800, Jianchao Wang wrote:
> If the bio_integrity_merge_rq() return false or nr_phys_segments exceeds
> the max_segments, the merging fails, but the bi_front/back_seg_size may
> have been modified. To avoid it, move the sanity checking ahead.
> 
> Signed-off-by: Jianchao Wang 

This looks fine to me:

Reviewed-by: Christoph Hellwig 

But can you elaborate a little more on how this found and if there
is a way to easily reproduce it, say for a blktests test case?


Re: [PATCH v6 2/2] block_dev: Rename bd_fsfreeze_mutex

2017-09-18 Thread Christoph Hellwig
Don't rename it to a way too long name.  Either add a separate mutex
for your purpose (unless there is interaction between freezing and
blktrace, which I doubt), or properly comment the usage.


[PATCH 9/9] nvme: implement multipath access to nvme subsystems

2017-09-18 Thread Christoph Hellwig
This patch adds initial multipath support to the nvme driver.  For each
namespace we create a new block device node, which can be used to access
that namespace through any of the controllers that refer to it.

Currently we will always send I/O to the first available path; this will
be changed once the NVMe Asymmetric Namespace Access (ANA) TP is
ratified and implemented, at which point we will look at the ANA state
for each namespace.  Another possibility that was prototyped is to
use the path that is closest to the submitting NUMA node, which will be
mostly interesting for PCI, but might also be useful for RDMA or FC
transports in the future.  There is no plan to implement round robin
or I/O service time path selectors, as those are not scalable with
the performance rates provided by NVMe.
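
The "first available path" selection amounts to something like the
following (sketch only, field names approximate the ones used in this
patch):

        static struct nvme_ns *nvme_find_path_sketch(struct nvme_ns_head *head)
        {
                struct nvme_ns *ns;

                /* the caller holds the SRCU read lock on head->srcu */
                list_for_each_entry_rcu(ns, &head->list, siblings) {
                        if (ns->ctrl->state == NVME_CTRL_LIVE)
                                return ns;      /* first live path wins */
                }
                return NULL;
        }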

The multipath device will go away once all paths to it disappear,
any delay to keep it alive needs to be implemented at the controller
level.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 264 ---
 drivers/nvme/host/nvme.h |  11 ++
 2 files changed, 259 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3e8405fd57a9..5449c83a9dc3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -77,6 +77,8 @@ static DEFINE_MUTEX(nvme_subsystems_lock);
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
+static DEFINE_IDA(nvme_disk_ida);
+
 static struct class *nvme_class;
 
 static __le32 nvme_get_log_dw10(u8 lid, size_t size)
@@ -104,6 +106,19 @@ static int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl)
return ret;
 }
 
+static void nvme_failover_req(struct request *req)
+{
+   struct nvme_ns *ns = req->q->queuedata;
+   unsigned long flags;
+
+   spin_lock_irqsave(&ns->head->requeue_lock, flags);
+   blk_steal_bios(&ns->head->requeue_list, req);
+   spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
+
+   nvme_reset_ctrl(ns->ctrl);
+   kblockd_schedule_work(&ns->head->requeue_work);
+}
+
 static blk_status_t nvme_error_status(struct request *req)
 {
switch (nvme_req(req)->status & 0x7ff) {
@@ -131,6 +146,53 @@ static blk_status_t nvme_error_status(struct request *req)
}
 }
 
+static bool nvme_req_needs_failover(struct request *req)
+{
+   if (!(req->cmd_flags & REQ_NVME_MPATH))
+   return false;
+
+   switch (nvme_req(req)->status & 0x7ff) {
+   /*
+* Generic command status:
+*/
+   case NVME_SC_INVALID_OPCODE:
+   case NVME_SC_INVALID_FIELD:
+   case NVME_SC_INVALID_NS:
+   case NVME_SC_LBA_RANGE:
+   case NVME_SC_CAP_EXCEEDED:
+   case NVME_SC_RESERVATION_CONFLICT:
+   return false;
+
+   /*
+* I/O command set specific error.  Unfortunately these values are
+* reused for fabrics commands, but those should never get here.
+*/
+   case NVME_SC_BAD_ATTRIBUTES:
+   case NVME_SC_INVALID_PI:
+   case NVME_SC_READ_ONLY:
+   case NVME_SC_ONCS_NOT_SUPPORTED:
+   WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
+   nvme_fabrics_command);
+   return false;
+
+   /*
+* Media and Data Integrity Errors:
+*/
+   case NVME_SC_WRITE_FAULT:
+   case NVME_SC_READ_ERROR:
+   case NVME_SC_GUARD_CHECK:
+   case NVME_SC_APPTAG_CHECK:
+   case NVME_SC_REFTAG_CHECK:
+   case NVME_SC_COMPARE_FAILED:
+   case NVME_SC_ACCESS_DENIED:
+   case NVME_SC_UNWRITTEN_BLOCK:
+   return false;
+   }
+
+   /* Everything else could be a path failure, so should be retried */
+   return true;
+}
+
 static inline bool nvme_req_needs_retry(struct request *req)
 {
if (blk_noretry_request(req))
@@ -145,6 +207,11 @@ static inline bool nvme_req_needs_retry(struct request 
*req)
 void nvme_complete_rq(struct request *req)
 {
if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
+   if (nvme_req_needs_failover(req)) {
+   nvme_failover_req(req);
+   return;
+   }
+
nvme_req(req)->retries++;
blk_mq_requeue_request(req, true);
return;
@@ -173,6 +240,18 @@ void nvme_cancel_request(struct request *req, void *data, 
bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
+static void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   mutex_lock(&ctrl->namespaces_mutex);
+   list_for_each_entry(ns, &ctrl->namespaces, list) {
+   if (ns->head)
+   kblockd_schedule_work(&ns->head->requeue_work);
+   }
+   mutex_unlock(&ctrl->namespaces_mutex);
+}
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
enum nvme_ctrl_state new_state)
 {
@@ -240,9 +319,10 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 
if 

[PATCH 7/9] nvme: introduce a nvme_ns_ids structure

2017-09-18 Thread Christoph Hellwig
This allows us to manage the various unique namespace identifiers
together instead of needing various variables and arguments.
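
For reference, the new structure boils down to the three identifiers we
already track (sketch, see the nvme.h hunk of this patch for the actual
definition):

        struct nvme_ns_ids {
                u8      eui64[8];
                u8      nguid[16];
                uuid_t  uuid;
        };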

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 69 +++-
 drivers/nvme/host/nvme.h | 14 +++---
 2 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4291119a6bc9..9da538d7ca87 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -781,7 +781,7 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct 
nvme_id_ctrl **id)
 }
 
 static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
-   u8 *eui64, u8 *nguid, uuid_t *uuid)
+   struct nvme_ns_ids *ids)
 {
struct nvme_command c = { };
int status;
@@ -817,7 +817,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, 
unsigned nsid,
goto free_data;
}
len = NVME_NIDT_EUI64_LEN;
-   memcpy(eui64, data + pos + sizeof(*cur), len);
+   memcpy(ids->eui64, data + pos + sizeof(*cur), len);
break;
case NVME_NIDT_NGUID:
if (cur->nidl != NVME_NIDT_NGUID_LEN) {
@@ -827,7 +827,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, 
unsigned nsid,
goto free_data;
}
len = NVME_NIDT_NGUID_LEN;
-   memcpy(nguid, data + pos + sizeof(*cur), len);
+   memcpy(ids->nguid, data + pos + sizeof(*cur), len);
break;
case NVME_NIDT_UUID:
if (cur->nidl != NVME_NIDT_UUID_LEN) {
@@ -837,7 +837,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, 
unsigned nsid,
goto free_data;
}
len = NVME_NIDT_UUID_LEN;
-   uuid_copy(uuid, data + pos + sizeof(*cur));
+   uuid_copy(&ids->uuid, data + pos + sizeof(*cur));
break;
default:
/* Skip unnkown types */
@@ -1178,22 +1178,31 @@ static void nvme_config_discard(struct nvme_ns *ns)
 }
 
 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
-   struct nvme_id_ns *id, u8 *eui64, u8 *nguid, uuid_t *uuid)
+   struct nvme_id_ns *id, struct nvme_ns_ids *ids)
 {
+   memset(ids, 0, sizeof(*ids));
+
if (ctrl->vs >= NVME_VS(1, 1, 0))
-   memcpy(eui64, id->eui64, sizeof(id->eui64));
+   memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
if (ctrl->vs >= NVME_VS(1, 2, 0))
-   memcpy(nguid, id->nguid, sizeof(id->nguid));
+   memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
if (ctrl->vs >= NVME_VS(1, 3, 0)) {
 /* Don't treat error as fatal we potentially
  * already have a NGUID or EUI-64
  */
-   if (nvme_identify_ns_descs(ctrl, nsid, eui64, nguid, uuid))
+   if (nvme_identify_ns_descs(ctrl, nsid, ids))
dev_warn(ctrl->device,
 "%s: Identify Descriptors failed\n", __func__);
}
 }
 
+static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
+{
+   return uuid_equal(&a->uuid, &b->uuid) &&
+   memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
+   memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
+}
+
 static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 {
struct nvme_ns *ns = disk->private_data;
@@ -1234,8 +1243,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
struct nvme_ns *ns = disk->private_data;
struct nvme_ctrl *ctrl = ns->ctrl;
struct nvme_id_ns *id;
-   u8 eui64[8] = { 0 }, nguid[16] = { 0 };
-   uuid_t uuid = uuid_null;
+   struct nvme_ns_ids ids;
int ret = 0;
 
if (test_bit(NVME_NS_DEAD, >flags)) {
@@ -1252,10 +1260,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
goto out;
}
 
-   nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
-   if (!uuid_equal(&ns->uuid, &uuid) ||
-   memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
-   memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+   nvme_report_ns_ids(ctrl, ns->ns_id, id, &ids);
+   if (!nvme_ns_ids_equal(&ns->ids, &ids)) {
dev_err(ctrl->device,
"identifiers changed for nsid %d\n", ns->ns_id);
ret = -ENODEV;
@@ -2139,18 +2145,19 @@ static ssize_t wwid_show(struct device *dev, struct 
device_attribute *attr,
char *buf)
 {
struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+  

[PATCH 8/9] nvme: track shared namespaces

2017-09-18 Thread Christoph Hellwig
Introduce a new struct nvme_ns_head [1] that holds information about
an actual namespace, unlike struct nvme_ns, which only holds the
per-controller namespace information.  For private namespaces there
is a 1:1 relation of the two, but for shared namespaces this lets us
discover all the paths to it.  For now only the identifiers are moved
to the new structure, but most of the information in struct nvme_ns
should eventually move over.

To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.
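
The removal side of that scheme looks roughly like this (simplified
sketch; the rcu_head field and the callback name are illustrative only):

        static void nvme_free_ns_cb(struct rcu_head *rcu)
        {
                /* final kfree() of the nvme_ns would happen here, via
                 * container_of(rcu, struct nvme_ns, rcu_head) */
        }

        static void nvme_ns_unlink_sketch(struct nvme_ns *ns)
        {
                /* unlink from head->list under the appropriate lock ... */
                list_del_rcu(&ns->siblings);
                /* ... and free only after all current SRCU readers finish */
                call_srcu(&ns->head->srcu, &ns->rcu_head, nvme_free_ns_cb);
        }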

[1] comments welcome if you have a better name for it, the current one is
horrible.  One idea would be to rename the current struct nvme_ns
to struct nvme_ns_link or similar and use the nvme_ns name for the
new structure.  But that would involve a lot of churn.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 193 +--
 drivers/nvme/host/lightnvm.c |  14 ++--
 drivers/nvme/host/nvme.h |  21 -
 3 files changed, 193 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9da538d7ca87..3e8405fd57a9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -247,10 +247,28 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
 
+static void nvme_destroy_ns_head(struct kref *ref)
+{
+   struct nvme_ns_head *head =
+   container_of(ref, struct nvme_ns_head, ref);
+
+   list_del_init(&head->entry);
+   cleanup_srcu_struct(&head->srcu);
+   kfree(head);
+}
+
+static void nvme_put_ns_head(struct nvme_ns_head *head)
+{
+   kref_put(&head->ref, nvme_destroy_ns_head);
+}
+
 static void nvme_free_ns(struct kref *kref)
 {
struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
 
+   if (ns->head)
+   nvme_put_ns_head(ns->head);
+
if (ns->ndev)
nvme_nvm_unregister(ns);
 
@@ -420,7 +438,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 {
memset(cmnd, 0, sizeof(*cmnd));
cmnd->common.opcode = nvme_cmd_flush;
-   cmnd->common.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
@@ -451,7 +469,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, 
struct request *req,
 
memset(cmnd, 0, sizeof(*cmnd));
cmnd->dsm.opcode = nvme_cmd_dsm;
-   cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
cmnd->dsm.nr = cpu_to_le32(segments - 1);
cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
@@ -490,7 +508,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
memset(cmnd, 0, sizeof(*cmnd));
cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-   cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
@@ -971,7 +989,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct 
nvme_user_io __user *uio)
memset(, 0, sizeof(c));
c.rw.opcode = io.opcode;
c.rw.flags = io.flags;
-   c.rw.nsid = cpu_to_le32(ns->ns_id);
+   c.rw.nsid = cpu_to_le32(ns->head->ns_id);
c.rw.slba = cpu_to_le64(io.slba);
c.rw.length = cpu_to_le16(io.nblocks);
c.rw.control = cpu_to_le16(io.control);
@@ -1036,7 +1054,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t 
mode,
switch (cmd) {
case NVME_IOCTL_ID:
force_successful_syscall_return();
-   return ns->ns_id;
+   return ns->head->ns_id;
case NVME_IOCTL_ADMIN_CMD:
return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
case NVME_IOCTL_IO_CMD:
@@ -1196,6 +1214,13 @@ static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, 
unsigned int nsid,
}
 }
 
+static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
+{
+   return !uuid_is_null(&ids->uuid) ||
+   memchr_inv(ids->nguid, 0, sizeof(ids->nguid)) ||
+   memchr_inv(ids->eui64, 0, sizeof(ids->eui64));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1251,7 +1276,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
return -ENODEV;
}
 
-   id = nvme_identify_ns(ctrl, ns->ns_id);
+   id = nvme_identify_ns(ctrl, ns->head->ns_id);
if (!id)
return -ENODEV;
 
@@ -1260,10 +1285,10 @@ static int nvme_revalidate_disk(struct gendisk *disk)
goto out;
}
 
-   nvme_report_ns_ids(ctrl, 

[PATCH 2/9] block: move REQ_NOWAIT

2017-09-18 Thread Christoph Hellwig
This flag should be before the operation-specific REQ_NOUNMAP bit.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blk_types.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a2d2aa709cef..acc2f3cdc2fc 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -224,11 +224,11 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD,   /* read ahead, can fail anytime */
__REQ_BACKGROUND,   /* background IO */
+   __REQ_NOWAIT,   /* Don't wait if request will block */
 
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP,  /* do not free blocks when zeroing */
 
-   __REQ_NOWAIT,   /* Don't wait if request will block */
__REQ_NR_BITS,  /* stops here */
 };
 
@@ -245,9 +245,9 @@ enum req_flag_bits {
 #define REQ_PREFLUSH   (1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
+#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
 
 #define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
-#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
 
 #define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.14.1



[PATCH 3/9] block: add REQ_DRV bit

2017-09-18 Thread Christoph Hellwig
Set aside a bit in the request/bio flags for driver use.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blk_types.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index acc2f3cdc2fc..7ec2ed097a8a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -229,6 +229,9 @@ enum req_flag_bits {
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP,  /* do not free blocks when zeroing */
 
+   /* for driver use */
+   __REQ_DRV,
+
__REQ_NR_BITS,  /* stops here */
 };
 
@@ -249,6 +252,8 @@ enum req_flag_bits {
 
 #define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
 
+#define REQ_DRV(1ULL << __REQ_DRV)
+
 #define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
-- 
2.14.1



nvme multipath support V2

2017-09-18 Thread Christoph Hellwig
Hi all,

this series adds support for multipathing, that is accessing nvme
namespaces through multiple controllers to the nvme core driver.

It is a very thin and efficient implementation that relies on
close cooperation with other bits of the nvme driver, and a few small
and simple block helpers.

Compared to dm-multipath the important differences are how management
of the paths is done, and how the I/O path works.

Management of the paths is fully integrated into the nvme driver,
for each newly found nvme controller we check if there are other
controllers that refer to the same subsystem, and if so we link them
up in the nvme driver.  Then for each namespace found we check if
the namespace id and identifiers match to check if we have multiple
controllers that refer to the same namespaces.  For now path
availability is based entirely on the controller status, which at
least for fabrics will be continuously updated based on the mandatory
keep alive timer.  Once the Asymmetric Namespace Access (ANA)
proposal passes in NVMe we will also get per-namespace states in
addition to that, but for now any details of that remain confidential
to NVMe members.

The I/O path is very different from the existing multipath drivers,
which is enabled by the fact that NVMe (unlike SCSI) does not support
partial completions - a controller will either complete a whole
command or not, but never only complete parts of it.  Because of that
there is no need to clone bios or requests - the I/O path simply
redirects the I/O to a suitable path.  For successful commands
multipath is not in the completion stack at all.  For failed commands
we decide if the error could be a path failure, and if yes remove
the bios from the request structure and requeue them before completing
the request.  All together this means there is no performance
degradation compared to normal nvme operation when using the multipath
device node (at least not until I find a dual ported DRAM backed
device :))
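
To make the failover step concrete, it boils down to something like the
sketch below, built on the blk_steal_bios() and generic_make_request()
helpers; struct mpath_head and the mpath_* names are illustrative
assumptions, not the code from the later patches:

#include <linux/bio.h>
#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct mpath_head {
	struct gendisk		*disk;		/* the shared multipath node */
	spinlock_t		requeue_lock;
	struct bio_list		requeue_list;
	struct work_struct	requeue_work;
};

/* Called for a failed request whose error looks like a path failure. */
static void mpath_failover_req(struct mpath_head *head, struct request *req)
{
	unsigned long flags;

	/* Detach the still-uncompleted bios so the request itself can finish. */
	spin_lock_irqsave(&head->requeue_lock, flags);
	blk_steal_bios(&head->requeue_list, req);
	spin_unlock_irqrestore(&head->requeue_lock, flags);

	blk_mq_end_request(req, BLK_STS_OK);
	schedule_work(&head->requeue_work);
}

/* Re-issue the stolen bios through the shared node on another path. */
static void mpath_requeue_work(struct work_struct *work)
{
	struct mpath_head *head =
		container_of(work, struct mpath_head, requeue_work);
	struct bio *bio, *next;

	spin_lock_irq(&head->requeue_lock);
	next = bio_list_get(&head->requeue_list);
	spin_unlock_irq(&head->requeue_lock);

	while ((bio = next) != NULL) {
		next = bio->bi_next;
		bio->bi_next = NULL;
		/* Point the bio back at the shared node before resubmission. */
		bio->bi_disk = head->disk;
		generic_make_request(bio);
	}
}

Because the bios are detached before the request is ended, path selection
happens again on resubmission and successful commands never see any of
this code.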

There are a couple questions left in the individual patches, comments
welcome.

Note that this series requires the previous series to remove bi_bdev,
in doubt use the git tree below for testing.

A git tree is available at:

   git://git.infradead.org/users/hch/block.git nvme-mpath

gitweb:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-mpath

Changes since V1:
  - introduce new nvme_ns_ids structure to clean up identifier handling
  - generic_make_request_fast is now named direct_make_request and calls
generic_make_request_checks
  - reset bi_disk on resubmission
  - create sysfs links between the existing nvme namespace block devices and
the new share mpath device
  - temporarily added the timeout patches from James, this should go into
nvme-4.14, though


[PATCH 1/9] nvme: allow timed-out ios to retry

2017-09-18 Thread Christoph Hellwig
From: James Smart 

Currently the nvme_req_needs_retry() applies several checks to see if
a retry is allowed. One of those is whether the current time has exceeded
the start time of the io plus the timeout length. This check, if an io
times out, means there is never a retry allowed for the io. Which means
applications see the io failure.

Remove this check and allow the io to time out, like it does on other
protocols, and allow retries to be made.

On the FC transport, a frame can be lost for an individual io, and there
may be no other errors that escalate for the connection/association.
The io will timeout, which causes the transport to escalate into creating
a new association, but the io that timed out, due to this retry logic, has
already failed back to the application and things are hosed.

Signed-off-by: James Smart 
Reviewed-by: Keith Busch 
Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d470f031e27f..5589f67d2cd8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -134,8 +134,6 @@ static inline bool nvme_req_needs_retry(struct request *req)
return false;
if (nvme_req(req)->status & NVME_SC_DNR)
return false;
-   if (jiffies - req->start_time >= req->timeout)
-   return false;
if (nvme_req(req)->retries >= nvme_max_retries)
return false;
return true;
-- 
2.14.1



[PATCH 6/9] nvme: track subsystems

2017-09-18 Thread Christoph Hellwig
This adds a new nvme_subsystem structure so that we can track multiple
controllers that belong to a single subsystem.  For now we only use it
to store the NQN, and to check that we don't have duplicate NQNs unless
the involved subsystems support multiple controllers.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c| 111 
 drivers/nvme/host/fabrics.c |   4 +-
 drivers/nvme/host/nvme.h|  12 -
 3 files changed, 116 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5589f67d2cd8..4291119a6bc9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -71,6 +71,9 @@ MODULE_PARM_DESC(streams, "turn on support for Streams write 
directives");
 struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
+static LIST_HEAD(nvme_subsystems);
+static DEFINE_MUTEX(nvme_subsystems_lock);
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -1736,14 +1739,15 @@ static bool quirk_matches(const struct nvme_id_ctrl *id,
string_matches(id->fr, q->fr, sizeof(id->fr));
 }
 
-static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl 
*ctrl,
+   struct nvme_id_ctrl *id)
 {
size_t nqnlen;
int off;
 
nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE);
if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) {
-   strcpy(ctrl->subnqn, id->subnqn);
+   strcpy(subsys->subnqn, id->subnqn);
return;
}
 
@@ -1751,14 +1755,91 @@ static void nvme_init_subnqn(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
dev_warn(ctrl->device, "missing or invalid SUBNQN field.\n");
 
/* Generate a "fake" NQN per Figure 254 in NVMe 1.3 + ECN 001 */
-   off = snprintf(ctrl->subnqn, NVMF_NQN_SIZE,
+   off = snprintf(subsys->subnqn, NVMF_NQN_SIZE,
"nqn.2014.08.org.nvmexpress:%4x%4x",
le16_to_cpu(id->vid), le16_to_cpu(id->ssvid));
-   memcpy(ctrl->subnqn + off, id->sn, sizeof(id->sn));
+   memcpy(subsys->subnqn + off, id->sn, sizeof(id->sn));
off += sizeof(id->sn);
-   memcpy(ctrl->subnqn + off, id->mn, sizeof(id->mn));
+   memcpy(subsys->subnqn + off, id->mn, sizeof(id->mn));
off += sizeof(id->mn);
-   memset(ctrl->subnqn + off, 0, sizeof(ctrl->subnqn) - off);
+   memset(subsys->subnqn + off, 0, sizeof(subsys->subnqn) - off);
+}
+
+static void nvme_destroy_subsystem(struct kref *ref)
+{
+   struct nvme_subsystem *subsys =
+   container_of(ref, struct nvme_subsystem, ref);
+
+   mutex_lock(&nvme_subsystems_lock);
+   list_del(&subsys->entry);
+   mutex_unlock(&nvme_subsystems_lock);
+
+   kfree(subsys);
+}
+
+static void nvme_put_subsystem(struct nvme_subsystem *subsys)
+{
+   kref_put(&subsys->ref, nvme_destroy_subsystem);
+}
+
+static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
+{
+   struct nvme_subsystem *subsys;
+
+   lockdep_assert_held(&nvme_subsystems_lock);
+
+   list_for_each_entry(subsys, &nvme_subsystems, entry) {
+   if (strcmp(subsys->subnqn, subsysnqn))
+   continue;
+   if (!kref_get_unless_zero(&subsys->ref))
+   continue;
+   return subsys;
+   }
+
+   return NULL;
+}
+
+static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+{
+   struct nvme_subsystem *subsys, *found;
+
+   subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
+   if (!subsys)
+   return -ENOMEM;
+   INIT_LIST_HEAD(&subsys->ctrls);
+   kref_init(&subsys->ref);
+   nvme_init_subnqn(subsys, ctrl, id);
+   mutex_init(&subsys->lock);
+
+   mutex_lock(&nvme_subsystems_lock);
+   found = __nvme_find_get_subsystem(subsys->subnqn);
+   if (found) {
+   /*
+* Verify that the subsystem actually supports multiple
+* controllers, else bail out.
+*/
+   kfree(subsys);
+   if (!(id->cmic & (1 << 1))) {
+   dev_err(ctrl->device,
+   "ignoring ctrl due to duplicate subnqn (%s).\n",
+   found->subnqn);
+   mutex_unlock(&nvme_subsystems_lock);
+   return -EINVAL;
+   }
+
+   subsys = found;
+   } else {
+   list_add_tail(&subsys->entry, &nvme_subsystems);
+   }
+
+   ctrl->subsys = subsys;
+   mutex_unlock(&nvme_subsystems_lock);
+
+   mutex_lock(&subsys->lock);
+   list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
+   mutex_unlock(&subsys->lock);
+
+   return 0;
 }
 
 /*
@@ -1796,7 +1877,11 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
return -EIO;
}
 
-   nvme_init_subnqn(ctrl, id);
+   ret = 

[PATCH 4/9] block: provide a direct_make_request helper

2017-09-18 Thread Christoph Hellwig
This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion preventions.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c   | 32 
 include/linux/blkdev.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index aebe676225e6..f02eb346c910 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2238,6 +2238,38 @@ blk_qc_t generic_make_request(struct bio *bio)
 }
 EXPORT_SYMBOL(generic_make_request);
 
+/**
+ * direct_make_request - hand a buffer directly to its device driver for I/O
+ * @bio:  The bio describing the location in memory and on the device.
+ *
+ * This function behaves like generic_make_request(), but does not protect
+ * against recursion.  Must only be used if the called driver is known
+ * to not re-issue this bio or a split to itself.
+ */
+blk_qc_t direct_make_request(struct bio *bio)
+{
+   struct request_queue *q = bio->bi_disk->queue;
+   bool nowait = bio->bi_opf & REQ_NOWAIT;
+   blk_qc_t ret;
+
+   if (!generic_make_request_checks(bio))
+   return BLK_QC_T_NONE;
+
+   if (unlikely(blk_queue_enter(q, nowait))) {
+   if (nowait && !blk_queue_dying(q))
+   bio->bi_status = BLK_STS_AGAIN;
+   else
+   bio->bi_status = BLK_STS_IOERR;
+   bio_endio(bio);
+   return BLK_QC_T_NONE;
+   }
+
+   ret = q->make_request_fn(q, bio);
+   blk_queue_exit(q);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(direct_make_request);
+
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The  bio which describes the I/O
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 460294bb0fa5..86dfeea16d4c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -935,6 +935,7 @@ do {
\
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 extern blk_qc_t generic_make_request(struct bio *bio);
+extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
-- 
2.14.1



[PATCH 5/9] block: add a blk_steal_bios helper

2017-09-18 Thread Christoph Hellwig
This helper allows stealing the uncompleted bios from a request so that
they can be reissued on another path.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
---
 block/blk-core.c   | 20 
 include/linux/blkdev.h |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index f02eb346c910..6f3db2c14843 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2724,6 +2724,26 @@ struct request *blk_fetch_request(struct request_queue 
*q)
 }
 EXPORT_SYMBOL(blk_fetch_request);
 
+/*
+ * Steal bios from a request.  The request must not have been partially
+ * completed before.
+ */
+void blk_steal_bios(struct bio_list *list, struct request *rq)
+{
+   if (rq->bio) {
+   if (list->tail)
+   list->tail->bi_next = rq->bio;
+   else
+   list->head = rq->bio;
+   list->tail = rq->biotail;
+   }
+
+   rq->bio = NULL;
+   rq->biotail = NULL;
+   rq->__data_len = 0;
+}
+EXPORT_SYMBOL_GPL(blk_steal_bios);
+
 /**
  * blk_update_request - Special helper function for request stacking drivers
  * @req:  the request being processed
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 86dfeea16d4c..0ca5525ee5ce 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1109,6 +1109,8 @@ extern struct request *blk_peek_request(struct 
request_queue *q);
 extern void blk_start_request(struct request *rq);
 extern struct request *blk_fetch_request(struct request_queue *q);
 
+void blk_steal_bios(struct bio_list *list, struct request *rq);
+
 /*
  * Request completion related functions.
  *
-- 
2.14.1



Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote:
> > Are you sure that scenario can happen? The blk-mq core calls 
> > test_and_set_bit()
> > for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
> > called. See also blk_mark_rq_complete(). This avoids that the .complete() 
> > and
> > .timeout() functions run concurrently.
> 
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions
> from running at the same time as the timeout handler.

Can you explain how you define "request is lost"? If a timeout occurs for a
SCSI request then scsi_times_out() calls scsi_abort_command() (if no
.eh_timed_out() callback has been defined by the LLD). It is the responsibility
of the SCSI LLD to call .scsi_done() before its .eh_abort_handler() returns
SUCCESS. If .eh_abort_handler() returns a value other than SUCCESS then the
SCSI core will escalate the error further until .scsi_done() has been called for
the command that timed out. See also scsi_abort_eh_cmnd(). So I think what you
wrote is not correct for the SCSI core and a properly implemented SCSI LLD. 

Bart.

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
> > The nvme driver's use of blk_mq_reinit_tagset only happens during
> > controller initialisation, but I'm seeing lost commands well after that
> > during normal and stable running.
> > 
> > The timing is pretty narrow to hit, but I'm pretty sure this is what's
> > happening. For nvme, this occurs when nvme_timeout() runs concurrently
> > with nvme_handle_cqe() for the same struct request. For scsi-mq,
> > the same situation may arise if scsi_mq_done() runs concurrently with
> > scsi_times_out().
> 
> Hello Keith,
> 
> Are you sure that scenario can happen? The blk-mq core calls 
> test_and_set_bit()
> for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
> called. See also blk_mark_rq_complete(). This avoids that the .complete() and
> .timeout() functions run concurrently.

Indeed that prevents .complete from running concurrently with the
timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
callbacks. These are the LLD functions that call blk_mq_complete_request
well before .complete. If the driver calls blk_mq_complete_request on
a request that blk-mq is timing out, the request is lost because blk-mq
already called blk_mark_rq_complete. Nothing prevents these LLD functions
from running at the same time as the timeout handler.
 
> In case this wouldn't be clear, a block driver must not call
> blk_mq_end_request() after the timeout handler has finished because that would
> trigger a use-after-free of a request structure.
>
> I noticed that your patch includes changes for blk_mq_start_request(). No
> timeout or completion handler should be running while blk_mq_start_request() 
> is
> being executed. If these changes make a difference in your tests then I think
> that means that there is something wrong with the NVMe driver.

The reason for changing blk_mq_start_request was because of how the
requeue was clearing STARTED. I had to remove that since it would have
otherwise been impossible for the blk_mq_rq_timed_out to know if the
request was requeued vs. completed.
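
For what it's worth, the arbitration can be modelled in userspace as below;
this is only a sketch under the assumption that the atomic_flag stands in
for REQ_ATOM_COMPLETE / blk_mark_rq_complete(), none of it is kernel code.
Whichever thread wins the test-and-set owns the request, and the loser's
completion is simply dropped, which is the lost-request window described
above (build with gcc -pthread):

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_flag rq_complete = ATOMIC_FLAG_INIT;

/* Models blk_mq_complete_request() called from the LLD IRQ path. */
static void *lld_completion(void *arg)
{
	if (!atomic_flag_test_and_set(&rq_complete))
		puts("LLD completion wins: request completed normally");
	else
		puts("LLD completion dropped: timeout handler owns the bit");
	return NULL;
}

/* Models the blk-mq timeout code marking the request complete. */
static void *timeout_handler(void *arg)
{
	if (!atomic_flag_test_and_set(&rq_complete))
		puts("timeout handler wins: .timeout() decides the fate");
	else
		puts("timeout path skipped: already completed");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, lld_completion, NULL);
	pthread_create(&b, NULL, timeout_handler, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}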


Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
> The nvme driver's use of blk_mq_reinit_tagset only happens during
> controller initialisation, but I'm seeing lost commands well after that
> during normal and stable running.
> 
> The timing is pretty narrow to hit, but I'm pretty sure this is what's
> happening. For nvme, this occurs when nvme_timeout() runs concurrently
> with nvme_handle_cqe() for the same struct request. For scsi-mq,
> the same situation may arise if scsi_mq_done() runs concurrently with
> scsi_times_out().

Hello Keith,

Are you sure that scenario can happen? The blk-mq core calls test_and_set_bit()
for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
called. See also blk_mark_rq_complete(). This avoids that the .complete() and
.timeout() functions run concurrently.

In case this wouldn't be clear, a block driver must not call
blk_mq_end_request() after the timeout handler has finished because that would
trigger a use-after-free of a request structure.

I noticed that your patch includes changes for blk_mq_start_request(). No
timeout or completion handler should be running while blk_mq_start_request() is
being executed. If these changes make a difference in your tests then I think
that means that there is something wrong with the NVMe driver.

Bart.

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 10:07:58PM +, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote:
> > I think we've always known it's possible to lose a request during timeout
> > handling, but just accepted that possibility. It seems to be causing
> > problems, though, leading to unnecessary error escalation and IO failures.
> > 
> > The possibility arises when the block layer marks the request complete
> > prior to running the timeout handler. If that request happens to complete
> > while the handler is running, the request will be lost, inevitably
> > triggering a second timeout.
> > 
> > This patch attempts to shorten the window for this race condition by
> > clearing the started flag when the driver completes a request. The block
> > layer's timeout handler will then complete the command if it observes
> > the started flag is no longer set.
> > 
> > Note it's possible to lose the command even with this patch. It's just
> > less likely to happen.
> 
> Hello Keith,
> 
> Are you sure the root cause of this race condition is in the blk-mq core?
> I've never observed such behavior in any of my numerous scsi-mq tests (which
> trigger timeouts). Are you sure the race you observed is not caused by a
> blk_mq_reinit_tagset() call, a function that is only used by the NVMe driver
> and not by any other block driver?

Hi Bart,

The nvme driver's use of blk_mq_reinit_tagset only happens during
controller initialisation, but I'm seeing lost commands well after that
during normal and stable running.

The timing is pretty narrow to hit, but I'm pretty sure this is what's
happening. For nvme, this occurs when nvme_timeout() runs concurrently
with nvme_handle_cqe() for the same struct request. For scsi-mq,
the same situation may arise if scsi_mq_done() runs concurrently with
scsi_times_out().

I don't really like the proposed "fix" though since it only makes it
less likely, but I didn't see a way to close that without introducing
locks. If someone knows of a better way, that would be awesome.

Thanks,
Keith


Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote:
> I think we've always known it's possible to lose a request during timeout
> handling, but just accepted that possibility. It seems to be causing
> problems, though, leading to unnecessary error escalation and IO failures.
> 
> The possibility arises when the block layer marks the request complete
> prior to running the timeout handler. If that request happens to complete
> while the handler is running, the request will be lost, inevitably
> triggering a second timeout.
> 
> This patch attempts to shorten the window for this race condition by
> clearing the started flag when the driver completes a request. The block
> layer's timeout handler will then complete the command if it observes
> the started flag is no longer set.
> 
> Note it's possible to lose the command even with this patch. It's just
> less likely to happen.

Hello Keith,

Are you sure the root cause of this race condition is in the blk-mq core?
I've never observed such behavior in any of my numerous scsi-mq tests (which
trigger timeouts). Are you sure the race you observed is not caused by a
blk_mq_reinit_tagset() call, a function that is only used by the NVMe driver
and not by any other block driver?

Bart.

[RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
I think we've always known it's possible to lose a request during timeout
handling, but just accepted that possibility. It seems to be causing
problems, though, leading to unnecessary error escalation and IO failures.

The possibility arises when the block layer marks the request complete
prior to running the timeout handler. If that request happens to complete
while the handler is running, the request will be lost, inevitably
triggering a second timeout.

This patch attempts to shorten the window for this race condition by
clearing the started flag when the driver completes a request. The block
layer's timeout handler will then complete the command if it observes
the started flag is no longer set.

Note it's possible to lose the command even with this patch. It's just
less likely to happen.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c | 37 +++--
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a1860..37144ef 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -566,6 +566,7 @@ void blk_mq_complete_request(struct request *rq)
 
if (unlikely(blk_should_fake_timeout(q)))
return;
+   clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
if (!blk_mark_rq_complete(rq))
__blk_mq_complete_request(rq);
 }
@@ -605,19 +606,19 @@ void blk_mq_start_request(struct request *rq)
 * complete. So be sure to clear complete again when we start
 * the request, otherwise we'll ignore the completion event.
 */
-   if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+   if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+   if (q->dma_drain_size && blk_rq_bytes(rq)) {
+   /*
+* Make sure space for the drain appears.  We know we can do
+* this because max_hw_segments has been adjusted to be one
+* fewer than the device can handle.
+*/
+   rq->nr_phys_segments++;
+   }
+   }
if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-
-   if (q->dma_drain_size && blk_rq_bytes(rq)) {
-   /*
-* Make sure space for the drain appears.  We know we can do
-* this because max_hw_segments has been adjusted to be one
-* fewer than the device can handle.
-*/
-   rq->nr_phys_segments++;
-   }
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
@@ -637,11 +638,6 @@ static void __blk_mq_requeue_request(struct request *rq)
trace_block_rq_requeue(q, rq);
wbt_requeue(q->rq_wb, &rq->issue_stat);
blk_mq_sched_requeue_request(rq);
-
-   if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-   if (q->dma_drain_size && blk_rq_bytes(rq))
-   rq->nr_phys_segments--;
-   }
 }
 
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
@@ -763,10 +759,15 @@ void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   blk_add_timer(req);
-   blk_clear_rq_complete(req);
-   break;
+   if (test_bit(REQ_ATOM_STARTED, &req->atomic_flags)) {
+   blk_add_timer(req);
+   blk_clear_rq_complete(req);
+   break;
+   }
+   /* Fall through */
case BLK_EH_NOT_HANDLED:
+   if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
+   __blk_mq_complete_request(req);
break;
default:
printk(KERN_ERR "block: bad eh return: %d\n", ret);
-- 
2.5.5



Re: [PATCH v6 0/2] blktrace: Fix deadlock problem

2017-09-18 Thread Steven Rostedt

Acked-by: Steven Rostedt (VMware) 

for the series.

Jens, feel free to take this in your tree.

-- Steve


On Mon, 18 Sep 2017 14:53:49 -0400
Waiman Long  wrote:

>  v6:
>   - Add a second patch to rename the bd_fsfreeze_mutex to
> bd_fsfreeze_blktrace_mutex.
> 
>  v5:
>   - Overload the bd_fsfreeze_mutex in block_device structure for
> blktrace protection.
> 
>  v4:
>   - Use blktrace_mutex in blk_trace_ioctl() as well.
> 
>  v3:
>   - Use a global blktrace_mutex to serialize sysfs attribute accesses
> instead of the bd_mutex.
> 
>  v2:
>   - Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.
>   - Check for signal in the mutex_trylock loops.
>   - Use usleep() instead of schedule() for RT tasks.
> 
> This patchset fixes a potential blktrace deadlock problem between
> block device deletion and sysfs operations.
> 
> Waiman Long (2):
>   blktrace: Fix potential deadlock between delete & sysfs ops
>   block_dev: Rename bd_fsfreeze_mutex
> 
>  fs/block_dev.c  | 14 +++---
>  fs/gfs2/ops_fstype.c|  6 +++---
>  fs/nilfs2/super.c   |  6 +++---
>  fs/super.c  |  6 +++---
>  include/linux/fs.h  |  5 +++--
>  kernel/trace/blktrace.c | 26 --
>  6 files changed, 39 insertions(+), 24 deletions(-)
> 



[PATCH V3] block/nbd: add WQ_UNBOUND to the knbd-recv workqueue

2017-09-18 Thread Dan Melnic
Add WQ_UNBOUND to the knbd-recv workqueue so we're not bound
to a single CPU that is selected at device creation time.

Signed-off-by: Dan Melnic 
Reviewed-by: Josef Bacik 
---
 drivers/block/nbd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 2aa87cb..f5d8c9a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -2080,7 +2080,8 @@ static int __init nbd_init(void)
if (nbds_max > 1UL << (MINORBITS - part_shift))
return -EINVAL;
recv_workqueue = alloc_workqueue("knbd-recv",
-WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+WQ_MEM_RECLAIM | WQ_HIGHPRI |
+WQ_UNBOUND, 0);
if (!recv_workqueue)
return -ENOMEM;
 
-- 
2.9.5



Re: [PATCH v2] fs: pass the write life time hint to the mapped filesystem

2017-09-18 Thread Michael Moy

Thanks, I see that now.

What is the file hint using F_SET_FILE_RW_HINT used for?

It seems that if both are set, the one set first gets used
and, if only the file hint is set, it is not used at all.


On 9/18/2017 9:49 AM, Christoph Hellwig wrote:

On Mon, Sep 18, 2017 at 09:45:57AM -0600, Michael Moy wrote:

The write hint needs to be copied to the mapped filesystem
so it can be passed down to the nvme device driver.

v2: fix tabs in the email

If you want the write hint for buffered I/O you need to set it on the
inode using F_SET_RW_HINT.

With your patch we'd magically move the file hint to the inode hint
on each write.






[PATCH V2] block/nbd: add WQ_UNBOUND to the knbd-recv workqueue

2017-09-18 Thread Dan Melnic
Add WQ_UNBOUND to the knbd-recv workqueue so we're not bound
to a single CPU that is selected at device creation time.

Signed-off-by: Dan Melnic 
---
 drivers/block/nbd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 2aa87cb..ac3c5ecc 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -2080,7 +2080,9 @@ static int __init nbd_init(void)
if (nbds_max > 1UL << (MINORBITS - part_shift))
return -EINVAL;
recv_workqueue = alloc_workqueue("knbd-recv",
-WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+WQ_MEM_RECLAIM | WQ_HIGHPRI |
+WQ_UNBOUND,
+0);
if (!recv_workqueue)
return -ENOMEM;
 
-- 
2.9.5



Re: [PATCH V2] block/nbd: add WQ_UNBOUND to the knbd-recv workqueue

2017-09-18 Thread Josef Bacik
On Mon, Sep 18, 2017 at 12:56:17PM -0700, Dan Melnic wrote:
> Add WQ_UNBOUND to the knbd-recv workqueue so we're not bound
> to a single CPU that is selected at device creation time.
> 
> Signed-off-by: Dan Melnic 
> ---
>  drivers/block/nbd.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 2aa87cb..ac3c5ecc 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -2080,7 +2080,9 @@ static int __init nbd_init(void)
>   if (nbds_max > 1UL << (MINORBITS - part_shift))
>   return -EINVAL;
>   recv_workqueue = alloc_workqueue("knbd-recv",
> -  WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
> +  WQ_MEM_RECLAIM | WQ_HIGHPRI |
> +  WQ_UNBOUND,
> +  0);

This can go on the line above.  You can add

Reviewed-by: Josef Bacik 

once you've made that adjustment.  Thanks,

Josef


[PATCH v6 1/2] blktrace: Fix potential deadlock between delete & sysfs ops

2017-09-18 Thread Waiman Long
The lockdep code had reported the following unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#228);
                               lock(&bdev->bd_mutex/1);
                               lock(s_active#228);
  lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.

The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, requires
waiting until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opened sysfs or device
file. That should prevent the underlying block structure from being
removed.

Instead of using bd_mutex in the block_device structure, the existing
bd_fsfreeze_mutex in the block_device structure is now overloaded
to protect against concurrent blktrace data access in the blktrace.c
file. There is no point in adding one more mutex to the block_device
structure just for blktrace.

Signed-off-by: Waiman Long 
---
 include/linux/fs.h  |  2 +-
 kernel/trace/blktrace.c | 26 --
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e737..330b572 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -448,7 +448,7 @@ struct block_device {
 
/* The counter of freeze processes */
int bd_fsfreeze_count;
-   /* Mutex for freeze */
+   /* Mutex for freeze and blktrace */
struct mutex            bd_fsfreeze_mutex;
 } __randomize_layout;
 
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 2a685b4..7cd5d1d 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -648,6 +648,20 @@ int blk_trace_startstop(struct request_queue *q, int start)
 }
 EXPORT_SYMBOL_GPL(blk_trace_startstop);
 
+/*
+ * When reading or writing the blktrace sysfs files, the references to the
+ * opened sysfs or device files should prevent the underlying block device
+ * from being removed. So no further delete protection is really needed.
+ *
+ * Protection from multiple readers and writers accessing blktrace data
+ * concurrently is still required. The bd_mutex was used for this purpose.
+ * That could lead to deadlock with concurrent block device deletion and
+ * sysfs access. Instead, the block device bd_fsfreeze_mutex is now
+ * overloaded for blktrace data protection. Like freeze/thaw, blktrace
+ * functionality is not frequently used. There is no point in adding
+ * one more mutex to the block_device structure just for blktrace.
+ */
+
 /**
  * blk_trace_ioctl: - handle the ioctls associated with tracing
  * @bdev:  the block device
@@ -665,7 +679,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned 
cmd, char __user *arg)
if (!q)
return -ENXIO;
 
-   mutex_lock(&bdev->bd_mutex);
+   mutex_lock(&bdev->bd_fsfreeze_mutex);
 
switch (cmd) {
case BLKTRACESETUP:
@@ -691,7 +705,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned 
cmd, char __user *arg)
break;
}
 
-   mutex_unlock(&bdev->bd_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_mutex);
return ret;
 }
 
@@ -1727,7 +1741,7 @@ static ssize_t sysfs_blk_trace_attr_show(struct device 
*dev,
if (q == NULL)
goto out_bdput;
 
-   mutex_lock(&bdev->bd_mutex);
+   mutex_lock(&bdev->bd_fsfreeze_mutex);
 
if (attr == &dev_attr_enable) {
ret = sprintf(buf, "%u\n", !!q->blk_trace);
@@ -1746,7 +1760,7 @@ static ssize_t sysfs_blk_trace_attr_show(struct device 
*dev,
ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
 
 out_unlock_bdev:
-   mutex_unlock(&bdev->bd_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_mutex);
 out_bdput:
bdput(bdev);
 out:
@@ -1788,7 +1802,7 @@ static ssize_t sysfs_blk_trace_attr_store(struct device 
*dev,
if (q == NULL)
goto out_bdput;
 
-   mutex_lock(&bdev->bd_mutex);
+   mutex_lock(&bdev->bd_fsfreeze_mutex);
 
if (attr == &dev_attr_enable) {
if (value)
@@ -1814,7 +1828,7 @@ static ssize_t sysfs_blk_trace_attr_store(struct device 
*dev,
}
 
 out_unlock_bdev:
-   mutex_unlock(&bdev->bd_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_mutex);
 out_bdput:
bdput(bdev);
 out:
-- 
1.8.3.1



[PATCH v6 2/2] block_dev: Rename bd_fsfreeze_mutex

2017-09-18 Thread Waiman Long
As the bd_fsfreeze_mutex is used by the blktrace subsystem as well,
it is now renamed to bd_fsfreeze_blktrace_mutex to better reflect
its purpose.

Signed-off-by: Waiman Long 
---
 fs/block_dev.c  | 14 +++---
 fs/gfs2/ops_fstype.c|  6 +++---
 fs/nilfs2/super.c   |  6 +++---
 fs/super.c  |  6 +++---
 include/linux/fs.h  |  5 +++--
 kernel/trace/blktrace.c | 14 +++---
 6 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088f..3dea006 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -504,7 +504,7 @@ struct super_block *freeze_bdev(struct block_device *bdev)
struct super_block *sb;
int error = 0;
 
-   mutex_lock(&bdev->bd_fsfreeze_mutex);
+   mutex_lock(&bdev->bd_fsfreeze_blktrace_mutex);
if (++bdev->bd_fsfreeze_count > 1) {
/*
 * We don't even need to grab a reference - the first call
@@ -514,7 +514,7 @@ struct super_block *freeze_bdev(struct block_device *bdev)
sb = get_super(bdev);
if (sb)
drop_super(sb);
-   mutex_unlock(&bdev->bd_fsfreeze_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_blktrace_mutex);
return sb;
}
 
@@ -528,13 +528,13 @@ struct super_block *freeze_bdev(struct block_device *bdev)
if (error) {
deactivate_super(sb);
bdev->bd_fsfreeze_count--;
-   mutex_unlock(&bdev->bd_fsfreeze_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_blktrace_mutex);
return ERR_PTR(error);
}
deactivate_super(sb);
  out:
sync_blockdev(bdev);
-   mutex_unlock(&bdev->bd_fsfreeze_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_blktrace_mutex);
return sb;  /* thaw_bdev releases s->s_umount */
 }
 EXPORT_SYMBOL(freeze_bdev);
@@ -550,7 +550,7 @@ int thaw_bdev(struct block_device *bdev, struct super_block 
*sb)
 {
int error = -EINVAL;
 
-   mutex_lock(&bdev->bd_fsfreeze_mutex);
+   mutex_lock(&bdev->bd_fsfreeze_blktrace_mutex);
if (!bdev->bd_fsfreeze_count)
goto out;
 
@@ -568,7 +568,7 @@ int thaw_bdev(struct block_device *bdev, struct super_block 
*sb)
if (error)
bdev->bd_fsfreeze_count++;
 out:
-   mutex_unlock(&bdev->bd_fsfreeze_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_blktrace_mutex);
return error;
 }
 EXPORT_SYMBOL(thaw_bdev);
@@ -767,7 +767,7 @@ static void init_once(void *foo)
bdev->bd_bdi = &noop_backing_dev_info;
inode_init_once(&ei->vfs_inode);
/* Initialize mutex for freeze. */
-   mutex_init(&bdev->bd_fsfreeze_mutex);
+   mutex_init(&bdev->bd_fsfreeze_blktrace_mutex);
 }
 
 static void bdev_evict_inode(struct inode *inode)
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index a3711f5..5664905 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1269,14 +1269,14 @@ static struct dentry *gfs2_mount(struct 
file_system_type *fs_type, int flags,
 * will protect the lockfs code from trying to start a snapshot
 * while we are mounting
 */
-   mutex_lock(&bdev->bd_fsfreeze_mutex);
+   mutex_lock(&bdev->bd_fsfreeze_blktrace_mutex);
if (bdev->bd_fsfreeze_count > 0) {
-   mutex_unlock(&bdev->bd_fsfreeze_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_blktrace_mutex);
error = -EBUSY;
goto error_bdev;
}
s = sget(fs_type, test_gfs2_super, set_gfs2_super, flags, bdev);
-   mutex_unlock(&bdev->bd_fsfreeze_mutex);
+   mutex_unlock(&bdev->bd_fsfreeze_blktrace_mutex);
error = PTR_ERR(s);
if (IS_ERR(s))
goto error_bdev;
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 4fc018d..931b455 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1306,15 +1306,15 @@ static int nilfs_test_bdev_super(struct super_block *s, 
void *data)
 * will protect the lockfs code from trying to start a snapshot
 * while we are mounting
 */
-   mutex_lock(&sd.bdev->bd_fsfreeze_mutex);
+   mutex_lock(&sd.bdev->bd_fsfreeze_blktrace_mutex);
if (sd.bdev->bd_fsfreeze_count > 0) {
-   mutex_unlock(&sd.bdev->bd_fsfreeze_mutex);
+   mutex_unlock(&sd.bdev->bd_fsfreeze_blktrace_mutex);
err = -EBUSY;
goto failed;
}
s = sget(fs_type, nilfs_test_bdev_super, nilfs_set_bdev_super, flags,
 sd.bdev);
-   mutex_unlock(&sd.bdev->bd_fsfreeze_mutex);
+   mutex_unlock(&sd.bdev->bd_fsfreeze_blktrace_mutex);
if (IS_ERR(s)) {
err = PTR_ERR(s);
goto failed;
diff --git a/fs/super.c b/fs/super.c
index 166c4ee..079890f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1083,15 +1083,15 @@ struct dentry *mount_bdev(struct file_system_type 
*fs_type,
 * will protect the lockfs code from trying to start a snapshot
 * while we are mounting
 */
-   mutex_lock(&bdev->bd_fsfreeze_mutex);
+   

[PATCH v6 0/2] blktrace: Fix deadlock problem

2017-09-18 Thread Waiman Long
 v6:
  - Add a second patch to rename the bd_fsfreeze_mutex to
bd_fsfreeze_blktrace_mutex.

 v5:
  - Overload the bd_fsfreeze_mutex in block_device structure for
blktrace protection.

 v4:
  - Use blktrace_mutex in blk_trace_ioctl() as well.

 v3:
  - Use a global blktrace_mutex to serialize sysfs attribute accesses
instead of the bd_mutex.

 v2:
  - Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.
  - Check for signal in the mutex_trylock loops.
  - Use usleep() instead of schedule() for RT tasks.

This patchset fixes a potential blktrace deadlock problem between
block device deletion and sysfs operations.

Waiman Long (2):
  blktrace: Fix potential deadlock between delete & sysfs ops
  block_dev: Rename bd_fsfreeze_mutex

 fs/block_dev.c  | 14 +++---
 fs/gfs2/ops_fstype.c|  6 +++---
 fs/nilfs2/super.c   |  6 +++---
 fs/super.c  |  6 +++---
 include/linux/fs.h  |  5 +++--
 kernel/trace/blktrace.c | 26 --
 6 files changed, 39 insertions(+), 24 deletions(-)

-- 
1.8.3.1



[PATCH] block/nbd: add WQ_UNBOUND to the knbd-recv workqueue

2017-09-18 Thread Dan Melnic
Add WQ_UNBOUND to the knbd-recv workqueue so we're not bound to a single CPU
that is selected at device creation time.

Signed-off-by: Dan Melnic 
---
 drivers/block/nbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 2aa87cb..aa35e6e 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -2080,7 +2080,7 @@ static int __init nbd_init(void)
if (nbds_max > 1UL << (MINORBITS - part_shift))
return -EINVAL;
recv_workqueue = alloc_workqueue("knbd-recv",
-WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
if (!recv_workqueue)
return -ENOMEM;
 
-- 
2.9.5



Re: [PATCH v5] blktrace: Fix potential deadlock between delete & sysfs ops

2017-09-18 Thread Bart Van Assche
On Sat, 2017-09-16 at 19:37 -0700, Waiman Long wrote:
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 339e737..330b572 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -448,7 +448,7 @@ struct block_device {
>  
>   /* The counter of freeze processes */
>   int bd_fsfreeze_count;
> - /* Mutex for freeze */
> + /* Mutex for freeze and blktrace */
>   struct mutex            bd_fsfreeze_mutex;
>  } __randomize_layout;

This patch changes the meaning of bd_fsfreeze_mutex significantly. Please rename
that mutex such that its name matches its new role.

Thanks,

Bart.

[PATCH v2] fs: pass the write life time hint to the mapped filesystem

2017-09-18 Thread Michael Moy

The write hint needs to be copied to the mapped filesystem
so it can be passed down to the nvme device driver.

v2: fix tabs in the email

Signed-off-by: Michael Moy 
---
  mm/filemap.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 870971e..f9849b8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3008,6 +3008,16 @@ ssize_t generic_perform_write(struct file *file,
break;
}

+   /*
+* Copy a write hint to the mapped filesystem.
+*
+* Use the File hint first, inode next.
+*/
+   if (file->f_write_hint)
+   mapping->host->i_write_hint = file->f_write_hint;
+   else if (file->f_inode->i_write_hint)
+   mapping->host->i_write_hint = 
file->f_inode->i_write_hint;
+
status = a_ops->write_begin(file, mapping, pos, bytes, flags,
, );
if (unlikely(status < 0))
--
1.8.3.1



Re: [PATCH v2] fs: pass the write life time hint to the mapped filesystem

2017-09-18 Thread Christoph Hellwig
On Mon, Sep 18, 2017 at 09:45:57AM -0600, Michael Moy wrote:
> The write hint needs to be copied to the mapped filesystem
> so it can be passed down to the nvme device driver.
>
> v2: fix tabs in the email

If you want the write hint for buffered I/O you need to set it on the
inode using F_SET_RW_HINT.

With your patch we'd magically move the file hint to the inode hint
on each write.
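
For anyone wanting to experiment, a minimal userspace example of setting the
inode hint is below.  It is a sketch only: the fcntl values are the ones from
include/uapi/linux/fcntl.h and are defined locally in case older libc headers
lack them.  F_SET_RW_HINT sets the per-inode hint that buffered writeback
uses, while F_SET_FILE_RW_HINT only affects that one file description:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE	1024
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)	/* inode hint */
#endif
#ifndef F_SET_FILE_RW_HINT
#define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)	/* file hint */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2
#endif

int main(int argc, char **argv)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	const char *path = argc > 1 ? argv[1] : "hintfile";
	int fd = open(path, O_WRONLY | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Inode-level hint: shared by every open of this file. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");

	if (write(fd, "data\n", 5) < 0)
		perror("write");

	close(fd);
	return 0;
}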


Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-18 Thread Bart Van Assche
On Sun, 2017-09-17 at 20:40 +0800, Ming Lei wrote:
> "if no request has completed before the delay has expired" can't be a
> reason to rerun the queue, because the queue can still be busy.

That statement of you shows that there are important aspects of the SCSI
core and dm-mpath driver that you don't understand.

> I suggest to understand the root cause, instead of keeping this
> ugly random delay because run hw queue after 100ms may be useless
> in 99.99% times.

If you are still looking at removing the blk_mq_delay_run_hw_queue() calls
then I think you are looking in the wrong direction. What kind of problem
are you trying to solve? Is it perhaps that there can be a delay between
dm-mpath request completion and the queueing of a new request? If so,
adding a queue run call into the dm-mpath end_io callback is probably
sufficient and probably can replace this entire patch series.

Bart.

Re: compile nvme not find bi_disk

2017-09-18 Thread Jens Axboe
On 09/18/2017 07:47 AM, Tony Yang wrote:
> Dear All
> 
>   I'm compiling nvme, but encountered the following error, how can I
> solve it? Thanks
> 
>  CHK include/generated/compile.h
>   CC [M]  drivers/nvme/host/core.o
> drivers/nvme/host/core.c: In function ‘__nvme_submit_user_cmd’:
> drivers/nvme/host/core.c:631: error: ‘struct bio’ has no member named 
> ‘bi_disk’
> make[3]: *** [drivers/nvme/host/core.o] Error 1
> make[2]: *** [drivers/nvme/host] Error 2
> make[1]: *** [drivers/nvme] Error 2
> make: *** [drivers] Error 2
> [root@cescel03 nvme-nvme-4.13]# bi_disk

You're not running a mainline version of the kernel, looks like you're
trying to compile a newer version of the nvme driver on an older kernel
version.  That's not going to work.

-- 
Jens Axboe



compile nvme not find bi_disk

2017-09-18 Thread Tony Yang
Dear All

  I'm compiling nvme, but encountered the following error, how can I
solve it? Thanks

 CHK include/generated/compile.h
  CC [M]  drivers/nvme/host/core.o
drivers/nvme/host/core.c: In function ‘__nvme_submit_user_cmd’:
drivers/nvme/host/core.c:631: error: ‘struct bio’ has no member named ‘bi_disk’
make[3]: *** [drivers/nvme/host/core.o] Error 1
make[2]: *** [drivers/nvme/host] Error 2
make[1]: *** [drivers/nvme] Error 2
make: *** [drivers] Error 2
[root@cescel03 nvme-nvme-4.13]# bi_disk


Re: [PATCH] lightnvm: remove already calculated nr_chnls

2017-09-18 Thread Javier González
> On 17 Sep 2017, at 23.04, Rakesh Pandit  wrote:
> 
> Remove repeated calculation for number of channels while creating a
> target device.
> 
> Signed-off-by: Rakesh Pandit 
> ---
> 
> This is also a trivial change I found while investigating/working on
> other issue.
> 
> drivers/lightnvm/core.c | 1 -
> 1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
> index 1b8338d..01536b8 100644
> --- a/drivers/lightnvm/core.c
> +++ b/drivers/lightnvm/core.c
> @@ -139,7 +139,6 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
> nvm_dev *dev,
>   int prev_nr_luns;
>   int i, j;
> 
> - nr_chnls = nr_luns / dev->geo.luns_per_chnl;
>   nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
> 
>   dev_map = kmalloc(sizeof(struct nvm_dev_map), GFP_KERNEL);
> --
> 2.7.4

We wanted to make sure that nr_chnls was really, really set :)

Reviewed-by: Javier González 


