Re: [PATCH V2 6/8] block: allow to allocate req with REQF_PREEMPT when queue is frozen

2017-09-01 Thread Ming Lei
On Fri, Sep 01, 2017 at 08:07:04PM +, Bart Van Assche wrote:
> On Sat, 2017-09-02 at 02:49 +0800, Ming Lei wrote:
> > +   if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_freezing(q))
> > +   blk_queue_enter_live(q);
> > +   else
> > +   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
> 
> Why did you repost this patch series without trying to reach an agreement
> about the approach?
> 
> Anyway, because of the unsafe blk_queue_enter_live() call introduced by this
> patch, please add the following to the description of this patch whenever you
> repost it:
> 
> NAK-ed-by: Bart Van Assche 

Did you take a look at patch 7 or the cover letter? In that patch,
if a preempt freeze is in progress, no other freeze can be started
at all, so it is definitely safe to use blk_queue_enter_live()
here, isn't it?

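Schematically, the combination of patch 6 and patch 7 is the following
(a simplified sketch, not the literal patch code):

	/* allocation side (patch 6) */
	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_freezing(q))
		/*
		 * Safe because patch 7 guarantees that no other freeze,
		 * including the one run by blk_cleanup_queue(), can start
		 * while a preempt freeze is in progress.
		 */
		blk_queue_enter_live(q);
	else
		ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
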
-- 
Ming


[PATCH] target: Avoid that EXTENDED COPY commands trigger lock inversion

2017-09-01 Thread Bart Van Assche
This patch prevents lockdep from reporting the following complaint:

==
WARNING: possible circular locking dependency detected
4.12.0-rc1-dbg+ #1 Not tainted
--
rmdir/12053 is trying to acquire lock:
 (device_mutex#2){+.+.+.}, at: [] 
target_free_device+0xae/0xf0 [target_core_mod]

but task is already holding lock:
 (&sb->s_type->i_mutex_key#14){++}, at: [] 
vfs_rmdir+0x50/0x140

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&sb->s_type->i_mutex_key#14){++}:
   lock_acquire+0x59/0x80
   down_write+0x36/0x70
   configfs_depend_item+0x3a/0xb0 [configfs]
   target_depend_item+0x13/0x20 [target_core_mod]
   target_xcopy_locate_se_dev_e4_iter+0x87/0x100 [target_core_mod]
   target_devices_idr_iter+0x16/0x20 [target_core_mod]
   idr_for_each+0x39/0xc0
   target_for_each_device+0x36/0x50 [target_core_mod]
   target_xcopy_locate_se_dev_e4+0x28/0x80 [target_core_mod]
   target_xcopy_do_work+0x2e9/0xdd0 [target_core_mod]
   process_one_work+0x1ca/0x3f0
   worker_thread+0x49/0x3b0
   kthread+0x109/0x140
   ret_from_fork+0x31/0x40

-> #0 (device_mutex#2){+.+.+.}:
   __lock_acquire+0x101f/0x11d0
   lock_acquire+0x59/0x80
   __mutex_lock+0x7e/0x950
   mutex_lock_nested+0x16/0x20
   target_free_device+0xae/0xf0 [target_core_mod]
   target_core_dev_release+0x10/0x20 [target_core_mod]
   config_item_put+0x6e/0xb0 [configfs]
   configfs_rmdir+0x1a6/0x300 [configfs]
   vfs_rmdir+0xb7/0x140
   do_rmdir+0x1f4/0x200
   SyS_rmdir+0x11/0x20
   entry_SYSCALL_64_fastpath+0x23/0xc2

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sb->s_type->i_mutex_key#14);
                               lock(device_mutex#2);
                               lock(&sb->s_type->i_mutex_key#14);
  lock(device_mutex#2);

 *** DEADLOCK ***

3 locks held by rmdir/12053:
 #0:  (sb_writers#10){.+.+.+}, at: [] mnt_want_write+0x1f/0x50
 #1:  (&sb->s_type->i_mutex_key#14/1){+.+.+.}, at: [] 
do_rmdir+0x15e/0x200
 #2:  (&sb->s_type->i_mutex_key#14){++}, at: [] 
vfs_rmdir+0x50/0x140

stack backtrace:
CPU: 3 PID: 12053 Comm: rmdir Not tainted 4.12.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0x86/0xcf
 print_circular_bug+0x1c7/0x220
 __lock_acquire+0x101f/0x11d0
 lock_acquire+0x59/0x80
 __mutex_lock+0x7e/0x950
 mutex_lock_nested+0x16/0x20
 target_free_device+0xae/0xf0 [target_core_mod]
 target_core_dev_release+0x10/0x20 [target_core_mod]
 config_item_put+0x6e/0xb0 [configfs]
 configfs_rmdir+0x1a6/0x300 [configfs]
 vfs_rmdir+0xb7/0x140
 do_rmdir+0x1f4/0x200
 SyS_rmdir+0x11/0x20
 entry_SYSCALL_64_fastpath+0x23/0xc2

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Mike Christie 
---
 drivers/target/target_core_configfs.c | 18 +-
 include/target/target_core_base.h |  1 +
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/target/target_core_configfs.c 
b/drivers/target/target_core_configfs.c
index 7e87d952bb7a..7d0971789f12 100644
--- a/drivers/target/target_core_configfs.c
+++ b/drivers/target/target_core_configfs.c
@@ -2251,13 +2251,29 @@ static struct configfs_attribute 
*target_core_dev_attrs[] = {
NULL,
 };
 
+static void target_core_dev_release_work(struct work_struct *release_work)
+{
+   struct se_device *dev =
+   container_of(release_work, struct se_device, release_work);
+
+   target_free_device(dev);
+}
+
 static void target_core_dev_release(struct config_item *item)
 {
struct config_group *dev_cg = to_config_group(item);
struct se_device *dev =
container_of(dev_cg, struct se_device, dev_group);
 
-   target_free_device(dev);
+   /*
+* Call target_free_device() asynchronously to avoid lock inversion
+* against the configfs_depend_item() call from
+* target_xcopy_locate_se_dev_e4_iter(). That call namely occurs with
+* device_mutex held. This function is called from inside vfs_rmdir
+* and hence is called with i_rwsem held.
+*/
+   INIT_WORK(&dev->release_work, target_core_dev_release_work);
+   schedule_work(&dev->release_work);
 }
 
 /*
diff --git a/include/target/target_core_base.h 
b/include/target/target_core_base.h
index 516764febeb7..19da905dd2ca 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -793,6 +793,7 @@ struct se_device {
struct list_headdev_tmr_list;
struct workqueue_struct *tmr_wq;
struct work_struct  qf_work_queue;
+   struct work_struct  release_work;

Re: [PATCH V2 6/8] block: allow to allocate req with REQF_PREEMPT when queue is frozen

2017-09-01 Thread Bart Van Assche
On Sat, 2017-09-02 at 02:49 +0800, Ming Lei wrote:
> + if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_freezing(q))
> + blk_queue_enter_live(q);
> + else
> + ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));

Why did you repost this patch series without trying to reach an agreement
about the approach?

Anyway, because of the unsafe blk_queue_enter_live() call introduced by this
patch, please add the following to the description of this patch whenever you
repost it:

NAK-ed-by: Bart Van Assche 



[PATCH V2 8/8] SCSI: freeze block queue when SCSI device is put into quiesce

2017-09-01 Thread Ming Lei
Simply quiescing the SCSI device and waiting for completion of the
I/O dispatched to the SCSI queue isn't safe: it is easy to use up
requests, because all the allocated requests can't be dispatched
while the device is in QUIESCE. Then no request can be allocated
for RQF_PREEMPT, and the system may hang somewhere, for example
when sending sync_cache or start_stop commands during the system
suspend path.

Before quiescing SCSI, this patch freezes the block queue first, so
that no new request can enter the queue any more, and all pending
requests are drained once blk_freeze_queue() returns.

This patch also uses __blk_get_request() for allocating the request
with RQF_PREEMPT, so that the allocation can succeed even though the
block queue is frozen.

Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..a59544012b68 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -243,10 +243,12 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
struct request *req;
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
+   unsigned flag = sdev->sdev_state == SDEV_QUIESCE ? BLK_REQ_PREEMPT : 0;
 
-   req = blk_get_request(sdev->request_queue,
+   req = __blk_get_request(sdev->request_queue,
data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
+   flag);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -2890,6 +2892,19 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
int err;
 
+   /*
+* Simply quiescing the SCSI device isn't safe, since it is
+* easy to use up requests: all the allocated requests can't
+* be dispatched while the device is in QUIESCE. Then no
+* request can be allocated and we may hang somewhere, such
+* as during system suspend/resume.
+*
+* So we freeze the block queue first, so that no new request
+* can enter the queue any more, and pending requests are
+* drained once blk_freeze_queue() returns.
+*/
+   blk_freeze_queue_preempt(sdev->request_queue);
+
mutex_lock(&sdev->state_mutex);
err = scsi_device_set_state(sdev, SDEV_QUIESCE);
mutex_unlock(&sdev->state_mutex);
@@ -2926,6 +2941,8 @@ void scsi_device_resume(struct scsi_device *sdev)
scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
scsi_run_queue(sdev->request_queue);
mutex_unlock(&sdev->state_mutex);
+
+   blk_unfreeze_queue_preempt(sdev->request_queue);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
-- 
2.9.5



[PATCH V2 7/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue

2017-09-01 Thread Ming Lei
The two APIs are required to allow request allocation for
RQF_PREEMPT when the queue is frozen.

The following two points have to be guaranteed for one queue:

1) a preempt freeze can be started only after all pending
normal & preempt freezes have completed

2) a normal freeze can't be started if there is a pending
preempt freeze

This is because, for a normal freeze, once blk_mq_freeze_queue_wait()
returns, we have to make sure no I/Os are pending at all.

An rwsem would have been perfect for this kind of synchronization,
but lockdep will complain in case of nested normal freezes.

So a spin_lock together with the freezing status is used for the sync.

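Concretely, the condition each freeze type waits for before bumping
mq_freeze_depth reduces to the following (a simplified sketch of the
wait_event_cmd() logic in this patch; wait_for() is just shorthand for
sleeping on q->mq_freeze_wq under q->freeze_lock):

	if (preempt)
		/* start only after all pending freezes have completed */
		wait_for(!q->normal_freezing && !q->preempt_freezing);
	else
		/* start only if no preempt freeze is pending */
		wait_for(!q->preempt_freezing);
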
Signed-off-by: Ming Lei 
---
 block/blk-core.c   |  2 ++
 block/blk-mq.c | 72 +++---
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  3 +++
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c199910d4fe1..bbcea07f17da 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -899,6 +899,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
if (blkcg_init_queue(q))
goto fail_ref;
 
+   spin_lock_init(&q->freeze_lock);
+
return q;
 
 fail_ref:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 695d2eeaf41a..bf8c057aa50f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -118,16 +118,48 @@ void blk_mq_in_flight(struct request_queue *q, struct 
hd_struct *part,
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, );
 }
 
-void blk_freeze_queue_start(struct request_queue *q)
+static void __blk_freeze_queue_start(struct request_queue *q, bool preempt)
 {
int freeze_depth;
 
+   /*
+* Wait for completion of another kind of freezing.
+*
+* We have to sync between normal freeze and preempt
+* freeze: a preempt freeze can only be started iff all
+* pending normal & preempt freezes have completed, while
+* a normal freeze can only be started if there is no
+* pending preempt freeze.
+*
+* An rwsem would have been perfect for this kind of sync,
+* but lockdep will complain in case of nested normal freezes.
+*
+* So we have to use a spinlock to do the sync manually.
+*/
+   spin_lock(&q->freeze_lock);
+   wait_event_cmd(q->mq_freeze_wq,
+  preempt ? !(q->normal_freezing + q->preempt_freezing)
+  : !q->preempt_freezing,
+  spin_unlock(&q->freeze_lock),
+  spin_lock(&q->freeze_lock));
+
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
+   if (preempt)
+   q->preempt_freezing = 1;
+   else
+   q->normal_freezing = 1;
+   spin_unlock(&q->freeze_lock);
+
percpu_ref_kill(&q->q_usage_counter);
if (q->mq_ops)
blk_mq_run_hw_queues(q, false);
-   }
+   } else
+   spin_unlock(&q->freeze_lock);
+}
+
+void blk_freeze_queue_start(struct request_queue *q)
+{
+   __blk_freeze_queue_start(q, false);
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
@@ -166,20 +198,54 @@ void blk_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
-void blk_unfreeze_queue(struct request_queue *q)
+static void __blk_unfreeze_queue(struct request_queue *q, bool preempt)
 {
int freeze_depth;
 
freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
+   spin_lock(&q->freeze_lock);
+   if (preempt)
+   q->preempt_freezing = 0;
+   else
+   q->normal_freezing = 0;
+   spin_unlock(&q->freeze_lock);
percpu_ref_reinit(&q->q_usage_counter);
wake_up_all(&q->mq_freeze_wq);
}
 }
+
+void blk_unfreeze_queue(struct request_queue *q)
+{
+   __blk_unfreeze_queue(q, false);
+}
 EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
+ * Once this function returns, only requests for preempt purposes
+ * (such as RQF_PREEMPT) can be allocated.
+ */
+void blk_freeze_queue_preempt(struct request_queue *q)
+{
+   __blk_freeze_queue_start(q, true);
+   blk_freeze_queue_wait(q);
+}
+EXPORT_SYMBOL_GPL(blk_freeze_queue_preempt);
+
+/*
+ * It is the caller's responsibility to make sure no new
+ * request can be allocated before calling this function.
+ */
+void blk_unfreeze_queue_preempt(struct request_queue *q)
+{
+   blk_freeze_queue_wait(q);
+   __blk_unfreeze_queue(q, true);
+}
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue_preempt);
+
+/*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
  * mpt3sas driver such that this function can be removed.
  */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0ba5cb043172..596f433eb54c 100644
--- a/include/linux/blk-mq.h

[PATCH V2 6/8] block: allow to allocate req with REQF_PREEMPT when queue is frozen

2017-09-01 Thread Ming Lei
RQF_PREEMPT is a bit special because such requests are required to
be dispatched to the LLD even when the SCSI device is quiesced.

So this patch introduces __blk_get_request() to allow the block
layer to allocate a request while the queue is frozen, since we
will freeze the queue before quiescing the SCSI device in the
following patch, in order to support safe SCSI quiescing.

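As an illustration, a caller that must issue a command to a frozen,
quiesced queue would use the new interface roughly as follows (a
sketch mirroring the scsi_execute() change in patch 8/8 of this
series, not additional functionality):

	struct request *req;

	/*
	 * The preempt flag lets the allocation enter the queue even
	 * while the queue is (preempt-)frozen.
	 */
	req = __blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
				__GFP_RECLAIM, BLK_MQ_REQ_PREEMPT);
	if (IS_ERR(req))
		return DRIVER_ERROR << 24;
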
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 25 +
 block/blk-mq.c | 11 +--
 block/blk.h|  5 +
 include/linux/blk-mq.h |  7 ---
 include/linux/blkdev.h | 17 +++--
 5 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85b15833a7a5..c199910d4fe1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1402,7 +1402,8 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 }
 
 static struct request *blk_old_get_request(struct request_queue *q,
-  unsigned int op, gfp_t gfp_mask)
+  unsigned int op, gfp_t gfp_mask,
+  unsigned int flags)
 {
struct request *rq;
int ret = 0;
@@ -1412,9 +1413,17 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+   /*
+* When the queue is frozen, we still need to allocate requests
+* for RQF_PREEMPT.
+*/
+   if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_freezing(q))
+   blk_queue_enter_live(q);
+   else
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
if (ret)
return ERR_PTR(ret);
+
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
@@ -1430,26 +1439,26 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-   gfp_t gfp_mask)
+struct request *__blk_get_request(struct request_queue *q, unsigned int op,
+ gfp_t gfp_mask, unsigned int flags)
 {
struct request *req;
 
if (q->mq_ops) {
req = blk_mq_alloc_request(q, op,
-   (gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   0 : BLK_MQ_REQ_NOWAIT);
+   flags | ((gfp_mask & __GFP_DIRECT_RECLAIM) ?
+   0 : BLK_MQ_REQ_NOWAIT));
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
-   req = blk_old_get_request(q, op, gfp_mask);
+   req = blk_old_get_request(q, op, gfp_mask, flags);
if (!IS_ERR(req) && q->initialize_rq_fn)
q->initialize_rq_fn(req);
}
 
return req;
 }
-EXPORT_SYMBOL(blk_get_request);
+EXPORT_SYMBOL(__blk_get_request);
 
 /**
  * blk_requeue_request - put a request back on queue
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 24de78afbe9a..695d2eeaf41a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -382,9 +382,16 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
 {
struct blk_mq_alloc_data alloc_data = { .flags = flags };
struct request *rq;
-   int ret;
+   int ret = 0;
 
-   ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+   /*
+* When the queue is frozen, we still need to allocate requests
+* for RQF_PREEMPT.
+*/
+   if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_freezing(q))
+   blk_queue_enter_live(q);
+   else
+   ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
if (ret)
return ERR_PTR(ret);
 
diff --git a/block/blk.h b/block/blk.h
index 242486e26a81..b71f8cc047aa 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -80,6 +80,11 @@ static inline void blk_queue_enter_live(struct request_queue 
*q)
percpu_ref_get(&q->q_usage_counter);
 }
 
+static inline bool blk_queue_is_freezing(struct request_queue *q)
+{
+   return percpu_ref_is_dying(&q->q_usage_counter);
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f90d78eb85df..0ba5cb043172 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -200,9 +200,10 @@ void blk_mq_free_request(struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
 
 enum {
-   BLK_MQ_REQ_NOWAIT   = (1 << 0), /* return when out of requests */
-   BLK_MQ_REQ_RESERVED = (1 << 1), /* allocate from reserved pool */
-   BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */

[PATCH V2 5/8] block: tracking request allocation with q_usage_counter

2017-09-01 Thread Ming Lei
This usage is basically the same as in blk-mq, so that we can
support freezing the queue easily.

Signed-off-by: Ming Lei 
---
 block/blk-core.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index ce2d3b6f6c62..85b15833a7a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1405,16 +1405,21 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
   unsigned int op, gfp_t gfp_mask)
 {
struct request *rq;
+   int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
 
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+   if (ret)
+   return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
+   blk_queue_exit(q);
return rq;
}
 
@@ -1586,6 +1591,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
+   blk_queue_exit(q);
}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1867,8 +1873,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
* Grab a free request. This might sleep but can not fail.
 * Returns with the queue unlocked.
 */
+   blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
if (IS_ERR(req)) {
+   blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
-- 
2.9.5



[PATCH V2 4/8] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait

2017-09-01 Thread Ming Lei
The only change on the legacy path is that blk_drain_queue() is run
from blk_freeze_queue(), which is called in blk_cleanup_queue().

So this patch removes the explicit __blk_drain_queue() call in
blk_cleanup_queue().

Signed-off-by: Ming Lei 
---
 block/blk-core.c | 17 +++--
 block/blk-mq.c   |  8 +---
 block/blk.h  |  1 +
 drivers/nvme/host/core.c |  2 +-
 include/linux/blk-mq.h   |  2 +-
 5 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d579501f24ba..ce2d3b6f6c62 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -530,6 +530,21 @@ static void __blk_drain_queue(struct request_queue *q, 
bool drain_all)
 }
 
 /**
+ * blk_drain_queue - drain requests from request_queue
+ * @q: queue to drain
+ *
+ * Drain requests from @q.  All pending requests are drained.
+ * The caller is responsible for ensuring that no new requests
+ * which need to be drained are queued.
+ */
+void blk_drain_queue(struct request_queue *q)
+{
+   spin_lock_irq(q->queue_lock);
+   __blk_drain_queue(q, true);
+   spin_unlock_irq(q->queue_lock);
+}
+
+/**
  * blk_queue_bypass_start - enter queue bypass mode
  * @q: queue of interest
  *
@@ -653,8 +668,6 @@ void blk_cleanup_queue(struct request_queue *q)
 */
blk_freeze_queue(q);
spin_lock_irq(lock);
-   if (!q->mq_ops)
-   __blk_drain_queue(q, true);
queue_flag_set(QUEUE_FLAG_DEAD, q);
spin_unlock_irq(lock);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c532d8612e1..24de78afbe9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -131,11 +131,13 @@ void blk_freeze_queue_start(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
-void blk_mq_freeze_queue_wait(struct request_queue *q)
+void blk_freeze_queue_wait(struct request_queue *q)
 {
+   if (!q->mq_ops)
+   blk_drain_queue(q);
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
+EXPORT_SYMBOL_GPL(blk_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout)
@@ -160,7 +162,7 @@ void blk_freeze_queue(struct request_queue *q)
 * exported to drivers as the only user for unfreeze is blk_mq.
 */
blk_freeze_queue_start(q);
-   blk_mq_freeze_queue_wait(q);
+   blk_freeze_queue_wait(q);
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
diff --git a/block/blk.h b/block/blk.h
index 6847c5435cca..242486e26a81 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -64,6 +64,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request 
*rq,
struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
+void blk_drain_queue(struct request_queue *q);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 986f2b4f9760..d34a9ffaa940 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2778,7 +2778,7 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 
mutex_lock(&ctrl->namespaces_mutex);
list_for_each_entry(ns, &ctrl->namespaces, list)
-   blk_mq_freeze_queue_wait(ns->queue);
+   blk_freeze_queue_wait(ns->queue);
mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8ae77e088c01..f90d78eb85df 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -259,7 +259,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
-void blk_mq_freeze_queue_wait(struct request_queue *q);
+void blk_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
-- 
2.9.5



[PATCH V2 3/8] blk-mq: only run hw queues for blk-mq

2017-09-01 Thread Ming Lei
Running the hardware queues only makes sense for blk-mq, so this
patch just makes that explicit.

Reviewed-by: Johannes Thumshirn 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8cf1f7cbef2b..4c532d8612e1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
-   blk_mq_run_hw_queues(q, false);
+   if (q->mq_ops)
+   blk_mq_run_hw_queues(q, false);
}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5



[PATCH V2 2/8] blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue

2017-09-01 Thread Ming Lei
This API will be used by the legacy path too.

Signed-off-by: Ming Lei 
---
 block/bfq-iosched.c  |  2 +-
 block/blk-cgroup.c   |  4 ++--
 block/blk-mq.c   | 17 -
 block/blk-mq.h   |  1 -
 block/elevator.c |  2 +-
 drivers/block/loop.c |  8 
 drivers/block/rbd.c  |  2 +-
 drivers/nvme/host/core.c |  2 +-
 include/linux/blk-mq.h   |  2 +-
 9 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 509f39998011..ce2b00e897e2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4757,7 +4757,7 @@ static int bfq_init_queue(struct request_queue *q, struct 
elevator_type *e)
 * The invocation of the next bfq_create_group_hierarchy
 * function is the head of a chain of function calls
 * (bfq_create_group_hierarchy->blkcg_activate_policy->
-* blk_mq_freeze_queue) that may lead to the invocation of the
+* blk_freeze_queue) that may lead to the invocation of the
 * has_work hook function. For this reason,
 * bfq_create_group_hierarchy is invoked only after all
 * scheduler data has been initialized, apart from the fields
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 02e8a47ac77c..87c15f3947d5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1296,7 +1296,7 @@ int blkcg_activate_policy(struct request_queue *q,
return 0;
 
if (q->mq_ops)
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
else
blk_queue_bypass_start(q);
 pd_prealloc:
@@ -1363,7 +1363,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
return;
 
if (q->mq_ops)
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
else
blk_queue_bypass_start(q);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 82136e83951d..8cf1f7cbef2b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -161,16 +161,7 @@ void blk_freeze_queue(struct request_queue *q)
blk_freeze_queue_start(q);
blk_mq_freeze_queue_wait(q);
 }
-
-void blk_mq_freeze_queue(struct request_queue *q)
-{
-   /*
-* ...just an alias to keep freeze and unfreeze actions balanced
-* in the blk_mq_* namespace
-*/
-   blk_freeze_queue(q);
-}
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
+EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
 void blk_unfreeze_queue(struct request_queue *q)
 {
@@ -2248,7 +2239,7 @@ static void blk_mq_update_tag_set_depth(struct 
blk_mq_tag_set *set,
lockdep_assert_held(&set->tag_list_lock);

list_for_each_entry(q, &set->tag_list, tag_set_list) {
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
queue_set_hctx_shared(q, shared);
blk_unfreeze_queue(q);
}
@@ -2683,7 +2674,7 @@ static int __blk_mq_update_nr_requests(struct 
request_queue *q,
if (!set)
return -EINVAL;
 
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
 
ret = 0;
queue_for_each_hw_ctx(q, hctx, i) {
@@ -2747,7 +2738,7 @@ static void __blk_mq_update_nr_hw_queues(struct 
blk_mq_tag_set *set,
return;
 
list_for_each_entry(q, &set->tag_list, tag_set_list)
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
 
set->nr_hw_queues = nr_hw_queues;
blk_mq_update_queue_map(set);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1b9742eb7399..7ce29ef1e6f3 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,6 @@ struct blk_mq_ctx {
 } cacheline_aligned_in_smp;
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 371c8165c9e8..1164c8a3720f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -967,7 +967,7 @@ static int elevator_switch_mq(struct request_queue *q,
 {
int ret;
 
-   blk_mq_freeze_queue(q);
+   blk_freeze_queue(q);
 
if (q->elevator) {
if (q->elevator->registered)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5c11ea44d470..b2e708b7e1e6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -211,7 +211,7 @@ static void __loop_update_dio(struct loop_device *lo, bool 
dio)
 * LO_FLAGS_READ_ONLY, both are set from kernel, and losetup
 * will get updated by ioctl(LOOP_GET_STATUS)
 */
-   blk_mq_freeze_queue(lo->lo_queue);
+   blk_freeze_queue(lo->lo_queue);
lo->use_dio = use_dio;
if (use_dio)
lo->lo_flags |= LO_FLAGS_DIRECT_IO;
@@ -599,7 +599,7 @@ static int loop_switch(struct loop_device *lo, struct 

[PATCH V2 0/8] block/scsi: safe SCSI quiescing

2017-09-01 Thread Ming Lei
Hi,

The current SCSI quiesce isn't safe: it is easy to trigger I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for RQF_PREEMPT
can be dispatched to SCSI successfully, and scsi_device_quiesce() just
waits for completion of the I/Os already dispatched to the SCSI stack. That
isn't enough at all.

Because new requests can still be allocated, but none of the allocated
requests can be dispatched successfully, the request pool can easily be
consumed up.

Then a request with RQF_PREEMPT can't be allocated, and the system may
hang forever, for example during system suspend or SCSI domain validation.

I/O hangs during both system suspend [1] and SCSI domain validation
have been reported before.

This patchset tries to solve the issue by freezing the block queue during
SCSI quiescing, and by allowing allocation of RQF_PREEMPT requests while
the queue is frozen.

Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset fixes
them both by introducing preempt versions of blk_freeze_queue() and
blk_unfreeze_queue().

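Schematically, the resulting call sequence is (a simplified sketch
based on patches 7 and 8, error handling omitted):

	/* quiesce side, e.g. system suspend */
	blk_freeze_queue_preempt(q);    /* drain; only preempt allocation allowed */
	scsi_device_set_state(sdev, SDEV_QUIESCE);

	/* RQF_PREEMPT requests (sync_cache, start_stop, ...) still work here */

	/* resume side */
	scsi_device_set_state(sdev, SDEV_RUNNING);
	blk_unfreeze_queue_preempt(q);
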
V2:
- drop the 1st patch of V1 because percpu_ref_is_dying() is
enough, as pointed out by Tejun

- introduce preempt versions of blk_[freeze|unfreeze]_queue

- sync between preempt freeze and normal freeze

- fix warning from percpu-refcount as reported by Oleksandr



[1] https://marc.info/?t=150340250100013=3=2



Ming Lei (8):
  blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue
  blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue
  blk-mq: only run hw queues for blk-mq
  blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
  block: tracking request allocation with q_usage_counter
  block: allow to allocate req with REQF_PREEMPT when queue is frozen
  block: introduce preempt version of blk_[freeze|unfreeze]_queue
  SCSI: freeze block queue when SCSI device is put into quiesce

 block/bfq-iosched.c  |   2 +-
 block/blk-cgroup.c   |   8 ++--
 block/blk-core.c |  50 
 block/blk-mq.c   | 119 ---
 block/blk-mq.h   |   1 -
 block/blk.h  |   6 +++
 block/elevator.c |   4 +-
 drivers/block/loop.c |  16 +++
 drivers/block/rbd.c  |   2 +-
 drivers/nvme/host/core.c |   8 ++--
 drivers/scsi/scsi_lib.c  |  21 -
 include/linux/blk-mq.h   |  15 +++---
 include/linux/blkdev.h   |  20 +++-
 13 files changed, 206 insertions(+), 66 deletions(-)

-- 
2.9.5



[PATCH V2 1/8] blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue

2017-09-01 Thread Ming Lei
We will support freezing the queue on the block legacy path too.

Signed-off-by: Ming Lei 
---
 block/blk-cgroup.c   |  4 ++--
 block/blk-mq.c   | 10 +-
 block/elevator.c |  2 +-
 drivers/block/loop.c |  8 
 drivers/nvme/host/core.c |  4 ++--
 include/linux/blk-mq.h   |  2 +-
 6 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0480892e97e5..02e8a47ac77c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1337,7 +1337,7 @@ int blkcg_activate_policy(struct request_queue *q,
spin_unlock_irq(q->queue_lock);
 out_bypass_end:
if (q->mq_ops)
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
else
blk_queue_bypass_end(q);
if (pd_prealloc)
@@ -1388,7 +1388,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
spin_unlock_irq(q->queue_lock);
 
if (q->mq_ops)
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
else
blk_queue_bypass_end(q);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d935f15c54da..82136e83951d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -172,7 +172,7 @@ void blk_mq_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
-void blk_mq_unfreeze_queue(struct request_queue *q)
+void blk_unfreeze_queue(struct request_queue *q)
 {
int freeze_depth;
 
@@ -183,7 +183,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
wake_up_all(&q->mq_freeze_wq);
}
 }
-EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
@@ -2250,7 +2250,7 @@ static void blk_mq_update_tag_set_depth(struct 
blk_mq_tag_set *set,
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_freeze_queue(q);
queue_set_hctx_shared(q, shared);
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
}
 }
 
@@ -2708,7 +2708,7 @@ static int __blk_mq_update_nr_requests(struct 
request_queue *q,
if (!ret)
q->nr_requests = nr;
 
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
 
return ret;
 }
@@ -2757,7 +2757,7 @@ static void __blk_mq_update_nr_hw_queues(struct 
blk_mq_tag_set *set,
}
 
list_for_each_entry(q, &set->tag_list, tag_set_list)
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
 }
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
diff --git a/block/elevator.c b/block/elevator.c
index 0e465809d3f3..371c8165c9e8 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -994,7 +994,7 @@ static int elevator_switch_mq(struct request_queue *q,
blk_add_trace_msg(q, "elv switch: none");
 
 out:
-   blk_mq_unfreeze_queue(q);
+   blk_unfreeze_queue(q);
return ret;
 }
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2fbd4089c20e..5c11ea44d470 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -217,7 +217,7 @@ static void __loop_update_dio(struct loop_device *lo, bool 
dio)
lo->lo_flags |= LO_FLAGS_DIRECT_IO;
else
lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
-   blk_mq_unfreeze_queue(lo->lo_queue);
+   blk_unfreeze_queue(lo->lo_queue);
 }
 
 static int
@@ -605,7 +605,7 @@ static int loop_switch(struct loop_device *lo, struct file 
*file)
do_loop_switch(lo, );
 
/* unfreeze */
-   blk_mq_unfreeze_queue(lo->lo_queue);
+   blk_unfreeze_queue(lo->lo_queue);
 
return 0;
 }
@@ -1079,7 +1079,7 @@ static int loop_clr_fd(struct loop_device *lo)
lo->lo_state = Lo_unbound;
/* This is safe: open() is still holding a reference. */
module_put(THIS_MODULE);
-   blk_mq_unfreeze_queue(lo->lo_queue);
+   blk_unfreeze_queue(lo->lo_queue);
 
if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
loop_reread_partitions(lo, bdev);
@@ -1191,7 +1191,7 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
__loop_update_dio(lo, lo->use_dio);
 
  exit:
-   blk_mq_unfreeze_queue(lo->lo_queue);
+   blk_unfreeze_queue(lo->lo_queue);
 
if (!err && (info->lo_flags & LO_FLAGS_PARTSCAN) &&
 !(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 37046ac2c441..5c76b0a96be2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1226,7 +1226,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, 
struct nvme_id_ns *id)
 
if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
nvme_config_discard(ns);
-   blk_mq_unfreeze_queue(disk->queue);
+   blk_unfreeze_queue(disk->queue);
 }
 
 static int nvme_revalidate_disk(struct gendisk 

Re: [PATCH 8/8] scsi: Introduce ZBC disk I/O scheduler

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> +static struct scsi_disk *__zoned_scsi_disk(struct request_queue *q)
> +{
> +   struct scsi_device *sdp;
> +   struct scsi_disk *sdkp;
> +
> +   if (!blk_queue_is_zoned(q)) {
> +   pr_err("zoned: Not a zoned block device\n");
> +   return NULL;
> +   }
> +
> +   sdp = scsi_device_from_queue(q);
> +   if (!sdp) {
> +   pr_err("zoned: Not a SCSI device\n");
> +   return NULL;
> +   }
> +
> +   sdkp = dev_get_drvdata(&sdp->sdev_gendev);
> +   if (WARN_ON(sdkp->disk->queue != q))
> +   return NULL;
> +
> +   return sdkp;
> +}

Hello Damien,

Can reading sdkp->disk->queue cause a kernel crash if sdp does not point at
a SCSI device that is associated with a SCSI disk? How about using something
like the code below to convert a request queue pointer into a SCSI disk
pointer?

static int lookup_disk(struct device *dev, void *data)
{
struct scsi_disk **sdkp = data;

if (!*sdkp && dev->class == &sd_disk_class)
*sdkp = to_scsi_disk(dev);

return 0;
}

static struct scsi_disk *q_to_sdkp(struct request_queue *q)
{
struct scsi_device *sdp = scsi_device_from_queue(q);
struct scsi_disk *sdkp = NULL;

if (sdp)
device_for_each_child(&sdp->sdev_gendev, &sdkp, lookup_disk);
return sdkp;
}

Thanks,

Bart.

Re: [PATCH 8/9] block: allow to allocate req with REQF_PREEMPT when queue is frozen

2017-09-01 Thread Ming Lei
On Fri, Sep 01, 2017 at 03:43:39PM +, Bart Van Assche wrote:
> On Fri, 2017-09-01 at 11:55 +0800, Ming Lei wrote:
> > On Thu, Aug 31, 2017 at 10:50:25PM +, Bart Van Assche wrote:
> > > On Fri, 2017-09-01 at 01:27 +0800, Ming Lei wrote:
> > > > @@ -1413,9 +1414,17 @@ static struct request 
> > > > *blk_old_get_request(struct request_queue *q,
> > > > /* create ioc upfront */
> > > > create_io_context(gfp_mask, q->node);
> > > >  
> > > > -   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
> > > > +   /*
> > > > +* When queue is frozen, we still need to allocate req for
> > > > +* REQF_PREEMPT.
> > > > +*/
> > > > +   if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_frozen(q))
> > > > +   blk_queue_enter_live(q);
> > > > +   else
> > > > +   ret = blk_queue_enter(q, !(gfp_mask & 
> > > > __GFP_DIRECT_RECLAIM));
> > > 
> > > Sorry but I doubt that calling blk_queue_enter_live() from inside
> > > blk_old_get_request() is safe. Calling blk_queue_enter_live() is only safe
> > > before a request queue has been marked dead. What prevents a kernel thread
> > > that holds a reference on a request queue and that is running concurrently
> > > with blk_cleanup_queue() to call blk_old_get_request() after a queue has
> > > been marked dead?
> > 
> > I have sent one delta patch to the list, which will only call
> > blk_queue_enter_live() iff the SCSI device is in QUIESCE.
> 
> That wouldn't make this hack less ugly.
> 
> It is possible to trigger the running -> quiesce state transition through
> /sys/class/scsi_device/*/device/state and the quiesce -> removed transition
> through /sys/class/scsi_device/*/device/delete. An example:
> 
> modprobe scsi_debug delay=0 dev_size_mb=16
> lsscsi | grep scsi_debug
> cd /sys/class/scsi_device/8:0:0:0/device
> echo quiesce > state
> echo 1 > delete
> 
> So the code from your patch 8/9 can race against device removal.
> 
> I think we need a better approach than the REQ_PREEMPT hack. How about
> implementing resume as a callback by adding an additional function pointer
> to struct request_queue / struct blk_mq_ops instead of implementing it as
> a request? For SCSI devices races of resume against removal can e.g. be
> handled by checking the scsi_device_set_state() return value. That function
> namely does not allow the removing/removed to running state transition.

If there is a race between resume and removal, that has nothing to
do with this patchset.

We definitely need to prevent new requests from being allocated after
the SCSI device is put into quiesce. I don't see a better way
than freezing the queue, because it is the only available way to
prevent new request allocation.

Actually there is a race between normal freezing and preempt freezing
in this patchset, and it will be fixed in V2.


-- 
Ming

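For context, the callback-based resume that Bart suggests above might
look roughly like the following. This is purely illustrative: the
resume_fn hook is hypothetical and is not defined by any posted patch.

	/* Hypothetical sketch only; resume_fn does not exist in this series. */
	struct blk_mq_ops_sketch {
		void (*resume_fn)(struct request_queue *q);
	};

	static void blk_resume_queue_sketch(struct request_queue *q,
					    const struct blk_mq_ops_sketch *ops)
	{
		if (ops->resume_fn)
			ops->resume_fn(q);  /* instead of an RQF_PREEMPT request */
	}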

Re: [PATCH 7/8] scsi: sd_zbc: Disable zone write locking with scsi-mq

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> In the case of a ZBC disk used with scsi-mq, zone write locking does
> not prevent write reordering in sequential zones. Unlike the legacy
> case, zone locking can only be done after the command request is
> removed from the scheduler dispatch queue. That is, at the time of
> zone locking, the write command may already be out of order.
> 
> Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
> used with scsi-mq. Write order guarantees can be provided by an
> adapted I/O scheduler.

Reviewed-by: Bart Van Assche 



Re: [PATCH 6/8] scsi: sd_zbc: Limit zone write locking to sequential zones

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> Zoned block devices have no write constraints for conventional zones
> so write locking of conventional zones is not necessary and can hurt
> performance. To avoid this, introduce the seq_zones bitmap to indicate
> if a zone is a sequential one. Use this information to allow any write
> to be issued to conventional zones, locking only sequential zones.
> 
> As the zone bitmap allocation for seq_zones is identical to the zones
> write lock bitmap, introduce the helper sd_zbc_alloc_zone_bitmap().
> Using this helper, wait for the disk capacity and number of zones to
> stabilize on the second revalidation pass to allocate and initialize
> the bitmaps.

Reviewed-by: Bart Van Assche 

Re: [PATCH 5/8] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> The three values starting at byte 8 of the Zoned Block Device
> Characteristics VPD page B6h are 32-bit values, not 64-bit. So use
> get_unaligned_be32() to retrieve the values, not get_unaligned_be64().

Reviewed-by: Bart Van Assche 

Re: [PATCH 4/8] scsi: sd_zbc: Reorganize and cleanup

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> Introduce sd_zbc.h for zbc related declarations (avoid cluttering sd.h).
> 
> Fix comments style in sd_zbc.c (do not use documentation format) and
> add/fix comments to enhance explanations. Also remove a useless blank
> line in sd_zbc_read_zones().
> 
> Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw
> assignment and simplify nr_zones calculation.
> 
> Finally, use the min() macro in sd_zbc_check_zone_size() to get a report
> zone reply size instead of the open coded version.

Hello Damien,

Should this patch perhaps be split into multiple patches to make review
easier?

> +static inline void sd_zbc_remove(struct scsi_disk *sdkp) {}

Please use the same coding style as in other Linux kernel header files,
namely '{' and '}' both at the start of a line. See e.g. include/linux/sysfs.h.

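Following that convention, the stub would look like (sketch):

	static inline void sd_zbc_remove(struct scsi_disk *sdkp)
	{
	}
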
Thanks,

Bart.

Re: [PATCH 3/8] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> +/* Zone types of REPORT ZONES zone descriptors */
> +enum zbc_zone_type {
> + ZBC_ZONE_TYPE_CONV  = 0x1,
> + ZBC_ZONE_TYPE_SEQWRITE_REQ  = 0x2,
> + ZBC_ZONE_TYPE_SEQWRITE_PREF = 0x3,
> + /* 0x4 to 0xf are reserved */
> + ZBC_ZONE_TYPE_RESERVED  = 0x4,
> +};

Hello Damien,

Since no code is using ZBC_ZONE_TYPE_RESERVED, is the comment about reserved
values sufficient and can the definition of ZBC_ZONE_TYPE_RESERVED be left out?

Anyway:

Reviewed-by: Bart Van Assche 



Re: [PATCH 1/9] percpu-refcount: introduce percpu_ref_is_dead()

2017-09-01 Thread Ming Lei
On Fri, Sep 01, 2017 at 06:59:15AM -0700, Tejun Heo wrote:
> Hello, Ming.
> 
> > +/**
> > + * percpu_ref_is_dead - test whether a percpu refcount is killed
> > + * @ref: percpu_ref to test
> > + *
> > + * Returns %true if @ref is dead
> > + *
> > + * This function is safe to call as long as @ref is between init and exit.
> > + */
> > +static inline bool percpu_ref_is_dead(struct percpu_ref *ref)
> > +{
> > +   unsigned long __percpu *percpu_count;
> > +
> > +   if (__ref_is_percpu(ref, &percpu_count))
> > +   return false;
> > +   return ref->percpu_count_ptr & __PERCPU_REF_DEAD;
> > +}
> 
> Can you please explain why percpu_ref_is_dying() isn't enough for your
> use case?

Hi Tejun,

You are right, it looks like percpu_ref_is_dying() is enough;
percpu_ref_get_many() works fine no matter whether the ref is dying or not.

Thanks!

-- 
Ming

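For reference, the existing helper Tejun points at looks roughly like
this in include/linux/percpu-refcount.h; it returns true from the
moment the ref is killed, whether it is still dying or already dead:

	static inline bool percpu_ref_is_dying(struct percpu_ref *ref)
	{
		return ref->percpu_count_ptr & __PERCPU_REF_DEAD;
	}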

Re: [PATCH 2/8] block: Fix declaration of blk-mq scheduler functions

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -284,6 +284,16 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set 
> *set, int nr_hw_queues);
>  void blk_mq_quiesce_queue_nowait(struct request_queue *q);
>  
>  /*
> + * Scheduler helper functions.
> + */
> +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> +  void (*exit)(struct blk_mq_hw_ctx *));
> +bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> + struct request **merged_request);
> +bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request 
> *rq);
> +void blk_mq_sched_request_inserted(struct request *rq);
> +
> +/*
>   * Driver command data is immediately after the request. So subtract request
>   * size to get back to the original request, add request size to get the PDU.
>   */

Hello Damien,

A similar comment as for patch 1/8 for this patch: should these declarations
perhaps be moved to a new header file, e.g. include/linux/blk-mq-sched.h
because these functions are only used by I/O schedulers and not in the
implementation of block drivers? That will help to avoid block driver
authors wondering whether or not they have to use these functions
while they are reading include/linux/blk-mq.h.

Thanks,

Bart.

Re: [PATCH 1/8] block: Fix declaration of blk-mq debugfs functions

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 14542308d25b..a369174a9679 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -5,6 +5,21 @@
>  #include 
>  #include 
>  
> +#ifdef CONFIG_BLK_DEBUG_FS
> +
> +#include <linux/seq_file.h>
> +
> +struct blk_mq_debugfs_attr {
> + const char *name;
> + umode_t mode;
> + int (*show)(void *, struct seq_file *);
> + ssize_t (*write)(void *, const char __user *, size_t, loff_t *);
> + /* Set either .show or .seq_ops. */
> + const struct seq_operations *seq_ops;
> +};
> +
> +#endif
> +
>  struct blk_mq_tags;
>  struct blk_flush_queue;
>  
> @@ -289,4 +304,9 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
>   for ((i) = 0; (i) < (hctx)->nr_ctx &&   \
>({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
>  
> +#ifdef CONFIG_BLK_DEBUG_FS
> +int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
> +int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
> +#endif
> +
>  #endif

Hello Damien,

Should these definitions perhaps be moved to a new header file, e.g.
include/linux/blk-mq-debugfs.h, such that the introduction of more #ifdefs
and #include directives in include/linux/blk-mq.h can be avoided?

Thanks,

Bart.

Re: [PATCH v3 2/5] dt-bindings: scsi: ufs: add document for hisi-ufs

2017-09-01 Thread Rob Herring
On Tue, Aug 29, 2017 at 04:41:13PM +0800, Li Wei wrote:
> add ufs node document for Hisilicon
> 
> Signed-off-by: Li Wei 
> ---
>  Documentation/devicetree/bindings/ufs/ufs-hisi.txt | 35 
> ++
>  1 file changed, 35 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/ufs/ufs-hisi.txt
> 
> diff --git a/Documentation/devicetree/bindings/ufs/ufs-hisi.txt 
> b/Documentation/devicetree/bindings/ufs/ufs-hisi.txt
> new file mode 100644
> index ..cfc84c821d50
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/ufs/ufs-hisi.txt
> @@ -0,0 +1,35 @@
> +* Hisilicon Universal Flash Storage (UFS) Host Controller
> +
> +UFS nodes are defined to describe on-chip UFS hardware macro.
> +Each UFS Host Controller should have its own node.
> +
> +Required properties:
> +- compatible: compatible list, contains one of the following -
> + "hisilicon,hi3660-ufs" for hisi ufs host controller
> +  present on Hi3660 chipset.
> +- reg   : should contain UFS register address space & UFS SYS 
> CTRL register address,
> +- interrupt-parent  : interrupt device
> +- interrupts: interrupt number
> +- clocks : List of phandle and clock specifier pairs
> +- clock-names   : List of clock input name strings sorted in the same
> +   order as the clocks property. "clk_ref", "clk_phy" is 
> optional
> +
> +Optional properties for board device:
> +- reset-gpio : specifies to reset devices

reset-gpios

Really, this should be in a sub node representing the device. It's 
fine if this is it, but if we start adding power supplies and other 
properties it should be a separate node.
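
Something along these lines, for example (the node and property values
below are only illustrative, not taken from your patch):

	ufs@... {
		compatible = ...;
		...

		/* the UFS device attached to the host controller */
		device {
			reset-gpios = <&gpio_x 1 GPIO_ACTIVE_LOW>;
		};
	};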

> +
> +Example:
> +
> + ufs: ufs@ff3b0000 {
> + compatible = "jedec,ufs-1.1", "hisilicon,hi3660-ufs";

Need to reverse the order here. Most specific first.

> + /* 0: HCI standard */
> + /* 1: UFS SYS CTRL */
> + reg = <0x0 0xff3b0000 0x0 0x1000>,
> + <0x0 0xff3b1000 0x0 0x1000>;
> + interrupt-parent = <>;
> + interrupts = ;
> + clocks = <_ctrl HI3660_CLK_GATE_UFSIO_REF>,
> + <_ctrl HI3660_CLK_GATE_UFSPHY_CFG>;
> + clock-names = "clk_ref", "clk_phy";
> + freq-table-hz = <0 0>, <0 0>;
> + reset-gpio = < 1 0>;
> + }
> -- 
> 2.11.0
> 


Re: [PATCH 1/9] percpu-refcount: introduce percpu_ref_is_dead()

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 06:59 -0700, Tejun Heo wrote:
> > +/**
> > + * percpu_ref_is_dead - test whether a percpu refcount is killed
> > + * @ref: percpu_ref to test
> > + *
> > + * Returns %true if @ref is dead
> > + *
> > + * This function is safe to call as long as @ref is between init and exit.
> > + */
> > +static inline bool percpu_ref_is_dead(struct percpu_ref *ref)
> > +{
> > +   unsigned long __percpu *percpu_count;
> > +
> > +   if (__ref_is_percpu(ref, &percpu_count))
> > +   return false;
> > +   return ref->percpu_count_ptr & __PERCPU_REF_DEAD;
> > +}
> 
> Can you please explain why percpu_ref_is_dying() isn't enough for your
> use case?

Hello Tejun,

The approach of the whole series is wrong so I think that you can ignore this 
patch.

Bart.

Re: [PATCH 8/9] block: allow to allocate req with REQF_PREEMPT when queue is frozen

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 11:55 +0800, Ming Lei wrote:
> On Thu, Aug 31, 2017 at 10:50:25PM +, Bart Van Assche wrote:
> > On Fri, 2017-09-01 at 01:27 +0800, Ming Lei wrote:
> > > @@ -1413,9 +1414,17 @@ static struct request *blk_old_get_request(struct 
> > > request_queue *q,
> > >   /* create ioc upfront */
> > >   create_io_context(gfp_mask, q->node);
> > >  
> > > - ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
> > > + /*
> > > +  * When queue is frozen, we still need to allocate req for
> > > +  * REQF_PREEMPT.
> > > +  */
> > > + if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_frozen(q))
> > > + blk_queue_enter_live(q);
> > > + else
> > > + ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
> > 
> > Sorry but I doubt that calling blk_queue_enter_live() from inside
> > blk_old_get_request() is safe. Calling blk_queue_enter_live() is only safe
> > before a request queue has been marked dead. What prevents a kernel thread
> > that holds a reference on a request queue and that is running concurrently
> > with blk_cleanup_queue() to call blk_old_get_request() after a queue has
> > been marked dead?
> 
> I have sent one delta patch in list, which will only call
> blk_queue_enter_live() iff SCSI device is in QUIESCE.

That wouldn't make this hack less ugly.

It is possible to trigger the running -> quiesce state transition through
/sys/class/scsi_device/*/device/state and the quiesce -> removed transition
through /sys/class/scsi_device/*/device/delete. An example:

modprobe scsi_debug delay=0 dev_size_mb=16
lsscsi | grep scsi_debug
cd /sys/class/scsi_device/8:0:0:0/device
echo quiesce > state
echo 1 > delete

So the code from your patch 8/9 can race against device removal.

I think we need a better approach than the REQ_PREEMPT hack. How about
implementing resume as a callback by adding an additional function pointer
to struct request_queue / struct blk_mq_ops instead of implementing it as
a request? For SCSI devices races of resume against removal can e.g. be
handled by checking the scsi_device_set_state() return value. That function
namely does not allow the removing/removed to running state transition.
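
A rough sketch of the idea (the callback name is made up here and this
is not a tested implementation):

	/* in struct blk_mq_ops, next to the other callbacks: */
	void (*resume_fn)(struct request_queue *q);

	/* SCSI could then resume a quiesced device like this: */
	static void scsi_mq_resume_fn(struct request_queue *q)
	{
		struct scsi_device *sdev = q->queuedata;

		/*
		 * scsi_device_set_state() refuses the removing/removed ->
		 * running transition, which closes the race with removal.
		 */
		if (scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
			blk_mq_run_hw_queues(q, true);
	}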

Bart.

Re: [PATCH 1/9] percpu-refcount: introduce percpu_ref_is_dead()

2017-09-01 Thread Tejun Heo
Hello, Ming.

> +/**
> + * percpu_ref_is_dead - test whether a percpu refcount is killed
> + * @ref: percpu_ref to test
> + *
> + * Returns %true if @ref is dead
> + *
> + * This function is safe to call as long as @ref is between init and exit.
> + */
> +static inline bool percpu_ref_is_dead(struct percpu_ref *ref)
> +{
> + unsigned long __percpu *percpu_count;
> +
> + if (__ref_is_percpu(ref, &percpu_count))
> + return false;
> + return ref->percpu_count_ptr & __PERCPU_REF_DEAD;
> +}

Can you please explain why percpu_ref_is_dying() isn't enough for your
use case?
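
For comparison, the existing helper simply tests the DEAD bit:

	static inline bool percpu_ref_is_dying(struct percpu_ref *ref)
	{
		return ref->percpu_count_ptr & __PERCPU_REF_DEAD;
	}

so at first sight the only difference is the extra __ref_is_percpu()
check in your version.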

Thanks.

-- 
tejun


Re: [PATCH 4/9] blk-mq: only run hw queues for blk-mq

2017-09-01 Thread Johannes Thumshirn
On Fri, Sep 01, 2017 at 08:33:40PM +0800, Ming Lei wrote:
> Given this is the only case where blk_mq_run_hw_queues() is run on
> !q->mq_ops now, I suggest checking q->mq_ops outside; otherwise
> it would be a bit of overkill.

Fair enough,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 4/9] blk-mq: only run hw queues for blk-mq

2017-09-01 Thread Ming Lei
On Fri, Sep 01, 2017 at 10:16:34AM +0200, Johannes Thumshirn wrote:
> Hi Ming,
> 
> On Fri, Sep 01, 2017 at 01:27:23AM +0800, Ming Lei wrote:
> > -   blk_mq_run_hw_queues(q, false);
> > +   if (q->mq_ops)
> > +   blk_mq_run_hw_queues(q, false);
> 
> What speaks against putting the if (q->mq_ops) directly into
> blk_mq_run_hw_queues() so we can't accidentally call it from the sq path?
> 
> Just an idea, no hard preferences here.

Given this is the only case where blk_mq_run_hw_queues() is run on
!q->mq_ops now, I suggest checking q->mq_ops outside; otherwise
it would be a bit of overkill.

-- 
Ming


[PATCH 8/8] scsi: Introduce ZBC disk I/O scheduler

2017-09-01 Thread Damien Le Moal
The zoned I/O scheduler is mostly identical to mq-deadline and retains
the same configuration attributes. The main difference is that the
zoned scheduler will ensure that at any time at most one write
request (command) per sequential zone is in flight (has been issued to
the disk) in order to protect against sequential write reordering
potentially resulting from the concurrent execution of request dispatch
by multiple contexts.

This is achieved similarly to the legacy SCSI I/O path by write locking
zones when a write request is issued. Subsequent writes to the same
zone have to wait for the completion of the previously issued write
before being in turn dispatched to the disk. This ensures that
sequential writes are processed in the correct order without needing
any modification to the execution model of blk-mq. In addition, this
also protects against write reordering at the HBA level (e.g. AHCI).
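
The core of the mechanism, reduced to a sketch (the helper below is
illustrative, not the actual code of this patch):

	/*
	 * Called from the scheduler ->dispatch_request() method, i.e.
	 * before the request leaves the scheduler, unlike the legacy
	 * sd_zbc path which locks the zone after dispatch.
	 */
	static bool zoned_write_dispatch_ok(unsigned long *zones_wlock,
					    unsigned int zno, bool seq_zone)
	{
		if (!seq_zone)
			return true;	/* conventional zone: no constraint */
		/* allow at most one write in flight per sequential zone */
		return !test_and_set_bit(zno, zones_wlock);
	}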

This zoned scheduler code is added under the drivers/scsi directory so
that information managed using the scsi_disk structure can be directly
referenced. Doing so avoids cluttering the block layer with device-type
specific code.

Signed-off-by: Damien Le Moal 
---
 Documentation/block/zoned-iosched.txt |  48 ++
 block/Kconfig.iosched |  12 +
 drivers/scsi/Makefile |   1 +
 drivers/scsi/sd.h |   3 +-
 drivers/scsi/sd_zbc.c |  16 +-
 drivers/scsi/sd_zbc.h |  11 +-
 drivers/scsi/sd_zbc_iosched.c | 934 ++
 7 files changed, 1009 insertions(+), 16 deletions(-)
 create mode 100644 Documentation/block/zoned-iosched.txt
 create mode 100644 drivers/scsi/sd_zbc_iosched.c

diff --git a/Documentation/block/zoned-iosched.txt 
b/Documentation/block/zoned-iosched.txt
new file mode 100644
index ..b269125bdc61
--- /dev/null
+++ b/Documentation/block/zoned-iosched.txt
@@ -0,0 +1,48 @@
+Zoned I/O scheduler
+===================
+
+The Zoned I/O scheduler solves the write ordering problems of zoned block
+devices that are inherent to the absence of a global request queue lock in
+the blk-mq infrastructure. Multiple contexts may simultaneously dispatch
+write requests to the same sequential zone of a zoned block device,
+potentially breaking the sequential write order imposed by the device.
+
+The Zoned I/O scheduler is based on the mq-deadline scheduler. It shares the
+same tunables and behaves in a comparable manner. The main difference
+introduced with the zoned scheduler is the handling of write batches.
+Whereas mq-deadline will
+keep dispatching write requests to the device as long as the batching size
+allows and reads are not starved, the zoned scheduler introduces additional
+constraints:
+1) At most one write request can be issued to a sequential zone of the
+device. This ensures that no reordering of sequential writes to a sequential
+zone can happen once the write request leaves the scheduler internal queue (rb
+tree).
+2) The number of sequential zones that can be simultaneously written is limited
+to the device advertised maximum number of open zones. This additional
+condition
+avoids performance degradation due to excessive open zone resource use at the
+device level.
+
+These conditions do not apply to write requests targeting conventional zones.
+For these, the zoned scheduler behaves exactly like the mq-deadline scheduler.
+
+The zoned I/O scheduler cannot be used with regular block devices. It can only
+be used with host-managed or host-aware zoned block devices.
+Using the zoned I/O scheduler is mandatory with host-managed disks unless the
+disk user itself tightly controls write sequencing to sequential zones. The
+zoned scheduler will treat host-aware disks exactly the same way as
+host-managed devices. That is, even though host-aware disks can be randomly
+written, the zoned scheduler will still impose the limit of one write per
+sequential zone so that sequential write sequences are preserved.
+
+For host-managed disks, automating the use of the zoned scheduler can be done
+simply with a udev rule. An example of such a rule is shown below.
+
+# Set zoned scheduler for host-managed zoned block devices
+ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/zoned}=="host-managed", \
+   ATTR{queue/scheduler}="zoned"
+
+Zoned I/O scheduler tunables
+============================
+
+Tunables of the Zoned I/O scheduler are identical to those of the deadline
+I/O scheduler. See Documentation/block/deadline-iosched.txt for details.
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index fd2cefa47d35..b87c67dbf1f6 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -70,6 +70,18 @@ config MQ_IOSCHED_DEADLINE
---help---
  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_ZONED
+   tristate "Zoned I/O scheduler"
+   depends on BLK_DEV_ZONED
+   depends on SCSI_MOD
+   depends on BLK_DEV_SD
+  

[PATCH 6/8] scsi: sd_zbc: Limit zone write locking to sequential zones

2017-09-01 Thread Damien Le Moal
Zoned block devices have no write constraints for conventional zones
so write locking of conventional zones is not necessary and can hurt
performance. To avoid this, introduce the seq_zones bitmap to indicate
if a zone is a sequential one. Use this information to allow any write
to be issued to conventional zones, locking only sequential zones.

As the allocation of the seq_zones bitmap is identical to that of the
zones write lock bitmap, introduce the helper sd_zbc_alloc_zone_bitmap().
Using this helper, allocate and initialize the bitmaps on the second
revalidation pass, once the disk capacity and number of zones have
stabilized.
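
The helper itself is not visible in the hunks below; a plausible shape,
assuming the number of zones can be derived from the capacity and zone
size already tracked in struct scsi_disk (an assumption, not the actual
code of this patch):

	static unsigned long *sd_zbc_alloc_zone_bitmap(struct scsi_disk *sdkp)
	{
		u32 nr_zones = sdkp->capacity >> sdkp->zone_shift;

		return kzalloc(BITS_TO_LONGS(nr_zones) * sizeof(unsigned long),
			       GFP_KERNEL);
	}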

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd.h |   1 +
 drivers/scsi/sd_zbc.c | 113 ++
 drivers/scsi/sd_zbc.h |   9 
 3 files changed, 106 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 5940390d30f3..761d3fbf72ef 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -77,6 +77,7 @@ struct scsi_disk {
unsigned intzone_blocks;
unsigned intzone_shift;
unsigned long   *zones_wlock;
+   unsigned long   *seq_zones;
unsigned intzones_optimal_open;
unsigned intzones_optimal_nonseq;
unsigned intzones_max_open;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 0e6b5f5d14e7..3a9feadcc133 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -246,7 +246,7 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
sector_t sector = blk_rq_pos(rq);
sector_t zone_sectors = sd_zbc_zone_sectors(sdkp);
-   unsigned int zno = sd_zbc_zone_no(sdkp, sector);
+   unsigned int zno;
 
/*
 * Note: Checks of the alignment of the write command on
@@ -254,21 +254,20 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 */
 
/* Do not allow zone boundaries crossing on host-managed drives */
-   if (blk_queue_zoned_model(sdkp->disk->queue) == BLK_ZONED_HM &&
+   if (blk_queue_zoned_model(rq->q) == BLK_ZONED_HM &&
(sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors)
return BLKPREP_KILL;
 
/*
-* Do not issue more than one write at a time per
-* zone. This solves write ordering problems due to
-* the unlocking of the request queue in the dispatch
-* path in the non scsi-mq case. For scsi-mq, this
-* also avoids potential write reordering when multiple
-* threads running on different CPUs write to the same
-* zone (with a synchronized sequential pattern).
+* There is no write constraint on conventional zones, but do not issue
+* more than one write at a time per sequential zone. This avoids write
+* ordering problems due to the unlocking of the request queue in the
+* dispatch path of the non scsi-mq (legacy) case.
 */
-   if (sdkp->zones_wlock &&
-   test_and_set_bit(zno, sdkp->zones_wlock))
+   zno = sd_zbc_zone_no(sdkp, sector);
+   if (!test_bit(zno, sdkp->seq_zones))
+   return BLKPREP_OK;
+   if (test_and_set_bit(zno, sdkp->zones_wlock))
return BLKPREP_DEFER;
 
WARN_ON_ONCE(cmd->flags & SCMD_ZONE_WRITE_LOCK);
@@ -286,8 +285,9 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
struct request *rq = cmd->request;
struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 
-   if (sdkp->zones_wlock && cmd->flags & SCMD_ZONE_WRITE_LOCK) {
+   if (cmd->flags & SCMD_ZONE_WRITE_LOCK) {
unsigned int zno = sd_zbc_zone_no(sdkp, blk_rq_pos(rq));
+
WARN_ON_ONCE(!test_bit(zno, sdkp->zones_wlock));
cmd->flags &= ~SCMD_ZONE_WRITE_LOCK;
clear_bit_unlock(zno, sdkp->zones_wlock);
@@ -510,8 +510,67 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
return 0;
 }
 
+/*
+ * Initialize the sequential zone bitmap to allow identifying sequential zones.
+ */
+static int sd_zbc_setup_seq_zones(struct scsi_disk *sdkp)
+{
+   unsigned long *seq_zones;
+   sector_t block = 0;
+   unsigned char *buf;
+   unsigned char *rec;
+   unsigned int buf_len;
+   unsigned int list_length;
+   unsigned int n = 0;
+   int ret = -ENOMEM;
+
+   /* Allocate zone type bitmap */
+   seq_zones = sd_zbc_alloc_zone_bitmap(sdkp);
+   if (!seq_zones)
+   return -ENOMEM;
+
+   buf = kmalloc(SD_ZBC_BUF_SIZE, GFP_KERNEL);
+   if (!buf)
+   goto out;
+
+   while (block < sdkp->capacity) {
+
+   ret = sd_zbc_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, block);
+   if (ret)
+   goto out;
+
+   /* Parse reported zone descriptors */
+   list_length = get_unaligned_be32(&buf[0]) + 64;
+   

[PATCH 3/8] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h

2017-09-01 Thread Damien Le Moal
Move standard macro definitions for the zone types and zone conditions
to scsi_proto.h together with the definitions related to the
REPORT ZONES command.

While at it, define all values in the enums to be clear.

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd_zbc.c | 18 --
 include/scsi/scsi_proto.h | 46 +++---
 2 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8aa54779aac1..d445a57f99bd 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -37,24 +37,6 @@
 #include "sd.h"
 #include "scsi_priv.h"
 
-enum zbc_zone_type {
-   ZBC_ZONE_TYPE_CONV = 0x1,
-   ZBC_ZONE_TYPE_SEQWRITE_REQ,
-   ZBC_ZONE_TYPE_SEQWRITE_PREF,
-   ZBC_ZONE_TYPE_RESERVED,
-};
-
-enum zbc_zone_cond {
-   ZBC_ZONE_COND_NO_WP,
-   ZBC_ZONE_COND_EMPTY,
-   ZBC_ZONE_COND_IMP_OPEN,
-   ZBC_ZONE_COND_EXP_OPEN,
-   ZBC_ZONE_COND_CLOSED,
-   ZBC_ZONE_COND_READONLY = 0xd,
-   ZBC_ZONE_COND_FULL,
-   ZBC_ZONE_COND_OFFLINE,
-};
-
 /**
  * Convert a zone descriptor to a zone struct.
  */
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 8c285d9a06d8..20d7b3fbc351 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -301,19 +301,43 @@ struct scsi_lun {
 
 /* Reporting options for REPORT ZONES */
 enum zbc_zone_reporting_options {
-   ZBC_ZONE_REPORTING_OPTION_ALL = 0,
-   ZBC_ZONE_REPORTING_OPTION_EMPTY,
-   ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN,
-   ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN,
-   ZBC_ZONE_REPORTING_OPTION_CLOSED,
-   ZBC_ZONE_REPORTING_OPTION_FULL,
-   ZBC_ZONE_REPORTING_OPTION_READONLY,
-   ZBC_ZONE_REPORTING_OPTION_OFFLINE,
-   ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
-   ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE,
-   ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f,
+   ZBC_ZONE_REPORTING_OPTION_ALL   = 0x00,
+   ZBC_ZONE_REPORTING_OPTION_EMPTY = 0x01,
+   ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN = 0x02,
+   ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN = 0x03,
+   ZBC_ZONE_REPORTING_OPTION_CLOSED= 0x04,
+   ZBC_ZONE_REPORTING_OPTION_FULL  = 0x05,
+   ZBC_ZONE_REPORTING_OPTION_READONLY  = 0x06,
+   ZBC_ZONE_REPORTING_OPTION_OFFLINE   = 0x07,
+   /* 0x08 to 0x0f are reserved */
+   ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
+   ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE  = 0x11,
+   /* 0x12 to 0x3e are reserved */
+   ZBC_ZONE_REPORTING_OPTION_NON_WP= 0x3f,
 };
 
 #define ZBC_REPORT_ZONE_PARTIAL 0x80
 
+/* Zone types of REPORT ZONES zone descriptors */
+enum zbc_zone_type {
+   ZBC_ZONE_TYPE_CONV  = 0x1,
+   ZBC_ZONE_TYPE_SEQWRITE_REQ  = 0x2,
+   ZBC_ZONE_TYPE_SEQWRITE_PREF = 0x3,
+   /* 0x4 to 0xf are reserved */
+   ZBC_ZONE_TYPE_RESERVED  = 0x4,
+};
+
+/* Zone conditions of REPORT ZONES zone descriptors */
+enum zbc_zone_cond {
+   ZBC_ZONE_COND_NO_WP = 0x0,
+   ZBC_ZONE_COND_EMPTY = 0x1,
+   ZBC_ZONE_COND_IMP_OPEN  = 0x2,
+   ZBC_ZONE_COND_EXP_OPEN  = 0x3,
+   ZBC_ZONE_COND_CLOSED= 0x4,
+   /* 0x5 to 0xc are reserved */
+   ZBC_ZONE_COND_READONLY  = 0xd,
+   ZBC_ZONE_COND_FULL  = 0xe,
+   ZBC_ZONE_COND_OFFLINE   = 0xf,
+};
+
 #endif /* _SCSI_PROTO_H_ */
-- 
2.13.5



[PATCH 7/8] scsi: sd_zbc: Disable zone write locking with scsi-mq

2017-09-01 Thread Damien Le Moal
In the case of a ZBC disk used with scsi-mq, zone write locking does
not prevent write reordering in sequential zones. Unlike the legacy
case, zone locking can only be done after the command request is
removed from the scheduler dispatch queue. That is, at the time of
zone locking, the write command may already be out of order.

Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
used with scsi-mq. Write order guarantees can be provided by an
adapted I/O scheduler.

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd_zbc.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 3a9feadcc133..0f0a74fbd9c5 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -258,6 +258,10 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
(sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors)
return BLKPREP_KILL;
 
+   /* No write locking with scsi-mq */
+   if (rq->q->mq_ops)
+   return BLKPREP_OK;
+
/*
 * There is no write constraint on conventional zones, but do not issue
 * more than one write at a time per sequential zone. This avoids write
-- 
2.13.5



[PATCH 2/8] block: Fix declaration of blk-mq scheduler functions

2017-09-01 Thread Damien Le Moal
The functions blk_mq_sched_free_hctx_data(), blk_mq_sched_try_merge(),
blk_mq_sched_try_insert_merge() and blk_mq_sched_request_inserted() are
all exported symbols but declared only internally in
block/blk-mq-sched.h. Move their declaration to include/linux/blk-mq.h
to make them available to block external scheduler modules.

Signed-off-by: Damien Le Moal 
---
 block/blk-mq-sched.h   |  7 ---
 include/linux/blk-mq.h | 10 ++
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9267d0b7c197..3b725f8fbebe 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -4,16 +4,9 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
-void blk_mq_sched_free_hctx_data(struct request_queue *q,
-void (*exit)(struct blk_mq_hw_ctx *));
-
 void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio);
 
-void blk_mq_sched_request_inserted(struct request *rq);
-bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-   struct request **merged_request);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
-bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request 
*rq);
 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a369174a9679..128742552549 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -284,6 +284,16 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set 
*set, int nr_hw_queues);
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
 /*
+ * Scheduler helper functions.
+ */
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+void (*exit)(struct blk_mq_hw_ctx *));
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
+   struct request **merged_request);
+bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request 
*rq);
+void blk_mq_sched_request_inserted(struct request *rq);
+
+/*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
  */
-- 
2.13.5



[PATCH 5/8] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()

2017-09-01 Thread Damien Le Moal
The three values starting at byte 8 of the Zoned Block Device
Characteristics VPD page B6h are 32-bit values, not 64-bit. So use
get_unaligned_be32() to retrieve the values and not get_unaligned_be64().

Fixes: 89d947561077 ("sd: Implement support for ZBC devices")
Cc: 

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd_zbc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 810377007665..0e6b5f5d14e7 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -356,15 +356,15 @@ static int sd_zbc_read_zoned_characteristics(struct 
scsi_disk *sdkp,
if (sdkp->device->type != TYPE_ZBC) {
/* Host-aware */
sdkp->urswrz = 1;
-   sdkp->zones_optimal_open = get_unaligned_be64(&buf[8]);
-   sdkp->zones_optimal_nonseq = get_unaligned_be64(&buf[12]);
+   sdkp->zones_optimal_open = get_unaligned_be32(&buf[8]);
+   sdkp->zones_optimal_nonseq = get_unaligned_be32(&buf[12]);
sdkp->zones_max_open = 0;
} else {
/* Host-managed */
sdkp->urswrz = buf[4] & 1;
sdkp->zones_optimal_open = 0;
sdkp->zones_optimal_nonseq = 0;
-   sdkp->zones_max_open = get_unaligned_be64(&buf[16]);
+   sdkp->zones_max_open = get_unaligned_be32(&buf[16]);
}
 
return 0;
-- 
2.13.5



[PATCH 4/8] scsi: sd_zbc: Reorganize and cleanup

2017-09-01 Thread Damien Le Moal
Introduce sd_zbc.h for zbc related declarations (avoid cluttering sd.h).

Fix comments style in sd_zbc.c (do not use documentation format) and
add/fix comments to enhance explanations. Also remove a useless blank
line in sd_zbc_read_zones().

Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw
assignment and simplify nr_zones calculation.

Finally, use the min() macro in sd_zbc_check_zone_size() to get the report
zones reply size instead of the open coded version.

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd.c |  1 +
 drivers/scsi/sd.h | 53 
 drivers/scsi/sd_zbc.c | 85 +--
 drivers/scsi/sd_zbc.h | 75 +
 4 files changed, 117 insertions(+), 97 deletions(-)
 create mode 100644 drivers/scsi/sd_zbc.h

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 962cef13d406..6711d49fde79 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -68,6 +68,7 @@
 #include 
 
 #include "sd.h"
+#include "sd_zbc.h"
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 99c4dde9b6bf..5940390d30f3 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -272,57 +272,4 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, 
unsigned int a)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
-static inline int sd_is_zoned(struct scsi_disk *sdkp)
-{
-   return sdkp->zoned == 1 || sdkp->device->type == TYPE_ZBC;
-}
-
-#ifdef CONFIG_BLK_DEV_ZONED
-
-extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
-extern void sd_zbc_remove(struct scsi_disk *sdkp);
-extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
-extern int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd);
-extern void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd);
-extern int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd);
-extern int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
-extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
-   struct scsi_sense_hdr *sshdr);
-
-#else /* CONFIG_BLK_DEV_ZONED */
-
-static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
-   unsigned char *buf)
-{
-   return 0;
-}
-
-static inline void sd_zbc_remove(struct scsi_disk *sdkp) {}
-
-static inline void sd_zbc_print_zones(struct scsi_disk *sdkp) {}
-
-static inline int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
-{
-   /* Let the drive fail requests */
-   return BLKPREP_OK;
-}
-
-static inline void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd) {}
-
-static inline int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
-{
-   return BLKPREP_INVALID;
-}
-
-static inline int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
-{
-   return BLKPREP_INVALID;
-}
-
-static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
-  unsigned int good_bytes,
-  struct scsi_sense_hdr *sshdr) {}
-
-#endif /* CONFIG_BLK_DEV_ZONED */
-
 #endif /* _SCSI_DISK_H */
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d445a57f99bd..810377007665 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -26,22 +26,12 @@
 
 #include 
 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "sd.h"
-#include "scsi_priv.h"
-
-/**
+#include "sd_zbc.h"
+
+/*
  * Convert a zone descriptor to a zone struct.
  */
-static void sd_zbc_parse_report(struct scsi_disk *sdkp,
-   u8 *buf,
+static void sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
struct blk_zone *zone)
 {
struct scsi_device *sdp = sdkp->device;
@@ -63,8 +53,9 @@ static void sd_zbc_parse_report(struct scsi_disk *sdkp,
zone->wp = zone->start + zone->len;
 }
 
-/**
+/*
  * Issue a REPORT ZONES scsi command.
+ * For internal use.
  */
 static int sd_zbc_report_zones(struct scsi_disk *sdkp, unsigned char *buf,
   unsigned int buflen, sector_t lba)
@@ -105,6 +96,9 @@ static int sd_zbc_report_zones(struct scsi_disk *sdkp, 
unsigned char *buf,
return 0;
 }
 
+/*
+ * Prepare a REPORT ZONES scsi command for a REQ_OP_ZONE_REPORT request.
+ */
 int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
 {
struct request *rq = cmd->request;
@@ -147,6 +141,10 @@ int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
return BLKPREP_OK;
 }
 
+/*
+ * Process a REPORT ZONES scsi command reply, converting all reported
+ * zone descriptors to blk_zone structures.
+ */
 static void sd_zbc_report_zones_complete(struct scsi_cmnd *scmd,
 unsigned int good_bytes)
 {
@@ -202,17 +200,9 @@ static void sd_zbc_report_zones_complete(struct scsi_cmnd 
*scmd,
local_irq_restore(flags);
 }
 
-static inline sector_t sd_zbc_zone_sectors(struct 

[PATCH 1/8] block: Fix declaration of blk-mq debugfs functions

2017-09-01 Thread Damien Le Moal
__blk_mq_debugfs_rq_show() and blk_mq_debugfs_rq_show() are exported
symbols but declared in the block internal file block/blk-mq-debugfs.h.
Move the declaration of these functions to the public linux/blk-mq.h
file to make these functions available to other modules.

While at it, also move the definition of the blk_mq_debugfs_attr
structure to allow scheduler modules outside of block/ to define debugfs
attributes.

Signed-off-by: Damien Le Moal 
---
 block/blk-mq-debugfs.h | 14 --
 include/linux/blk-mq.h | 20 
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index a182e6f97565..85759aef53e1 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -3,20 +3,6 @@
 
 #ifdef CONFIG_BLK_DEBUG_FS
 
-#include 
-
-struct blk_mq_debugfs_attr {
-   const char *name;
-   umode_t mode;
-   int (*show)(void *, struct seq_file *);
-   ssize_t (*write)(void *, const char __user *, size_t, loff_t *);
-   /* Set either .show or .seq_ops. */
-   const struct seq_operations *seq_ops;
-};
-
-int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
-int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
-
 int blk_mq_debugfs_register(struct request_queue *q);
 void blk_mq_debugfs_unregister(struct request_queue *q);
 int blk_mq_debugfs_register_hctx(struct request_queue *q,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 14542308d25b..a369174a9679 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -5,6 +5,21 @@
 #include 
 #include 
 
+#ifdef CONFIG_BLK_DEBUG_FS
+
+#include <linux/seq_file.h>
+
+struct blk_mq_debugfs_attr {
+   const char *name;
+   umode_t mode;
+   int (*show)(void *, struct seq_file *);
+   ssize_t (*write)(void *, const char __user *, size_t, loff_t *);
+   /* Set either .show or .seq_ops. */
+   const struct seq_operations *seq_ops;
+};
+
+#endif
+
 struct blk_mq_tags;
 struct blk_flush_queue;
 
@@ -289,4 +304,9 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
for ((i) = 0; (i) < (hctx)->nr_ctx &&   \
 ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
 
+#ifdef CONFIG_BLK_DEBUG_FS
+int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
+int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
+#endif
+
 #endif
-- 
2.13.5



[PATCH 0/8] blk-mq/scsi-mq support for ZBC disks

2017-09-01 Thread Damien Le Moal
This series implements support for ZBC disks used through the blk-mq/scsi-mq
I/O path.

The current scsi level support of ZBC disks guarantees write request ordering
using a per-zone write lock which prevents simultaneously issuing multiple
write commands to a zone, which avoids reordering of sequential writes to
sequential zones. This method is however ineffective when scsi-mq is used with
zoned block devices. This is due to the different execution model of blk-mq
which passes a request to the scsi layer for dispatching after the request has
been removed from the I/O scheduler queue. That is, when the scsi layer tries
to lock the target zone of the request, the request may already be out of
order and zone write locking fails to prevent that.

Various approaches have been tried to solve this problem. All of them had the
serious disadvantage of cluttering blk-mq code with zoned block device specific
conditions and processing. As such extensive changes can only transform into
maintenance nightmares, a radically different solution is proposed here.

This series implements support in the form of a new "zoned" I/O
scheduler based on mq-deadline. Zoned is mostly identical to the mq-deadline
scheduler. It only differs from mq-deadline in the addition of a per-zone
write locking mechanism similar to that implemented in sd_zbc.c. The zoned
scheduler zone write locking mechanism is used for the exact same purpose as
the one in the scsi disk driver: limit writes per zone to one to avoid
reordering. The locking context however changes and is moved to the
dispatch_request method of the scheduler, that is, target zones of write
requests can be locked before the requests are issued. In effect, this results
in the same behavior as the legacy scsi path. Sequential write ordering is
preserved.

The zoned scheduler is added as part of the scsi code under drivers/scsi. This
allows access to information on the disk zones maintained within the device
scsi_disk structure as this information (e.g. zone types) cannot be retrieved
from the context of an I/O scheduler initialization method (init_queue()
method).

The series patches are as follows:
- The first 2 patches fix exported symbol declarations to allow compiling an
  I/O scheduler outside of the block/ directory. No functional changes from
  these patches.
- Patches 3 and 4 reorganize and clean up the scsi ZBC support to facilitate
  the introduction of the scheduler. No functional changes from these patches.
- Patch 5 fixes a bug.
- Patch 6 is an optimization for the legacy scsi path which avoids zone
  locking if a write request targets a conventional zone.
- Patch 7 disables zone write locking for disks accessed through scsi-mq.
- Patch 8 introduces the zoned scheduler.

Comments are as always very much appreciated.

Thank you.


Damien Le Moal (8):
  block: Fix declaration of blk-mq debugfs functions
  block: Fix declaration of blk-mq scheduler functions
  scsi: sd_zbc: Move ZBC declarations to scsi_proto.h
  scsi: sd_zbc: Reorganize and cleanup
  scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()
  scsi: sd_zbc: Limit zone write locking to sequential zones
  scsi: sd_zbc: Disable zone write locking with scsi-mq
  scsi: Introduce ZBC disk I/O scheduler

 Documentation/block/zoned-iosched.txt |  48 ++
 block/Kconfig.iosched |  12 +
 block/blk-mq-debugfs.h|  14 -
 block/blk-mq-sched.h  |   7 -
 drivers/scsi/Makefile |   1 +
 drivers/scsi/sd.c |   1 +
 drivers/scsi/sd.h |  57 +--
 drivers/scsi/sd_zbc.c | 236 +
 drivers/scsi/sd_zbc.h |  79 +++
 drivers/scsi/sd_zbc_iosched.c | 934 ++
 include/linux/blk-mq.h|  30 ++
 include/scsi/scsi_proto.h |  46 +-
 12 files changed, 1293 insertions(+), 172 deletions(-)
 create mode 100644 Documentation/block/zoned-iosched.txt
 create mode 100644 drivers/scsi/sd_zbc.h
 create mode 100644 drivers/scsi/sd_zbc_iosched.c

-- 
2.13.5



Re: [PATCH 3/6] pm80xx : Different SAS addresses for phys.

2017-09-01 Thread Jack Wang
2017-08-30 18:55 GMT+02:00 Viswas G :
> Hi Jack,
>
> This is a customer requirement. Since the SAS addresses of all the phys
> were the same, only one domain was formed even when the attached SAS
> addresses were different. If we assign different SAS addresses, this will
> cause duplicate domains unless we set strict_wide_ports to 1.
>
> Regards,
> Viswas G

Ok, understand now!

Acked-by: Jack Wang 
Thanks!
>
>
>> -Original Message-
>> From: Jack Wang [mailto:xjtu...@gmail.com]
>> Sent: Tuesday, August 29, 2017 4:55 PM
>> To: Viswas G 
>> Cc: linux-scsi@vger.kernel.org; Vasanthalakshmi Tharmarajan
>> 
>> Subject: Re: [PATCH 3/6] pm80xx : Different SAS addresses for phys.
>>
>> EXTERNAL EMAIL
>>
>>
>> 2015-01-30 7:06 GMT+01:00 Viswas G :
>> > Different SAS addresses are assigned for each set of phys.
>> >
>> > Signed-off-by: Viswas G 
>> > ---
>> >  drivers/scsi/pm8001/pm8001_init.c | 13 +
>> >  drivers/scsi/pm8001/pm80xx_hwi.c  |  3 +--
>> >  2 files changed, 10 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/drivers/scsi/pm8001/pm8001_init.c
>> b/drivers/scsi/pm8001/pm8001_init.c
>> > index 034b2f7d1135..d282f1562615 100644
>> > --- a/drivers/scsi/pm8001/pm8001_init.c
>> > +++ b/drivers/scsi/pm8001/pm8001_init.c
>> > @@ -132,7 +132,7 @@ static void pm8001_phy_init(struct
>> pm8001_hba_info *pm8001_ha, int phy_id)
>> > sas_phy->oob_mode = OOB_NOT_CONNECTED;
>> > sas_phy->linkrate = SAS_LINK_RATE_UNKNOWN;
>> > sas_phy->id = phy_id;
>> > -   sas_phy->sas_addr = _ha->sas_addr[0];
>> > -   sas_phy->sas_addr = &pm8001_ha->sas_addr[0];
>> > +   sas_phy->sas_addr = (u8 *)&phy->dev_sas_addr;
>> > sas_phy->frame_rcvd = &phy->frame_rcvd[0];
>> > sas_phy->lldd_phy = phy;
>> > @@ -593,10 +593,12 @@ static void  pm8001_post_sas_ha_init(struct
>> Scsi_Host *shost,
>> > for (i = 0; i < chip_info->n_phy; i++) {
>> > sha->sas_phy[i] = &pm8001_ha->phy[i].sas_phy;
>> > sha->sas_port[i] = &pm8001_ha->port[i].sas_port;
>> > +   sha->sas_phy[i]->sas_addr =
>> > +   (u8 *)&pm8001_ha->phy[i].dev_sas_addr;
>> > }
>> > sha->sas_ha_name = DRV_NAME;
>> > sha->dev = pm8001_ha->dev;
>> > -
>> > +   sha->strict_wide_ports = 1;
>> > sha->lldd_module = THIS_MODULE;
>> > sha->sas_addr = &pm8001_ha->sas_addr[0];
>> > sha->num_phys = chip_info->n_phy;
>> > @@ -613,6 +615,7 @@ static void  pm8001_post_sas_ha_init(struct
>> Scsi_Host *shost,
>> >  static void pm8001_init_sas_add(struct pm8001_hba_info *pm8001_ha)
>> >  {
>> > u8 i, j;
>> > +   u8 sas_add[8];
>> >  #ifdef PM8001_READ_VPD
>> > /* For new SPC controllers WWN is stored in flash vpd
>> > *  For SPC/SPCve controllers WWN is stored in EEPROM
>> > @@ -674,10 +677,12 @@ static void pm8001_init_sas_add(struct
>> pm8001_hba_info *pm8001_ha)
>> > pm8001_ha->sas_addr[j] =
>> > payload.func_specific[0x804 + i];
>> > }
>> > -
>> > +   memcpy(sas_add, pm8001_ha->sas_addr, SAS_ADDR_SIZE);
>> > for (i = 0; i < pm8001_ha->chip->n_phy; i++) {
>> > +   if (i && ((i % 4) == 0))
>> > +   sas_add[7] = sas_add[7] + 4;
>> > memcpy(&pm8001_ha->phy[i].dev_sas_addr,
>> > -   pm8001_ha->sas_addr, SAS_ADDR_SIZE);
>> > +   sas_add, SAS_ADDR_SIZE);
>> > PM8001_INIT_DBG(pm8001_ha,
>> > pm8001_printk("phy %d sas_addr = %016llx\n", i,
>> > pm8001_ha->phy[i].dev_sas_addr));
>> > diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c
>> b/drivers/scsi/pm8001/pm80xx_hwi.c
>> > index 8fb5ddf08cc4..a07b023c09bf 100644
>> > --- a/drivers/scsi/pm8001/pm80xx_hwi.c
>> > +++ b/drivers/scsi/pm8001/pm80xx_hwi.c
>> > @@ -3041,7 +3041,6 @@ hw_event_phy_down(struct pm8001_hba_info
>> *pm8001_ha, void *piomb)
>> > port->port_state = portstate;
>> > phy->identify.device_type = 0;
>> > phy->phy_attached = 0;
>> > -   memset(&phy->dev_sas_addr, 0, SAS_ADDR_SIZE);
>> > switch (portstate) {
>> > case PORT_VALID:
>> > break;
>> > @@ -4394,7 +4393,7 @@ pm80xx_chip_phy_start_req(struct
>> pm8001_hba_info *pm8001_ha, u8 phy_id)
>> > payload.sas_identify.dev_type = SAS_END_DEVICE;
>> > payload.sas_identify.initiator_bits = SAS_PROTOCOL_ALL;
>> > memcpy(payload.sas_identify.sas_addr,
>> > -   pm8001_ha->sas_addr, SAS_ADDR_SIZE);
>> > +   &pm8001_ha->phy[phy_id].dev_sas_addr, SAS_ADDR_SIZE);
>> > payload.sas_identify.phy_id = phy_id;
>> > ret = pm8001_mpi_build_cmd(pm8001_ha, circularQ, opcode, &payload, 0);
>> > return ret;
>> > --
>> > 

Re: [PATCH V2 1/2] scsi: sd: Fix sd_config_write_same()

2017-09-01 Thread Damien Le Moal
Martin,

On 9/1/17 12:36, Martin K. Petersen wrote:
> 
> Damien,
> 
>> +if (sdkp->max_ws_blocks &&
>> +sdkp->physical_block_size > logical_block_size) {
>> +/*
>> + * Reporting a maximum number of blocks that is not aligned
>> + * on the device physical size would cause a large write same
>> + * request to be split into physically unaligned chunks by
>> + * __blkdev_issue_write_zeroes() and __blkdev_issue_write_same()
>> + * even if the caller of these functions took care to align the
>> + * large request. So make sure the maximum reported is aligned
>> + * to the device physical block size. This is only an optional
>> + * optimization for regular disks, but this is mandatory to
>> + * avoid failure of large write same requests directed at
>> + * sequential write required zones of host-managed ZBC disks.
>> + */
>> +sector_t phys_mask =
>> +bytes_to_logical(sdkp->device,
>> + sdkp->physical_block_size) - 1;
>> +
>> +sdkp->max_ws_blocks &= ~phys_mask;
>> +}
>> +
>>  out:
>>  blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
>>   (logical_block_size >> 9));
> 
> ALIGN_DOWN(sdkp->max_ws_blocks, sdkp->physical_block_size)?

Sure. But let's use the same unit then :)

sdkp->max_ws_blocks =
ALIGN_DOWN(sdkp->max_ws_blocks,
   bytes_to_logical(sdkp->device, sdkp->physical_block_size));

Isn't it?
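
To make the units concrete (the numbers are just an example): with
512-byte logical blocks and 4096-byte physical blocks,
bytes_to_logical() gives 8, so

	sdkp->max_ws_blocks = ALIGN_DOWN(sdkp->max_ws_blocks, 8);

turns e.g. 0xffff into 0xfff8, i.e. the reported maximum becomes a
multiple of the physical block size expressed in logical blocks.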

Do you want me to resend?

Thank you.

-- 
Damien Le Moal,
Western Digital


[Resend v5 04/14] mpt3sas: Added support for nvme encapsulated request message.

2017-09-01 Thread Suganath Prabu S
* Mpt3sas driver uses the NVMe Encapsulated Request message to
send an NVMe command to an NVMe device attached to the IOC.

* Normal I/O commands like reads and writes are passed to the
controller as SCSI commands and the controller has the ability
to translate the commands to NVMe equivalent.

* This encapsulated NVMe command is used by applications to send
direct NVMe commands to NVMe drives.

Signed-off-by: Chaitra P B 
Signed-off-by: Suganath Prabu S 
Reviewed-by: Hannes Reinecke 
---
 drivers/scsi/mpt3sas/mpt3sas_base.c | 56 +-
 drivers/scsi/mpt3sas/mpt3sas_base.h |  1 +
 drivers/scsi/mpt3sas/mpt3sas_ctl.c  | 69 +
 3 files changed, 119 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
b/drivers/scsi/mpt3sas/mpt3sas_base.c
index dcf5157..9653cfa 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -557,6 +557,11 @@ _base_sas_ioc_info(struct MPT3SAS_ADAPTER *ioc, 
MPI2DefaultReply_t *mpi_reply,
frame_sz = sizeof(Mpi2SmpPassthroughRequest_t) + ioc->sge_size;
func_str = "smp_passthru";
break;
+   case MPI2_FUNCTION_NVME_ENCAPSULATED:
+   frame_sz = sizeof(Mpi26NVMeEncapsulatedRequest_t) +
+   ioc->sge_size;
+   func_str = "nvme_encapsulated";
+   break;
default:
frame_sz = 32;
func_str = "unknown";
@@ -985,7 +990,9 @@ _base_interrupt(int irq, void *bus_id)
if (request_desript_type ==
MPI25_RPY_DESCRIPT_FLAGS_FAST_PATH_SCSI_IO_SUCCESS ||
request_desript_type ==
-   MPI2_RPY_DESCRIPT_FLAGS_SCSI_IO_SUCCESS) {
+   MPI2_RPY_DESCRIPT_FLAGS_SCSI_IO_SUCCESS ||
+   request_desript_type ==
+   MPI26_RPY_DESCRIPT_FLAGS_PCIE_ENCAPSULATED_SUCCESS) {
cb_idx = _base_get_cb_idx(ioc, smid);
if ((likely(cb_idx < MPT_MAX_CALLBACKS)) &&
(likely(mpt_callbacks[cb_idx] != NULL))) {
@@ -3046,6 +3053,30 @@ _base_put_smid_hi_priority(struct MPT3SAS_ADAPTER *ioc, 
u16 smid,
 }
 
 /**
+ * _base_put_smid_nvme_encap - send NVMe encapsulated request to
+ *  firmware
+ * @ioc: per adapter object
+ * @smid: system request message index
+ *
+ * Return nothing.
+ */
+static void
+_base_put_smid_nvme_encap(struct MPT3SAS_ADAPTER *ioc, u16 smid)
+{
+   Mpi2RequestDescriptorUnion_t descriptor;
+   u64 *request = (u64 *)
+
+   descriptor.Default.RequestFlags =
+   MPI26_REQ_DESCRIPT_FLAGS_PCIE_ENCAPSULATED;
+   descriptor.Default.MSIxIndex =  _base_get_msix_index(ioc);
+   descriptor.Default.SMID = cpu_to_le16(smid);
+   descriptor.Default.LMID = 0;
+   descriptor.Default.DescriptorTypeDependent = 0;
+   _base_writeq(*request, &ioc->chip->RequestDescriptorPostLow,
+   &ioc->scsi_lookup_lock);
+}
+
+/**
  * _base_put_smid_default - Default, primarily used for config pages
  * @ioc: per adapter object
  * @smid: system request message index
@@ -3136,6 +3167,27 @@ _base_put_smid_hi_priority_atomic(struct MPT3SAS_ADAPTER 
*ioc, u16 smid,
 }
 
 /**
+ * _base_put_smid_nvme_encap_atomic - send NVMe encapsulated request to
+ *   firmware using Atomic Request Descriptor
+ * @ioc: per adapter object
+ * @smid: system request message index
+ *
+ * Return nothing.
+ */
+static void
+_base_put_smid_nvme_encap_atomic(struct MPT3SAS_ADAPTER *ioc, u16 smid)
+{
+   Mpi26AtomicRequestDescriptor_t descriptor;
+   u32 *request = (u32 *)&descriptor;
+
+   descriptor.RequestFlags = MPI26_REQ_DESCRIPT_FLAGS_PCIE_ENCAPSULATED;
+   descriptor.MSIxIndex = _base_get_msix_index(ioc);
+   descriptor.SMID = cpu_to_le16(smid);
+
+   writel(cpu_to_le32(*request), &ioc->chip->AtomicRequestDescriptorPost);
+}
+
+/**
  * _base_put_smid_default - Default, primarily used for config pages
  * use Atomic Request Descriptor
  * @ioc: per adapter object
@@ -5968,11 +6020,13 @@ mpt3sas_base_attach(struct MPT3SAS_ADAPTER *ioc)
ioc->put_smid_scsi_io = &_base_put_smid_scsi_io_atomic;
ioc->put_smid_fast_path = &_base_put_smid_fast_path_atomic;
ioc->put_smid_hi_priority = &_base_put_smid_hi_priority_atomic;
+   ioc->put_smid_nvme_encap = &_base_put_smid_nvme_encap_atomic;
} else {
ioc->put_smid_default = &_base_put_smid_default;
ioc->put_smid_scsi_io = &_base_put_smid_scsi_io;
ioc->put_smid_fast_path = &_base_put_smid_fast_path;
ioc->put_smid_hi_priority = &_base_put_smid_hi_priority;
+   ioc->put_smid_nvme_encap = &_base_put_smid_nvme_encap;
}
 
 
diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h 
b/drivers/scsi/mpt3sas/mpt3sas_base.h
index 

Re: [PATCH v4 00/14] mpt3sas driver NVMe support:

2017-09-01 Thread Suganath Prabu Subramani
Hi Martin,

On Fri, Sep 1, 2017 at 8:52 AM, Martin K. Petersen
 wrote:
>
> Hi Suganath,
>
>> Let me explain - NVME device fast path is possible in two ways.  IEEE
>> SGL and PRP SGL. Due to h/w constraint we choose IEEE SGL only for
>> smaller IO size.  Both above is true h/w Fast Path and no firmware
>> involvement.
>
>> Agree with you. We are planning to see if we can keep only simple Fast
>> Path using only PRP.
>
> That would be great, thank you!
>
>> Currently there is no performance issue for UNMAP translation in FW.
>
> Good!
>
>>> And yet patch 4 circumvents that statement by adding support for
>>> encapsulated commands to bypass the FW translation...
>>
>> This path is not due to performance reason. User wants to interact
>> with NVME drive in native NVME command for management.
>
> Patch 4 states:
>
> "This encapsulated NVMe command is used by applications to send direct
> NVMe commands to NVMe drives or for handling unmap where the translation
> at controller/firmware level is having performance issues."
>
> --

The statement in the description of patch 4 was added by mistake. We'll
correct the description and resend it.

> Martin K. Petersen  Oracle Linux Engineering

Thanks,
Suganath Prabu S


Re: Buffer overflow in the mptctl_replace_fw() function in linux kernel MPT ioctl driver

2017-09-01 Thread Dan Carpenter
On Fri, Sep 01, 2017 at 02:00:48PM +0800, Dison River wrote:
> newFwSize = ALIGN(karg.newImageSize, 4);

This is an integer overflow, but it's harmless...  As a static checker
developer this is where I would print a warning:
drivers/message/fusion/mptctl.c:1748 mptctl_replace_fw() warn: potential 
integer overflow from user '((karg.newImageSize)) + (((4)) - 1)'
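
To spell out why it is harmless, assuming a 32-bit unsigned
newImageSize (a sketch, not the driver code):

	u32 newImageSize = 0xfffffffe;		/* user controlled */
	u32 newFwSize = ALIGN(newImageSize, 4);	/* (x + 3) & ~3 wraps to 0 */

The subsequent mpt_alloc_fw_memory() and copy_from_user() then both use
the same wrapped newFwSize, so the sizes stay consistent and nothing is
written out of bounds.
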
I also caught the integer overflow from two days ago but there are too
many like this so I can't check them all.  In mpt_alloc_fw_memory()
there is another potential integer overflow when we do:

ioc->alloc_total += size;

But ->alloc_total is not used anywhere.

I don't see a buffer overflow here.

regards,
dan carpenter



Re: [PATCH 4/9] blk-mq: only run hw queues for blk-mq

2017-09-01 Thread Johannes Thumshirn
Hi Ming,

On Fri, Sep 01, 2017 at 01:27:23AM +0800, Ming Lei wrote:
> - blk_mq_run_hw_queues(q, false);
> + if (q->mq_ops)
> + blk_mq_run_hw_queues(q, false);

What speaks against putting the if (q->mq_ops) directly into
blk_mq_run_hw_queues() so we can't accidentally call it from the sq path?

Just an idea, no hard preferences here.

Byte,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] Remove Scsi_Host.uspace_req_q

2017-09-01 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH] Remove Scsi_Host.uspace_req_q

2017-09-01 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: Buffer overflow in the mptctl_replace_fw() function in linux kernel MPT ioctl driver

2017-09-01 Thread Kees Cook
On Thu, Aug 31, 2017 at 11:00 PM, Dison River  wrote:
> Hi:
> Buffer overflow in the mptctl_replace_fw() function in linux kernel
> MPT ioctl driver.
>
> In the mptctl_replace_fw() function, the kernel doesn't check the size of
> the "newFwSize" variable, which allows attackers to cause a denial of
> service via unspecified vectors that trigger copy_from_user() calls with
> improper length arguments.

Is there a specific path you see to trigger this?

>
>
> static int
> mptctl_replace_fw (unsigned long arg)
> {
> ..
> if (copy_from_user(&karg, uarg, sizeof(struct mpt_ioctl_replace_fw))) {
> printk(KERN_ERR MYNAM "%s@%d::mptctl_replace_fw - "
> "Unable to read in mpt_ioctl_replace_fw struct @ %p\n",
> __FILE__, __LINE__, uarg);
> return -EFAULT;
> }
>
> ..
>
> mpt_free_fw_memory(ioc);

This frees ioc->cached_fw.

>
> /* Allocate memory for the new FW image
>  */
> newFwSize = ALIGN(karg.newImageSize, 4);
>
> mpt_alloc_fw_memory(ioc, newFwSize);

This allocates ioc->cached_fw to newFwSize.

> ..
>
> if (copy_from_user(ioc->cached_fw, uarg->newImage, newFwSize)) {

This copies newFwSize bytes into a buffer that is allocated to newFwSize bytes.

What am I missing? I see some games getting played in
mpt_alloc_fw_memory(), but they seem to be covered due to the earlier
call to mpt_free_fw_memory()...

-Kees

> ///---> newFwSize can be controlled from userspace
> printk(MYIOC_s_ERR_FMT "%s@%d::mptctl_replace_fw - "
> "Unable to read in mpt_ioctl_replace_fw image "
> "@ %p\n", ioc->name, __FILE__, __LINE__, uarg);
> mpt_free_fw_memory(ioc);
> return -EFAULT;
> }
>
> ..
>
> return 0;
> }

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 0/9] block/scsi: safe SCSI quiescing

2017-09-01 Thread Ming Lei
On Fri, Sep 01, 2017 at 08:24:46AM +0200, oleksa...@natalenko.name wrote:
> Hi.
> 
> 01.09.2017 05:45, Ming Lei wrote:
> > Could you try the following patch against this patchset to see
> > if there is still the warning?
> 
> With this patch I wasn't able to trigger per-CPU-related warning.

That is great!

> 
> Also, for 8th patch you've written:
> 
> > I have sent one delta patch in list, which will only call
> > blk_queue_enter_live() iff SCSI device is in QUIESCE.
> 
> This refers to the patch I've just tested, right?

Yeah, you are right, thanks again for your test!

I will prepare V2 and send it out soon.

-- 
Ming


Re: [PATCH 0/9] block/scsi: safe SCSI quiescing

2017-09-01 Thread oleksandr

Hi.

01.09.2017 05:45, Ming Lei wrote:

Could you try the following patch against this patchset to see
if there is still the warning?


With this patch I wasn't able to trigger per-CPU-related warning.

Also, for 8th patch you've written:

I have sent one delta patch in list, which will only call 
blk_queue_enter_live() iff SCSI device is in QUIESCE.


This refers to the patch I've just tested, right?


Buffer overflow in the mptctl_replace_fw() function in linux kernel MPT ioctl driver

2017-09-01 Thread Dison River
Hi:
Buffer overflow in the mptctl_replace_fw() function in linux kernel
MPT ioctl driver.

In the mptctl_replace_fw() function, the kernel doesn't check the size of
the "newFwSize" variable, which allows attackers to cause a denial of
service via unspecified vectors that trigger copy_from_user() calls with
improper length arguments.


static int
mptctl_replace_fw (unsigned long arg)
{
..
if (copy_from_user(&karg, uarg, sizeof(struct mpt_ioctl_replace_fw))) {
printk(KERN_ERR MYNAM "%s@%d::mptctl_replace_fw - "
"Unable to read in mpt_ioctl_replace_fw struct @ %p\n",
__FILE__, __LINE__, uarg);
return -EFAULT;
}

..

mpt_free_fw_memory(ioc);

/* Allocate memory for the new FW image
 */
newFwSize = ALIGN(karg.newImageSize, 4);

mpt_alloc_fw_memory(ioc, newFwSize);
..

if (copy_from_user(ioc->cached_fw, uarg->newImage, newFwSize)) {
///---> newFwSize can be controlled from userspace
printk(MYIOC_s_ERR_FMT "%s@%d::mptctl_replace_fw - "
"Unable to read in mpt_ioctl_replace_fw image "
"@ %p\n", ioc->name, __FILE__, __LINE__, uarg);
mpt_free_fw_memory(ioc);
return -EFAULT;
}

..

return 0;
}