[PATCH V6 6/6] SCSI: set block queue at preempt only when SCSI device is put into quiesce

2017-09-26 Thread Ming Lei
Simply quiescing a SCSI device and waiting for completion of the I/O
dispatched to the SCSI queue isn't safe: it is easy to use up the
request pool because the requests allocated beforehand can't be
dispatched once the device is put into QUIESCE. Then no request
can be allocated for RQF_PREEMPT, and the system may hang somewhere,
such as when sending sync_cache or start_stop commands in the
system suspend path.

Before quiescing the SCSI device, this patch first puts the block queue
into preempt-only mode, so no new normal request can enter the queue any
more, and all pending requests are drained by the time blk_set_preempt_only()
returns. RQF_PREEMPT requests can then be allocated successfully while
the device is quiesced.

Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..82c51619f1b7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,10 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
 
-   req = blk_get_request(sdev->request_queue,
+   req = __blk_get_request(sdev->request_queue,
data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
+   BLK_REQ_PREEMPT);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -2928,12 +2929,28 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
int err;
 
+   /*
+* Simply quiescing SCSI device isn't safe, it is easy
+* to use up requests because all these allocated requests
+* can't be dispatched when device is put in QUIESCE.
+* Then no request can be allocated and we may hang
+* somewhere, such as system suspend/resume.
+*
+* So we set block queue in preempt only first, no new
+* normal request can enter queue any more, and all pending
+* requests are drained once blk_set_preempt_only()
+* returns. Only RQF_PREEMPT is allowed in preempt only mode.
+*/
+   blk_set_preempt_only(sdev->request_queue, true);
+
mutex_lock(&sdev->state_mutex);
err = scsi_device_set_state(sdev, SDEV_QUIESCE);
mutex_unlock(&sdev->state_mutex);
 
-   if (err)
+   if (err) {
+   blk_set_preempt_only(sdev->request_queue, false);
return err;
+   }
 
scsi_run_queue(sdev->request_queue);
while (atomic_read(&sdev->device_busy)) {
@@ -2964,6 +2981,8 @@ void scsi_device_resume(struct scsi_device *sdev)
scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
scsi_run_queue(sdev->request_queue);
mutex_unlock(&sdev->state_mutex);
+
+   blk_set_preempt_only(sdev->request_queue, false);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
-- 
2.9.5



[PATCH V6 3/6] block: pass flags to blk_queue_enter()

2017-09-26 Thread Ming Lei
We need to pass a PREEMPT flag to blk_queue_enter() so that
a request with RQF_PREEMPT can be allocated in the
following patch.
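
As a quick illustration of the new calling convention (a minimal sketch
using only the names introduced below; error handling trimmed):

	/* non-blocking entry, e.g. from timeout handling */
	if (blk_queue_enter(q, BLK_REQ_NOWAIT))
		return;			/* queue is frozen or dying */
	/* ... */
	blk_queue_exit(q);

	/* blocking entry, e.g. for a normal request allocation */
	ret = blk_queue_enter(q, 0);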

Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 10 ++
 block/blk-mq.c |  5 +++--
 block/blk-timeout.c|  2 +-
 fs/block_dev.c |  4 ++--
 include/linux/blkdev.h |  7 ++-
 5 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index abfba798ee03..be17b5bcf6e7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -766,7 +766,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+int blk_queue_enter(struct request_queue *q, unsigned flags)
 {
while (true) {
int ret;
@@ -774,7 +774,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
if (percpu_ref_tryget_live(&q->q_usage_counter))
return 0;
 
-   if (nowait)
+   if (flags & BLK_REQ_NOWAIT)
return -EBUSY;
 
/*
@@ -1405,7 +1405,8 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
+   BLK_REQ_NOWAIT : 0);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -2212,7 +2213,8 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bio->bi_disk->queue;
 
-   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
+   if (likely(blk_queue_enter(q, (bio->bi_opf & REQ_NOWAIT) ?
+   BLK_REQ_NOWAIT : 0) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 10c1f49f663d..45bff90e08f7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -384,7 +384,8 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
int ret;
 
-   ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+   ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
+   BLK_REQ_NOWAIT : 0);
if (ret)
return ERR_PTR(ret);
 
@@ -423,7 +424,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,
if (hctx_idx >= q->nr_hw_queues)
return ERR_PTR(-EIO);
 
-   ret = blk_queue_enter(q, true);
+   ret = blk_queue_enter(q, BLK_REQ_NOWAIT);
if (ret)
return ERR_PTR(ret);
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 17ec83bb0900..e803106a5e5b 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,7 +134,7 @@ void blk_timeout_work(struct work_struct *work)
struct request *rq, *tmp;
int next_set = 0;
 
-   if (blk_queue_enter(q, true))
+   if (blk_queue_enter(q, BLK_REQ_NOWAIT))
return;
spin_lock_irqsave(q->queue_lock, flags);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..98cf2d7ee9d3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -674,7 +674,7 @@ int bdev_read_page(struct block_device *bdev, sector_t 
sector,
if (!ops->rw_page || bdev_get_integrity(bdev))
return result;
 
-   result = blk_queue_enter(bdev->bd_queue, false);
+   result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
@@ -710,7 +710,7 @@ int bdev_write_page(struct block_device *bdev, sector_t 
sector,
 
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
-   result = blk_queue_enter(bdev->bd_queue, false);
+   result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 460294bb0fa5..107e2fd48486 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -857,6 +857,11 @@ enum {
BLKPREP_INVALID,/* invalid command, kill, return -EREMOTEIO */
 };
 
+/* passed to blk_queue_enter */
+enum {
+   BLK_REQ_NOWAIT = (1 << 0),
+};
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 /*
@@ -962,7 +967,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct 
gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 struct scsi_ioctl_command __user *);
 
-extern int blk_queue_enter(struct request_queue *q, bool nowait);
+extern int blk_queue_enter(struct request_queue *q, unsigned flags);

[PATCH V6 2/6] block: tracking request allocation with q_usage_counter

2017-09-26 Thread Ming Lei
This usage is basically the same as in blk-mq, so that we can
easily support freezing the legacy queue.

Also 'wake_up_all(&q->mq_freeze_wq)' has to be moved
into blk_set_queue_dying() since both legacy and blk-mq
may wait on the .mq_freeze_wq wait queue.
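
In short, the invariant this patch establishes for the legacy path
(a condensed sketch of the hunks below, mirroring what blk-mq already
does) is that every allocated request pins q_usage_counter until it is
freed:

	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
	if (ret)
		return ERR_PTR(ret);
	rq = get_request(q, op, NULL, gfp_mask);
	if (IS_ERR(rq))
		blk_queue_exit(q);	/* drop the reference on allocation failure */
	/* ... otherwise __blk_put_request() drops it when the request is freed */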

Signed-off-by: Ming Lei 
---
 block/blk-core.c | 14 ++
 block/blk-mq.c   |  7 ---
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index aebe676225e6..abfba798ee03 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,12 @@ void blk_set_queue_dying(struct request_queue *q)
}
spin_unlock_irq(q->queue_lock);
}
+
+   /*
+* We need to ensure that processes currently waiting on
+* the queue are notified as well.
+*/
+   wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -1392,16 +1398,21 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
   unsigned int op, gfp_t gfp_mask)
 {
struct request *rq;
+   int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
 
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+   if (ret)
+   return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
+   blk_queue_exit(q);
return rq;
}
 
@@ -1573,6 +1584,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
+   blk_queue_exit(q);
}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1854,8 +1866,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 * Grab a free request. This is might sleep but can not fail.
 * Returns with the queue unlocked.
 */
+   blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
if (IS_ERR(req)) {
+   blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fd9f86fc86d..10c1f49f663d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -256,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_wakeup_all(hctx->tags, true);
-
-   /*
-* If we are called because the queue has now been marked as
-* dying, we need to ensure that processes currently waiting on
-* the queue are notified as well.
-*/
-   wake_up_all(&q->mq_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
-- 
2.9.5



[PATCH V6 5/6] block: support PREEMPT_ONLY

2017-09-26 Thread Ming Lei
When the queue is in PREEMPT_ONLY mode, only RQF_PREEMPT requests
can be allocated and dispatched; other requests are not allowed
to enter the I/O path.

This is useful for supporting safe SCSI quiesce.

Part of this patch is from Bart's '[PATCH v4 4/7] block: Add the
QUEUE_FLAG_PREEMPT_ONLY request queue flag'.
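
A rough usage sketch (this is how the next patch uses it for SCSI; nothing
here is new API beyond blk_set_preempt_only() itself):

	blk_set_preempt_only(q, true);	/* freeze/drain, then admit only BLK_REQ_PREEMPT callers */
	/* ... device is quiesced; RQF_PREEMPT requests still get through ... */
	blk_set_preempt_only(q, false);	/* normal requests are admitted again */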

Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 25 +++--
 include/linux/blkdev.h |  5 +
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0a8396e8e4ff..3c0a7e7f172f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,17 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
+{
+   blk_mq_freeze_queue(q);
+   if (preempt_only)
+   queue_flag_set_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q);
+   else
+   queue_flag_clear_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q);
+   blk_mq_unfreeze_queue(q);
+}
+EXPORT_SYMBOL(blk_set_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q: The queue to run
@@ -771,9 +782,18 @@ int blk_queue_enter(struct request_queue *q, unsigned 
flags)
while (true) {
int ret;
 
+   /*
+* preempt_only flag has to be set after queue is frozen,
+* so it can be checked here lockless and safely
+*/
+   if (blk_queue_preempt_only(q)) {
+   if (!(flags & BLK_REQ_PREEMPT))
+   goto slow_path;
+   }
+
if (percpu_ref_tryget_live(&q->q_usage_counter))
return 0;
-
+ slow_path:
if (flags & BLK_REQ_NOWAIT)
return -EBUSY;
 
@@ -787,7 +807,8 @@ int blk_queue_enter(struct request_queue *q, unsigned flags)
smp_rmb();
 
ret = wait_event_interruptible(q->mq_freeze_wq,
-   !atomic_read(&q->mq_freeze_depth) ||
+   (!atomic_read(&q->mq_freeze_depth) &&
+   !blk_queue_preempt_only(q)) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d1ab950a7f72..8f5b15b2bf06 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -630,6 +630,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26  /* queue has been registered to a disk 
*/
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED        28  /* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY    29  /* only process REQ_PREEMPT
requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_STACKABLE)|   \
@@ -734,6 +735,10 @@ static inline void queue_flag_clear(unsigned int flag, 
struct request_queue *q)
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)  \
+   test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern void blk_set_preempt_only(struct request_queue *q, bool preempt_only);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.9.5



[PATCH V6 4/6] block: prepare for passing RQF_PREEMPT to request allocation

2017-09-26 Thread Ming Lei
RQF_PREEMPT is a bit special because such a request has to be
dispatched to the LLD even when the SCSI device is quiesced.

So this patch introduces __blk_get_request() and allows users to pass
the RQF_PREEMPT flag in, so that an RQF_PREEMPT request can still be
allocated when the queue is in PREEMPT_ONLY mode, which will be introduced
in the following patch.
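
A hedged example of the new helper, mirroring the scsi_execute() change
in patch 6/6 (ret is scsi_execute()'s default error return):

	req = __blk_get_request(sdev->request_queue,
				data_direction == DMA_TO_DEVICE ?
				REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
				BLK_REQ_PREEMPT);
	if (IS_ERR(req))
		return ret;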

Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 19 +--
 block/blk-mq.c |  3 +--
 include/linux/blk-mq.h |  7 ---
 include/linux/blkdev.h | 17 ++---
 4 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index be17b5bcf6e7..0a8396e8e4ff 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1395,7 +1395,8 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 }
 
 static struct request *blk_old_get_request(struct request_queue *q,
-  unsigned int op, gfp_t gfp_mask)
+  unsigned int op, gfp_t gfp_mask,
+  unsigned int flags)
 {
struct request *rq;
int ret = 0;
@@ -1405,8 +1406,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   BLK_REQ_NOWAIT : 0);
+   ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -1424,26 +1424,25 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-   gfp_t gfp_mask)
+struct request *__blk_get_request(struct request_queue *q, unsigned int op,
+ gfp_t gfp_mask, unsigned int flags)
 {
struct request *req;
 
+   flags |= gfp_mask & __GFP_DIRECT_RECLAIM ? 0 : BLK_REQ_NOWAIT;
if (q->mq_ops) {
-   req = blk_mq_alloc_request(q, op,
-   (gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   0 : BLK_MQ_REQ_NOWAIT);
+   req = blk_mq_alloc_request(q, op, flags);
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
-   req = blk_old_get_request(q, op, gfp_mask);
+   req = blk_old_get_request(q, op, gfp_mask, flags);
if (!IS_ERR(req) && q->initialize_rq_fn)
q->initialize_rq_fn(req);
}
 
return req;
 }
-EXPORT_SYMBOL(blk_get_request);
+EXPORT_SYMBOL(__blk_get_request);
 
 /**
  * blk_requeue_request - put a request back on queue
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45bff90e08f7..90b43f607e3c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -384,8 +384,7 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
int ret;
 
-   ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
-   BLK_REQ_NOWAIT : 0);
+   ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
if (ret)
return ERR_PTR(ret);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..066a676d7749 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -197,9 +197,10 @@ void blk_mq_free_request(struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
 
 enum {
-   BLK_MQ_REQ_NOWAIT   = (1 << 0), /* return when out of requests */
-   BLK_MQ_REQ_RESERVED = (1 << 1), /* allocate from reserved pool */
-   BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */
+   BLK_MQ_REQ_NOWAIT   = BLK_REQ_NOWAIT, /* return when out of 
requests */
+   BLK_MQ_REQ_PREEMPT  = BLK_REQ_PREEMPT, /* allocate for RQF_PREEMPT 
*/
+   BLK_MQ_REQ_RESERVED = (1 << BLK_REQ_MQ_START_BIT), /* allocate from 
reserved pool */
+   BLK_MQ_REQ_INTERNAL = (1 << (BLK_REQ_MQ_START_BIT + 1)), /* 
allocate internal/sched tag */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 107e2fd48486..d1ab950a7f72 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -859,7 +859,10 @@ enum {
 
 /* passed to blk_queue_enter */
 enum {
-   BLK_REQ_NOWAIT = (1 << 0),
+   BLK_REQ_NOWAIT  = (1 << 0),
+   BLK_REQ_PREEMPT = (1 << 1),
+   BLK_REQ_MQ_START_BIT= 2,
+   BLK_REQ_BITS_MASK   = (1U << BLK_REQ_MQ_START_BIT) - 1,
 };
 
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
@@ -944,8 +947,9 @@ extern void blk_rq_init(struct request_queue *q, struct 
request *rq);
 

[PATCH V6 1/6] blk-mq: only run hw queues for blk-mq

2017-09-26 Thread Ming Lei
This patch just makes it explicit.

Reviewed-by: Johannes Thumshirn 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..6fd9f86fc86d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
-   blk_mq_run_hw_queues(q, false);
+   if (q->mq_ops)
+   blk_mq_run_hw_queues(q, false);
}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5



[PATCH V6 0/6] block/scsi: safe SCSI quiescing

2017-09-26 Thread Ming Lei
Hi,

The current SCSI quiesce isn't safe: it is easy to trigger an I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for
RQF_PREEMPT can be dispatched to SCSI successfully, and
scsi_device_quiesce() simply waits for completion of the I/Os already
dispatched to the SCSI stack. That isn't enough at all.

New requests can still come in, but none of the allocated requests can
be dispatched successfully, so the request pool can easily be used up.

Then a request with RQF_PREEMPT can't be allocated and waits forever;
meanwhile scsi_device_resume() waits for completion of the RQF_PREEMPT
request, so the system hangs forever, for example during system suspend
or SCSI domain validation.

I/O hangs in both system suspend[1] and SCSI domain validation have been
reported before.

This patchset introduces a preempt-only mode and solves the issue by
allowing only RQF_PREEMPT requests during SCSI quiesce.

Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset fixes
both.
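
For reference, a condensed sketch of the idea, stitched together from
patches 5/6 and 6/6:

	/* scsi_device_quiesce(): gate the queue before quiescing the device */
	blk_set_preempt_only(sdev->request_queue, true);	/* drains, then marks the queue */
	scsi_device_set_state(sdev, SDEV_QUIESCE);

	/* blk_queue_enter(): afterwards only preempt requests get through */
	if (blk_queue_preempt_only(q) && !(flags & BLK_REQ_PREEMPT))
		goto slow_path;		/* wait until preempt-only mode is cleared */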

V6:
- borrow Bart's idea of preempt only, with a clean
  implementation (patch 5 / patch 6)
- no dependency on changes in external drivers, such as MD

V5:
- fix one tiny race by introducing blk_queue_enter_preempt_freeze()
given this change is small enough compared with V4, I added
tested-by directly

V4:
- reorganize patch order to make it more reasonable
- support nested preempt freeze, as required by SCSI transport spi
- check preempt freezing in the slow path of blk_queue_enter()
- add "SCSI: transport_spi: resume a quiesced device"
- wake up freeze queue in setting dying for both blk-mq and legacy
- rename blk_mq_[freeze|unfreeze]_queue() in one patch
- rename .mq_freeze_wq and .mq_freeze_depth
- improve comment

V3:
- introduce q->preempt_unfreezing to fix one bug of preempt freeze
- call blk_queue_enter_live() only when queue is preempt frozen
- cleanup a bit on the implementation of preempt freeze
- only patch 6 and 7 are changed

V2:
- drop the 1st patch in V1 because percpu_ref_is_dying() is
enough as pointed by Tejun
- introduce preempt version of blk_[freeze|unfreeze]_queue
- sync between preempt freeze and normal freeze
- fix warning from percpu-refcount as reported by Oleksandr


[1] https://marc.info/?t=150340250100013&r=3&w=2


Thanks,
Ming

Ming Lei (6):
  blk-mq: only run hw queues for blk-mq
  block: tracking request allocation with q_usage_counter
  block: pass flags to blk_queue_enter()
  block: prepare for passing RQF_PREEMPT to request allocation
  block: support PREEMPT_ONLY
  SCSI: set block queue at preempt only when SCSI device is put into
quiesce

 block/blk-core.c| 62 ++---
 block/blk-mq.c  | 14 ---
 block/blk-timeout.c |  2 +-
 drivers/scsi/scsi_lib.c | 25 +---
 fs/block_dev.c  |  4 ++--
 include/linux/blk-mq.h  |  7 +++---
 include/linux/blkdev.h  | 27 ++---
 7 files changed, 106 insertions(+), 35 deletions(-)

-- 
2.9.5



Re: [PATCH v4 5/7] scsi: Reduce suspend latency

2017-09-26 Thread Ming Lei
On Wed, Sep 27, 2017 at 03:43:11AM +, Bart Van Assche wrote:
> On Wed, 2017-09-27 at 06:23 +0800, Ming Lei wrote:
> > mutex_lock(&sdev->state_mutex);
> > err = scsi_device_set_state(sdev, SDEV_QUIESCE);
> > if (err == 0)
> > blk_set_preempt_only(q, true);
> > mutex_unlock(&sdev->state_mutex);
> > 
> > if (err)
> > return err;
> > 
> > blk_mq_freeze_queue(q);
> > blk_mq_unfreeze_queue(q);
> > 
> > Any requests allocated before scsi_device_set_state() and
> > dispatched after scsi_device_set_state() can't be drained up
> > any more, then blk_mq_freeze_queue() will wait forever for the
> > drain up, so not only this patchset can't fix the current quiesce
> > issue, but also introduces new I/O hang in scsi_device_quiesce().
> 
> That's a good catch, but fortunately this is very easy to fix: move the
> blk_mq_freeze_queue() call before the mutex_lock() and scsi_device_set_state()
> calls. The error path will also need some adjustment.

I am also working towards that direction, patches have been ready.

-- 
Ming


Re: [PATCH v4 5/7] scsi: Reduce suspend latency

2017-09-26 Thread Bart Van Assche
On Wed, 2017-09-27 at 06:23 +0800, Ming Lei wrote:
> mutex_lock(&sdev->state_mutex);
> err = scsi_device_set_state(sdev, SDEV_QUIESCE);
> if (err == 0)
> blk_set_preempt_only(q, true);
> mutex_unlock(&sdev->state_mutex);
> 
> if (err)
> return err;
> 
> blk_mq_freeze_queue(q);
> blk_mq_unfreeze_queue(q);
> 
> Any requests allocated before scsi_device_set_state() and
> dispatched after scsi_device_set_state() can't be drained up
> any more, then blk_mq_freeze_queue() will wait forever for the
> drain up, so not only this patchset can't fix the current quiesce
> issue, but also introduces new I/O hang in scsi_device_quiesce().

That's a good catch, but fortunately this is very easy to fix: move the
blk_mq_freeze_queue() call before the mutex_lock() and scsi_device_set_state()
calls. The error path will also need some adjustment.
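
A rough sketch of that reordering (hypothetical, just to make the
suggestion concrete; the error path still needs the adjustment mentioned
above):

	blk_mq_freeze_queue(q);		/* drain everything allocated so far */

	mutex_lock(&sdev->state_mutex);
	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
	if (err == 0)
		blk_set_preempt_only(q, true);
	mutex_unlock(&sdev->state_mutex);

	blk_mq_unfreeze_queue(q);

	return err;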

Bart.

Re: [PATCH 1/5] bcache: don't write back data if reading it failed

2017-09-26 Thread tang . junhui
From: Tang Junhui 

It looks good to me,
I have noted this issue before.
Thanks.

---
Tang Junhui

> If an IO operation fails, and we didn't successfully read data from the
> cache, don't writeback invalid/partial data to the backing disk.
> 
> Signed-off-by: Michael Lyle 
> ---
>  drivers/md/bcache/writeback.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index e663ca082183..eea49bf36401 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -179,13 +179,20 @@ static void write_dirty(struct closure *cl)
>struct dirty_io *io = container_of(cl, struct dirty_io, cl);
>struct keybuf_key *w = io->bio.bi_private;
>  
> -  dirty_init(w);
> -  bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
> -  io->bio.bi_iter.bi_sector = KEY_START(&w->key);
> -  bio_set_dev(&io->bio, io->dc->bdev);
> -  io->bio.bi_end_io   = dirty_endio;
> +  /* IO errors are signalled using the dirty bit on the key.
> +   * If we failed to read, we should not attempt to write to the
> +   * backing device.  Instead, immediately go to 
> write_dirty_finish
> +   * to clean up.
> +   */
> +  if (KEY_DIRTY(&w->key)) {
> +  dirty_init(w);
> +  bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
> +  io->bio.bi_iter.bi_sector = KEY_START(&w->key);
> +  bio_set_dev(&io->bio, io->dc->bdev);
> +  io->bio.bi_end_io   = dirty_endio;
>  
> -  closure_bio_submit(&io->bio, cl);
> +  closure_bio_submit(&io->bio, cl);
> +  }
>  
>continue_at(cl, write_dirty_finish, 
> io->dc->writeback_write_wq);
>  }
> -- 


Re: [PATCH v4 1/7] md: Make md resync and reshape threads freezable

2017-09-26 Thread Bart Van Assche
On Tue, 2017-09-26 at 22:59 +0800, Ming Lei wrote:
> On Tue, Sep 26, 2017 at 02:42:07PM +, Bart Van Assche wrote:
> > On Tue, 2017-09-26 at 19:17 +0800, Ming Lei wrote:
> > > Just test this patch a bit and the following failure of freezing task
> > > is triggered during suspend: [ ... ]
> > 
> > What kernel version did you start from and which patches were applied on 
> > top of
> > that kernel? Only patch 1/7 or all seven patches? What storage 
> > configuration did
> 
> It is v4.14-rc1+, and top commit is 8d93c7a43157, with all your 7 patches
> applied.
> 
> > you use in your test and what command(s) did you use to trigger suspend?
> 
> Follows my pm test script:
> 
>   #!/bin/sh
>   
>   echo check > /sys/block/md127/md/sync_action
>   
>   mkfs.ext4 -F /dev/md127
>   
>   mount /dev/md0 /mnt/data
>   
>   dd if=/dev/zero of=/mnt/data/d1.img bs=4k count=128k&
>   
>   echo 9 > /proc/sys/kernel/printk
>   echo devices > /sys/power/pm_test
>   echo mem > /sys/power/state
>   
>   wait
>   umount /mnt/data
> 
> Storage setting:
> 
>   sudo mdadm --create /dev/md/test /dev/sda /dev/sdb --level=1 
> --raid-devices=2
>   both /dev/sda and /dev/sdb are virtio-scsi.

Thanks for the detailed reply. I have been able to reproduce the freeze failure
you reported. The output of SysRq-t showed me that the md reboot notifier was
waiting for the frozen md sync thread and that this caused the freeze failure.
So I have started testing the patch below instead of the patch at the start of
this e-mail thread:


Subject: [PATCH] md: Stop resync and reshape upon system freeze

Some people use the md driver on laptops and use the suspend and
resume functionality. Since it is essential that submitting of
new I/O requests stops before a hibernation image is created,
interrupt the md resync and reshape actions if the system is
being frozen. Note: the resync and reshape will restart after
the system is resumed and a message similar to the following
will appear in the system log:

md: md0: data-check interrupted.

---
 drivers/md/md.c | 34 --
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 08fcaebc61bd..1e9d50f7345e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include <linux/freezer.h>
 
 #include 
 #include "md.h"
@@ -8103,6 +8104,12 @@ void md_allow_write(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(md_allow_write);
 
+static bool md_sync_interrupted(struct mddev *mddev)
+{
+   return test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
+   freezing(current);
+}
+
 #define SYNC_MARKS 10
 #defineSYNC_MARK_STEP  (3*HZ)
 #define UPDATE_FREQUENCY (5*60*HZ)
@@ -8133,6 +8140,8 @@ void md_do_sync(struct md_thread *thread)
return;
}
 
+   set_freezable();
+
if (mddev_is_clustered(mddev)) {
ret = md_cluster_ops->resync_start(mddev);
if (ret)
@@ -8184,7 +8193,7 @@ void md_do_sync(struct md_thread *thread)
mddev->curr_resync = 2;
 
try_again:
-   if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+   if (md_sync_interrupted(mddev))
goto skip;
for_each_mddev(mddev2, tmp) {
if (mddev2 == mddev)
@@ -8208,7 +8217,7 @@ void md_do_sync(struct md_thread *thread)
 * be caught by 'softlockup'
 */
prepare_to_wait(&resync_wait, &wq, TASK_INTERRUPTIBLE);
-   if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
+   if (!md_sync_interrupted(mddev) &&
mddev2->curr_resync >= mddev->curr_resync) {
if (mddev2_minor != mddev2->md_minor) {
mddev2_minor = mddev2->md_minor;
@@ -8335,8 +8344,7 @@ void md_do_sync(struct md_thread *thread)
sysfs_notify(&mddev->kobj, NULL, "sync_completed");
}
 
-   while (j >= mddev->resync_max &&
-  !test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+   while (j >= mddev->resync_max && !md_sync_interrupted(mddev)) {
/* As this condition is controlled by user-space,
 * we can block indefinitely, so use '_interruptible'
 * to avoid triggering warnings.
@@ -8348,7 +8356,7 @@ void md_do_sync(struct md_thread *thread)
&mddev->recovery));
}
 
-   if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+   if (md_sync_interrupted(mddev))
break;
 
sectors = mddev->pers->sync_request(mddev, j, &skipped);
@@ -8362,7 +8370,7 @@ void 

Re: [PATCH V9 12/15] mmc: core: Export mmc_retune_hold_now() and mmc_retune_release()

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:37 PM, Adrian Hunter  wrote:

> Export mmc_retune_hold_now() and mmc_retune_release() so they can be used
> by the block driver.
>
> Signed-off-by: Adrian Hunter 

Actually you convert mmc_retune_hold_now() into a static inline, so the commit
message is a bit ambiguous but who cares.
Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH V9 10/15] mmc: core: Export mmc_start_bkops()

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:36 PM, Adrian Hunter  wrote:

> Export mmc_start_bkops() so that the block driver can use it.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH V9 11/15] mmc: core: Export mmc_start_request()

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:37 PM, Adrian Hunter  wrote:

> Export mmc_start_request() so that the block driver can use it.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH V9 09/15] mmc: core: Add parameter use_blk_mq

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:36 PM, Adrian Hunter  wrote:

> Until mmc has blk-mq support fully implemented and tested, add a
> parameter use_blk_mq, default to false unless config option MMC_MQ_DEFAULT
> is selected.
>
> Signed-off-by: Adrian Hunter 

> +config MMC_MQ_DEFAULT
> +   bool "MMC: use blk-mq I/O path by default"
> +   depends on MMC && BLOCK

I would say:
default y

Why not. SCSI is starting to enable this by default so IMO we should
not take the
intermediate step of having this as optional. Otherwise it never gets tested.

Set it to default y and after two kernel releases, if nothing happens, we simply
delete the old block layer path.

> +#ifdef CONFIG_MMC_MQ_DEFAULT
> +bool mmc_use_blk_mq = true;
> +#else
> +bool mmc_use_blk_mq = false;
> +#endif
> +module_param_named(use_blk_mq, mmc_use_blk_mq, bool, S_IWUSR | S_IRUGO);

Are people really modprobing this so it needs to be a module parameter?

Maybe I'm the only developer stupid enough to just recompile and reboot
the whole kernel, I guess this makes sense if you're testing on the same
machine you're developing on (no cross-compilation and remote target)
which I guess is what some Intel people are doing with their laptops.

Yours,
Linus Walleij


Re: [PATCH v4 0/7] Make suspend and resume safe for md-on-SCSI

2017-09-26 Thread Ming Lei
On Tue, Sep 26, 2017 at 08:17:59PM +, Bart Van Assche wrote:
> On Tue, 2017-09-26 at 22:54 +0800, Ming Lei wrote:
> > On Tue, Sep 26, 2017 at 02:29:06PM +, Bart Van Assche wrote:
> > > On Tue, 2017-09-26 at 17:15 +0800, Ming Lei wrote:
> > > > No, if you only address issue on MD device, that is definitely not
> > > > alternative of my patchset.
> > > 
> > > A clarification: my patch series not only fixes suspend and resume for 
> > > md-on-SCSI
> > > but also for all other scenario's in which resume locks up due to no tag 
> > > being
> > > available when the SCSI core tries to execute a power management 
> > > command.> 
> > I do care about if this patchset can fix non-MD cases, like
> > btrfs/raid, or transport_spi, both are real reports. Could you
> > make sure if your patchset can cover this non-MD devices?
> 
> Yes, I'm sure this patch series covers non-MD devices.
> 
> It is easy to see that this patch series also covers SCSI parallel domain
> validation.

Now I don't believe this patchset can do that, see my comment on your
patch 5:

https://www.mail-archive.com/linux-block@vger.kernel.org/msg13624.html

Worse still, your patch introduces a new I/O hang inside SCSI quiesce.

> 
> Regarding BTRFS RAID: you should know that filesystems are responsible for
> implementing freeze / thaw support themselves. I think a filesystem must
> surround metadata operations by sb_start_write() / sb_end_write() and that
> it optionally can provide freeze_super / thaw_super callbacks in
> struct super_operations. In the BTRFS changelog there are several fixes
> with regard to freeze / thaw so I think this aspect of BTRFS has been tested.
> 
> Bart.

-- 
Ming


Re: [PATCH v4 5/7] scsi: Reduce suspend latency

2017-09-26 Thread Ming Lei
On Mon, Sep 25, 2017 at 01:29:22PM -0700, Bart Van Assche wrote:
> Avoid that it can take 200 ms too long to wait for requests to
> finish. Note: blk_mq_freeze_queue() uses a wait queue to wait
> for ongoing requests to finish.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Martin K. Petersen 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  drivers/scsi/scsi_lib.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 62f905b22821..c261498606c4 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -2927,6 +2927,7 @@ static void scsi_wait_for_queuecommand(struct 
> scsi_device *sdev)
>  int
>  scsi_device_quiesce(struct scsi_device *sdev)
>  {
> + struct request_queue *q = sdev->request_queue;
>   int err;
>  
>   mutex_lock(&sdev->state_mutex);
> @@ -2936,11 +2937,8 @@ scsi_device_quiesce(struct scsi_device *sdev)
>   if (err)
>   return err;
>  
> - scsi_run_queue(sdev->request_queue);
> - while (atomic_read(&sdev->device_busy)) {
> - msleep_interruptible(200);
> - scsi_run_queue(sdev->request_queue);
> - }
> + blk_mq_freeze_queue(q);
> + blk_mq_unfreeze_queue(q);
>   return 0;

I see the serious issue with this patch now, and it should have
been found much earlier; it was missed just because the setting of the
quiesce state isn't shown in the patch context. Sorry for finding it so late.

Once the patch is applied, scsi_device_quiesce() becomes
the following shape:

mutex_lock(&sdev->state_mutex);
err = scsi_device_set_state(sdev, SDEV_QUIESCE);
if (err == 0)
blk_set_preempt_only(q, true);
mutex_unlock(&sdev->state_mutex);

if (err)
return err;

blk_mq_freeze_queue(q);
blk_mq_unfreeze_queue(q);

Any requests allocated before scsi_device_set_state() and
dispatched after scsi_device_set_state() can't be drained up
any more, then blk_mq_freeze_queue() will wait forever for the
drain up, so not only this patchset can't fix the current quiesce
issue, but also introduces new I/O hang in scsi_device_quiesce().

Now this approach looks broken fundamentally.
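
Spelling the hang out as a timeline (illustrative only):

	/*
	 *   I/O submitter                         scsi_device_quiesce()
	 *   -------------                         ---------------------
	 *   blk_queue_enter() succeeds,
	 *   request allocated
	 *                                         scsi_device_set_state(SDEV_QUIESCE)
	 *                                         blk_set_preempt_only(q, true)
	 *   request is dispatched, deferred by
	 *   scsi_prep_state_check() because of
	 *   SDEV_QUIESCE, and stays in the queue
	 *                                         blk_mq_freeze_queue(q)
	 *                                           -> waits forever for q_usage_counter
	 */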

NAK.

-- 
Ming


Re: [PATCH V9 00/15] mmc: Add Command Queue support

2017-09-26 Thread Ulf Hansson
On 22 September 2017 at 14:36, Adrian Hunter  wrote:
> Hi
>
> Here is V9 of the hardware command queue patches without the software
> command queue patches, now using blk-mq and now with blk-mq support for
> non-CQE I/O.
>
> HW CMDQ offers 25% - 50% better random multi-threaded I/O.  I see a slight
> 2% drop in sequential read speed but no change to sequential write.
>
> Non-CQE blk-mq showed a 3% decrease in sequential read performance.  This
> seemed to be coming from the inferior latency of running work items compared
> with a dedicated thread.  Hacking blk-mq workqueue to be unbound reduced the
> performance degradation from 3% to 1%.
>
> While we should look at changing blk-mq to give better workqueue performance,
> a bigger gain is likely to be made by adding a new host API to enable the
> next already-prepared request to be issued directly from within ->done()
> callback of the current request.

I have looked at patches 1->8, and those look nice to me so I have
applied those for next. I will do my best to review the rest asap,
however I am currently traveling so in worst case it will have wait
until next week.

Thanks and kind regards!
Uffe

>
>
> Changes since V8:
> Re-based
>   mmc: core: Introduce host claiming by context
> Slightly simplified as per Ulf
>   mmc: core: Export mmc_retune_hold_now() and mmc_retune_release()
> New patch.
>   mmc: block: Add CQE and blk-mq support
> Fix missing ->post_req() on the error path
>
> Changes since V7:
> Re-based
>   mmc: core: Introduce host claiming by context
> Slightly simplified
>   mmc: core: Add parameter use_blk_mq
> New patch.
>   mmc: core: Remove unnecessary host claim
> New patch.
>   mmc: core: Export mmc_start_bkops()
> New patch.
>   mmc: core: Export mmc_start_request()
> New patch.
>   mmc: block: Add CQE and blk-mq support
> Add blk-mq support for non_CQE requests
>
> Changes since V6:
>   mmc: core: Introduce host claiming by context
> New patch.
>   mmc: core: Move mmc_start_areq() declaration
> Dropped because it has been applied
>   mmc: block: Fix block status codes
> Dropped because it has been applied
>   mmc: host: Add CQE interface
> Dropped because it has been applied
>   mmc: core: Turn off CQE before sending commands
> Dropped because it has been applied
>   mmc: block: Factor out mmc_setup_queue()
> New patch.
>   mmc: block: Add CQE support
> Drop legacy support and add blk-mq support
>
> Changes since V5:
> Re-based
>   mmc: core: Add mmc_retune_hold_now()
> Dropped because it has been applied
>   mmc: core: Add members to mmc_request and mmc_data for CQE's
> Dropped because it has been applied
>   mmc: core: Move mmc_start_areq() declaration
> New patch at Ulf's request
>   mmc: block: Fix block status codes
> Another un-related patch
>   mmc: host: Add CQE interface
> Move recovery_notifier() callback to struct mmc_request
>   mmc: core: Add support for handling CQE requests
> Roll __mmc_cqe_request_done() into mmc_cqe_request_done()
> Move function declarations requested by Ulf
>   mmc: core: Remove unused MMC_CAP2_PACKED_CMD
> Dropped because it has been applied
>   mmc: block: Add CQE support
> Add explanation to commit message
> Adjustment for changed recovery_notifier() callback
>   mmc: cqhci: support for command queue enabled host
> Adjustment for changed recovery_notifier() callback
>   mmc: sdhci-pci: Add CQHCI support for Intel GLK
> Add DCMD capability for Intel controllers except GLK
>
> Changes since V4:
>   mmc: core: Add mmc_retune_hold_now()
> Add explanation to commit message.
>   mmc: host: Add CQE interface
> Add comments to callback declarations.
>   mmc: core: Turn off CQE before sending commands
> Add explanation to commit message.
>   mmc: core: Add support for handling CQE requests
> Add comments as requested by Ulf.
>   mmc: core: Remove unused MMC_CAP2_PACKED_CMD
> New patch.
>   mmc: mmc: Enable Command Queuing
> Adjust for removal of MMC_CAP2_PACKED_CMD.
> Add a comment about Packed Commands.
>   mmc: mmc: Enable CQE's
> Remove un-necessary check for MMC_CAP2_CQE
>   mmc: block: Use local variables in mmc_blk_data_prep()
> New patch.
>   mmc: block: Prepare CQE data
> Adjust due to "mmc: block: Use local variables in mmc_blk_data_prep()"
> Remove priority setting.
> Add explanation to commit message.
>   mmc: cqhci: support for command queue enabled host
> Fix transfer descriptor setting in cqhci_set_tran_desc() for 32-bit 
> DMA
>
> Changes since V3:
> Adjusted 

Re: [PATCH v4 0/7] Make suspend and resume safe for md-on-SCSI

2017-09-26 Thread Bart Van Assche
On Tue, 2017-09-26 at 22:54 +0800, Ming Lei wrote:
> On Tue, Sep 26, 2017 at 02:29:06PM +, Bart Van Assche wrote:
> > On Tue, 2017-09-26 at 17:15 +0800, Ming Lei wrote:
> > > No, if you only address issue on MD device, that is definitely not
> > > alternative of my patchset.
> > 
> > A clarification: my patch series not only fixes suspend and resume for 
> > md-on-SCSI
> > but also for all other scenario's in which resume locks up due to no tag 
> > being
> > available when the SCSI core tries to execute a power management command.> 
> I do care about if this patchset can fix non-MD cases, like
> btrfs/raid, or transport_spi, both are real reports. Could you
> make sure if your patchset can cover this non-MD devices?

Yes, I'm sure this patch series covers non-MD devices.

It is easy to see that this patch series also covers SCSI parallel domain
validation.

Regarding BTRFS RAID: you should know that filesystems are responsible for
implementing freeze / thaw support themselves. I think a filesystem must
surround metadata operations by sb_start_write() / sb_end_write() and that
it optionally can provide freeze_super / thaw_super callbacks in
struct super_operations. In the BTRFS changelog there are several fixes
with regard to freeze / thaw so I think this aspect of BTRFS has been tested.
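
As a minimal illustration of that pattern (hypothetical filesystem code;
only sb_start_write()/sb_end_write() are real kernel APIs here):

	static int example_update_metadata(struct super_block *sb)
	{
		int err;

		sb_start_write(sb);		/* blocks while the superblock is frozen */
		err = do_metadata_update(sb);	/* hypothetical helper */
		sb_end_write(sb);

		return err;
	}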

Bart.

Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Michael Lyle
Thanks everyone--

Yes, this looks good to me and works in testing.

Mike

On Tue, Sep 26, 2017 at 12:46 AM, Coly Li  wrote:
> On 2017/9/26 12:38 PM, Michael Lyle wrote:
>> I believe this introduces a critical bug.
>>
>> cl->list is used to link together the llists for both things waiting,
>> and for things that are being woken.
>>
>> If a closure that is woken decides to wait again, it will corrupt the
>> llist that __closure_wake_up is using.
>>
>> The previous iteration structure gets the next element of the list
>> before waking and is therefore safe.
>
>
> Hi Michael and Byungchul,
>
> This is my fault to suggest Byungchul to change his correct patch into
> wrong one. But it's good to learn such an implicit race behind bcache
> code. I just post a patch to explain how this race may happen and
> corrupt the reverse list iteration. Could you please to review the fix ?
>
> And thanks to Michael again to catch this bug.
>
> --
> Coly Li


Re: [PATCH] block: fix a build error

2017-09-26 Thread Jens Axboe
On 09/26/2017 08:02 PM, Shaohua Li wrote:
> The code is only for blkcg not for all cgroups

Added, thanks.

-- 
Jens Axboe



[PATCH] block: fix a build error

2017-09-26 Thread Shaohua Li
The code is only for blkcg not for all cgroups

Reported-by: kbuild test robot 
Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c| 2 +-
 include/linux/kthread.h | 2 +-
 kernel/kthread.c| 8 
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 37c4da7..7269341 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1695,7 +1695,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
 
/* always use the first bio's css */
-#ifdef CONFIG_CGROUPS
+#ifdef CONFIG_BLK_CGROUP
if (cmd->use_aio && cmd->rq->bio && cmd->rq->bio->bi_css) {
cmd->css = cmd->rq->bio->bi_css;
css_get(cmd->css);
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index bd4369c..fb20184 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -199,7 +199,7 @@ bool kthread_cancel_delayed_work_sync(struct 
kthread_delayed_work *work);
 
 void kthread_destroy_worker(struct kthread_worker *worker);
 
-#ifdef CONFIG_CGROUPS
+#ifdef CONFIG_BLK_CGROUP
 void kthread_associate_blkcg(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *kthread_blkcg(void);
 #else
diff --git a/kernel/kthread.c b/kernel/kthread.c
index a8b4e83..ed95828 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -46,7 +46,7 @@ struct kthread {
void *data;
struct completion parked;
struct completion exited;
-#ifdef CONFIG_CGROUPS
+#ifdef CONFIG_BLK_CGROUP
struct cgroup_subsys_state *blkcg_css;
 #endif
 };
@@ -83,7 +83,7 @@ void free_kthread_struct(struct task_struct *k)
 * or if kmalloc() in kthread() failed.
 */
kthread = to_kthread(k);
-#ifdef CONFIG_CGROUPS
+#ifdef CONFIG_BLK_CGROUP
WARN_ON_ONCE(kthread && kthread->blkcg_css);
 #endif
kfree(kthread);
@@ -224,7 +224,7 @@ static int kthread(void *_create)
self->data = data;
init_completion(&self->exited);
init_completion(&self->parked);
-#ifdef CONFIG_CGROUPS
+#ifdef CONFIG_BLK_CGROUP
self->blkcg_css = NULL;
 #endif
current->vfork_done = &self->exited;
@@ -1165,7 +1165,7 @@ void kthread_destroy_worker(struct kthread_worker *worker)
 }
 EXPORT_SYMBOL(kthread_destroy_worker);
 
-#ifdef CONFIG_CGROUPS
+#ifdef CONFIG_BLK_CGROUP
 /**
  * kthread_associate_blkcg - associate blkcg to current kthread
  * @css: the cgroup info
-- 
2.9.5



Re: [PATCH V9 08/15] mmc: block: Factor out mmc_setup_queue()

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:36 PM, Adrian Hunter  wrote:

> Factor out some common code that will also be used with blk-mq.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH V9 07/15] mmc: block: Prepare CQE data

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:36 PM, Adrian Hunter  wrote:

> Enhance mmc_blk_data_prep() to support CQE requests. That means adding
> some things that for non-CQE requests would be encoded into the command
> arguments - such as the block address, reliable-write flag, and data tag
> flag. Also the request tag is needed to provide the command queue task id,
> and a comment is added to explain the future possibility of defining a
> priority.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH V9 06/15] mmc: block: Use local variables in mmc_blk_data_prep()

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:36 PM, Adrian Hunter  wrote:

> Use local variables in mmc_blk_data_prep() in preparation for adding CQE
> support which doesn't use the output variables.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH V9 03/15] mmc: core: Add support for handling CQE requests

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:36 PM, Adrian Hunter  wrote:

> Add core support for handling CQE requests, including starting, completing
> and recovering.
>
> Signed-off-by: Adrian Hunter 

Looks straight-forward to me.
Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH V9 04/15] mmc: mmc: Enable Command Queuing

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:36 PM, Adrian Hunter  wrote:

> Enable the Command Queue if the host controller supports a command queue
> engine. It is not compatible with Packed Commands, so make a note of that in 
> the
> comment.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH V9 05/15] mmc: mmc: Enable CQE's

2017-09-26 Thread Linus Walleij
On Fri, Sep 22, 2017 at 2:36 PM, Adrian Hunter  wrote:

> Enable or disable CQE when a card is added or removed respectively.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH v4 1/7] md: Make md resync and reshape threads freezable

2017-09-26 Thread Ming Lei
On Tue, Sep 26, 2017 at 02:40:36PM +, Bart Van Assche wrote:
> On Tue, 2017-09-26 at 16:13 +0800, Ming Lei wrote:
> > I am pretty sure that suspend/resume can survive when resync in progress
> > with my patchset applied on RAID1, without any MD change.
> 
> The above shows that you do not understand how suspend and resume works.
> In the documents in the directory Documentation/power it is explained clearly
> that I/O must be stopped before the hibernation image is generation to avoid

No, I don't use hibernation, and just use suspend/resume(s2r).

> hard to repair filesystem corruption. Although md is not a filesystem I think
> this also applies to md since md keeps some state information in RAM and some
> state information on disk. It is essential that all that state information is
> in consistent.
> 
> > If your patchset depends on this MD change, something should be wrong
> > in the following patches. Now I need to take a close look.
> 
> The later patches that make SCSI quiesce and resume safe do not depend on
> this change.

Are you sure?

If I remove the 1st patch, system suspend/resume will hang with all your
other 6 patches applied.


-- 
Ming


Re: [PATCH v4 1/7] md: Make md resync and reshape threads freezable

2017-09-26 Thread Ming Lei
On Tue, Sep 26, 2017 at 02:42:07PM +, Bart Van Assche wrote:
> On Tue, 2017-09-26 at 19:17 +0800, Ming Lei wrote:
> > Just test this patch a bit and the following failure of freezing task
> > is triggered during suspend: [ ... ]
> 
> What kernel version did you start from and which patches were applied on top 
> of
> that kernel? Only patch 1/7 or all seven patches? What storage configuration 
> did

It is v4.14-rc1+, and top commit is 8d93c7a43157, with all your 7 patches
applied.

> you use in your test and what command(s) did you use to trigger suspend?

Follows my pm test script:

#!/bin/sh

echo check > /sys/block/md127/md/sync_action

mkfs.ext4 -F /dev/md127

mount /dev/md0 /mnt/data

dd if=/dev/zero of=/mnt/data/d1.img bs=4k count=128k&

echo 9 > /proc/sys/kernel/printk
echo devices > /sys/power/pm_test
echo mem > /sys/power/state

wait
umount /mnt/data

Storage setting:

sudo mdadm --create /dev/md/test /dev/sda /dev/sdb --level=1 
--raid-devices=2
both /dev/sda and /dev/sdb are virtio-scsi.


-- 
Ming


Re: [PATCH v4 7/7] block: Make SCSI device suspend and resume work reliably

2017-09-26 Thread Bart Van Assche
On Tue, 2017-09-26 at 16:32 +0800, Ming Lei wrote:
> On Mon, Sep 25, 2017 at 11:13:47PM +, Bart Van Assche wrote:
> > Sorry but I disagree. I'm using RCU to achieve the same effect as a barrier
> > and to move the cost of the barrier from the reader to the updater. See also
> > Paul E. McKenney, Mathieu Desnoyers, Lai Jiangshan, and Josh Triplett,
> > The RCU-barrier menagerie, LWN.net, November 12, 2013
> > (https://lwn.net/Articles/573497/).
> 
> Let me explain it in a bit details: [ ... ]

The approach I used in patch 7/7 is identical to one of the approaches explained
in the article I referred to. If you do not agree that that approach is safe 
that
means that you do not neither agree with Paul McKenney's view on how RCU works.
Do you really think that you know more about RCU than Paul McKenney?

Bart.

Re: [PATCH v4 1/7] md: Make md resync and reshape threads freezable

2017-09-26 Thread Bart Van Assche
On Tue, 2017-09-26 at 19:17 +0800, Ming Lei wrote:
> Just test this patch a bit and the following failure of freezing task
> is triggered during suspend: [ ... ]

What kernel version did you start from and which patches were applied on top of
that kernel? Only patch 1/7 or all seven patches? What storage configuration did
you use in your test and what command(s) did you use to trigger suspend?

Bart.

Re: [PATCH v4 1/7] md: Make md resync and reshape threads freezable

2017-09-26 Thread Bart Van Assche
On Tue, 2017-09-26 at 16:13 +0800, Ming Lei wrote:
> I am pretty sure that suspend/resume can survive when resync in progress
> with my patchset applied on RAID1, without any MD change.

The above shows that you do not understand how suspend and resume works.
In the documents in the directory Documentation/power it is explained clearly
that I/O must be stopped before the hibernation image is generated, to avoid
hard-to-repair filesystem corruption. Although md is not a filesystem, I think
this also applies to md since md keeps some state information in RAM and some
state information on disk. It is essential that all of that state information
is consistent.

> If your patchset depends on this MD change, something should be wrong
> in the following patches. Now I need to take a close look.

The later patches that make SCSI quiesce and resume safe do not depend on
this change.

Bart.

Re: [PATCH v4 0/7] Make suspend and resume safe for md-on-SCSI

2017-09-26 Thread Bart Van Assche
On Tue, 2017-09-26 at 17:15 +0800, Ming Lei wrote:
> No, if you only address issue on MD device, that is definitely not
> alternative of my patchset.

A clarification: my patch series not only fixes suspend and resume for 
md-on-SCSI
but also for all other scenario's in which resume locks up due to no tag being
available when the SCSI core tries to execute a power management command.

Bart.

Re: dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure

2017-09-26 Thread Bart Van Assche
On Tue, 2017-09-26 at 16:50 +0800, Ming Lei wrote:
> If you don't consider to change mpath into block in .queue_rq() now,
> please take this patch first, and I am sure this way is correct, and
> even it can be thought as a fix.

As I explained earlier, it would be wrong to take this patch upstream without
having tested and measured the alternative I had proposed first
(https://www.spinics.net/lists/dm-devel/msg32282.html). My alternative handles
SCSI hosts with multiple LUNs properly but the patch at the start of this
e-mail thread not.

Bart.

Re: [PATCH] bcache: fix a comments typo in bch_alloc_sectors()

2017-09-26 Thread tang . junhui
From: 10074136 

> Code comments in alloc.c:bch_alloc_sectors() mentions a function
> name find_data_bucket(), the correct function name should be
> pick_data_bucket() indeed. bch_alloc_sectors() is a quite important
> function in bcache allocation code, fixing the typo may help
> other people to have less confusion.
> 
> Signed-off-by: Coly Li 
> ---
>  drivers/md/bcache/alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index cacbe2dbd5c3..071ff28be912 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -600,7 +600,7 @@ bool bch_alloc_sectors(struct cache_set *c, struct bkey 
> *k, unsigned sectors,
>  
>/*
> * If we had to allocate, we might race and not need to 
> allocate the
> -   * second time we call find_data_bucket(). If we allocated a 
> bucket but
> +   * second time we call pick_data_bucket(). If we allocated a 
> bucket but
> * didn't use it, drop the refcount bch_bucket_alloc_set() 
> took:
> */
>if (KEY_PTRS(&alloc.key))

Yes, It's useful for code reading, Thanks.

Reviewed-by: Tang Junhui 


Re: [PATCH v4 1/7] md: Make md resync and reshape threads freezable

2017-09-26 Thread Ming Lei
On Mon, Sep 25, 2017 at 01:29:18PM -0700, Bart Van Assche wrote:
> Some people use the md driver on laptops and use the suspend and
> resume functionality. Since it is essential that submitting of
> new I/O requests stops before device quiescing starts, make the
> md resync and reshape threads freezable.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  drivers/md/md.c | 21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 08fcaebc61bd..26a12bd0db65 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -66,6 +66,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/freezer.h>
>  
>  #include 
>  #include "md.h"
> @@ -7424,6 +7425,7 @@ static int md_thread(void *arg)
>*/
>  
>   allow_signal(SIGKILL);
> + set_freezable();
>   while (!kthread_should_stop()) {
>  
>   /* We need to wait INTERRUPTIBLE so that
> @@ -7434,7 +7436,7 @@ static int md_thread(void *arg)
>   if (signal_pending(current))
>   flush_signals(current);
>  
> - wait_event_interruptible_timeout
> + wait_event_freezable_timeout
>   (thread->wqueue,
>test_bit(THREAD_WAKEUP, &thread->flags)
>|| kthread_should_stop() || kthread_should_park(),
> @@ -8133,6 +8135,8 @@ void md_do_sync(struct md_thread *thread)
>   return;
>   }
>  
> + set_freezable();
> +
>   if (mddev_is_clustered(mddev)) {
>   ret = md_cluster_ops->resync_start(mddev);
>   if (ret)
> @@ -8324,7 +8328,7 @@ void md_do_sync(struct md_thread *thread)
>mddev->curr_resync_completed > mddev->resync_max
>   )) {
>   /* time to update curr_resync_completed */
> - wait_event(mddev->recovery_wait,
> + wait_event_freezable(mddev->recovery_wait,
>  atomic_read(&mddev->recovery_active) == 0);
>   mddev->curr_resync_completed = j;
>   if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
> @@ -8342,10 +8346,10 @@ void md_do_sync(struct md_thread *thread)
>* to avoid triggering warnings.
>*/
>   flush_signals(current); /* just in case */
> - wait_event_interruptible(mddev->recovery_wait,
> -  mddev->resync_max > j
> -  || test_bit(MD_RECOVERY_INTR,
> -  &mddev->recovery));
> + wait_event_freezable(mddev->recovery_wait,
> +  mddev->resync_max > j ||
> +  test_bit(MD_RECOVERY_INTR,
> +   &mddev->recovery));
>   }
>  
>   if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
> @@ -8421,7 +8425,7 @@ void md_do_sync(struct md_thread *thread)
>* Give other IO more of a chance.
>* The faster the devices, the less we wait.
>*/
> - wait_event(mddev->recovery_wait,
> + wait_event_freezable(mddev->recovery_wait,
>  
> !atomic_read(>recovery_active));
>   }
>   }
> @@ -8433,7 +8437,8 @@ void md_do_sync(struct md_thread *thread)
>* this also signals 'finished resyncing' to md_stop
>*/
>   blk_finish_plug();
> -	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
> +	wait_event_freezable(mddev->recovery_wait,
> +			     !atomic_read(&mddev->recovery_active));
>  
> 	if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> 	    !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> -- 
> 2.14.1
> 

I tested this patch a bit, and the following task-freezing failure
is triggered during suspend:

[   38.903513] PM: suspend entry (deep)
[   38.904443] PM: Syncing filesystems ... done.
[   38.983591] Freezing user space processes ... (elapsed 0.002 seconds) done.
[   38.987522] OOM killer disabled.
[   38.987962] Freezing remaining freezable tasks ...
[   58.998872] Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
[   59.002539] md127_resyncD0  1618  2 0x8000
[   59.004954] Call Trace:
[   59.006162]  __schedule+0x41f/0xa50
[   59.007704]  schedule+0x3d/0x90
[   59.009305]  raid1_sync_request+0x2da/0xd10 [raid1]
[   59.011505]  ? 
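
The trace shows md127_resync blocked inside raid1_sync_request(), i.e. in a
plain, non-freezable wait deeper in the call chain, which set_freezable()
plus a freezable top-level wait cannot help with. A minimal sketch of that
pattern (demo_wq, demo_cond and example_resync_thread are made-up names,
nothing here is taken from md.c or raid1.c):

#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static bool demo_cond;

/* Illustration only: a kthread is freezable only while it sleeps in a
 * freezable wait. */
static int example_resync_thread(void *arg)
{
	set_freezable();
	while (!kthread_should_stop()) {
		/* The freezer can catch the thread here ... */
		wait_event_freezable(demo_wq,
				     demo_cond || kthread_should_stop());

		/* ... but not here: a plain wait_event() further down the
		 * stack (e.g. the barrier wait inside raid1_sync_request())
		 * ignores the freezer, and suspend fails as shown above. */
		wait_event(demo_wq, demo_cond || kthread_should_stop());
	}
	return 0;
}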

[PATCH 0/1] bcache fix for 4.14-rc4

2017-09-26 Thread Coly Li
Hi Jens,

Michael Lyle caught a race bug in the bcache 4.14 patch set. Here is the fix
for this issue. Could you please pick it up for 4.14-rc4? Then we don't need
to Cc sta...@vger.kernel.org in the next cycle.

Thanks in advance.

Coly Li



[PATCH 1/1] bcache: use llist_for_each_entry_safe() in __closure_wake_up()

2017-09-26 Thread Coly Li
Commit 09b3efec ("bcache: Don't reinvent the wheel but use existing llist
API") replaces the following while loop by llist_for_each_entry(),

-
-   while (reverse) {
-   cl = container_of(reverse, struct closure, list);
-   reverse = llist_next(reverse);
-
+   llist_for_each_entry(cl, reverse, list) {
closure_set_waiting(cl, 0);
closure_sub(cl, CLOSURE_WAITING + 1);
}

This modification introduces a potential race by iterating a corrupted
list. Here is how it happens.

In the above modification, closure_sub() may wake up a process that is
waiting on the reverse list. If this process decides to wait again by
calling closure_wait(), its cl->list will be added to another wait list.
Then when llist_for_each_entry() moves on to the next node, it will travel
on the new wait list added in closure_wait(), not the original reverse
list in __closure_wake_up(). This is more likely to happen on a UP machine,
because the woken process may preempt the process that woke it up.

Using llist_for_each_entry_safe() fixes the issue: the safe version fetches
the next node before waking up a process, so the saved copy of the next
node keeps the list iteration on the original reverse list.
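
To make the ordering explicit, here is an annotated copy of the two loops
(illustration only, this adds no code beyond the hunk below):

	/* Unsafe: the next node is read from cl->list only after the loop
	 * body has run; closure_sub() may have woken the waiter, and the
	 * waiter may have re-linked cl->list onto a different wait list
	 * via closure_wait(). */
	llist_for_each_entry(cl, reverse, list) {
		closure_set_waiting(cl, 0);
		closure_sub(cl, CLOSURE_WAITING + 1);
	}

	/* Safe: 't' caches the next node of the original reverse list
	 * before the current closure is woken, so the walk cannot be
	 * redirected onto a new wait list. */
	llist_for_each_entry_safe(cl, t, reverse, list) {
		closure_set_waiting(cl, 0);
		closure_sub(cl, CLOSURE_WAITING + 1);
	}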

Signed-off-by: Coly Li 
Reported-by: Michael Lyle 
Reviewed-by: Byungchul Park 
---
 drivers/md/bcache/closure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 7d5286b05036..1841d0359bac 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -64,7 +64,7 @@ EXPORT_SYMBOL(closure_put);
 void __closure_wake_up(struct closure_waitlist *wait_list)
 {
struct llist_node *list;
-   struct closure *cl;
+   struct closure *cl, *t;
struct llist_node *reverse = NULL;
 
 	list = llist_del_all(&wait_list->list);
@@ -73,7 +73,7 @@ void __closure_wake_up(struct closure_waitlist *wait_list)
reverse = llist_reverse_order(list);
 
/* Then do the wakeups */
-   llist_for_each_entry(cl, reverse, list) {
+   llist_for_each_entry_safe(cl, t, reverse, list) {
closure_set_waiting(cl, 0);
closure_sub(cl, CLOSURE_WAITING + 1);
}
-- 
2.13.5



Re: [PATCH v4 0/7] Make suspend and resume safe for md-on-SCSI

2017-09-26 Thread Ming Lei
On Mon, Sep 25, 2017 at 01:29:17PM -0700, Bart Van Assche wrote:
> Hello Jens,
> 
> It is known that during the resume following a hibernate, especially when
> using an md RAID1 array created on top of SCSI devices, sometimes the
> system hangs instead of coming up properly. This patch series fixes this
> problem. This patch series is an alternative for Ming Lei's "[PATCH V5
> 0/10] block/scsi: safe SCSI quiescing" patch series. The advantages of
> this patch series are:

No, if you only address the issue on MD devices, that is definitely not
an alternative to my patchset.

People have verified that my patches fix both the MD and the
non-MD (btrfs/raid, transport_spi) cases. And my patchset can fix this
kind of issue on any SCSI device, with or without MD.

-- 
Ming


Re: [PATCH v3 0/6] Make SCSI device suspend and resume work reliably

2017-09-26 Thread Ming Lei
On Mon, Sep 25, 2017 at 04:17:27PM +, Bart Van Assche wrote:
> On Mon, 2017-09-25 at 10:36 +0800, Ming Lei wrote:
> > On Sat, Sep 23, 2017 at 6:13 AM, Bart Van Assche  
> > wrote:
> > > It is known that during the resume following a hibernate sometimes the
> > > system hangs instead of coming up properly. This patch series fixes this
> > > problem. This patch series is an alternative for Ming Lei's "[PATCH V5
> > > 0/10] block/scsi: safe SCSI quiescing" patch series. The advantages of
> > > this patch series are:
> > 
> > No, your patch doesn't fix scsi quiesce on block legacy, so not an 
> > alternative
> > of my patchset at all.
> 
> This patch series definitely is an alternative for blk-mq/scsi-mq. And as you
> know my approach can be extended easily to the legacy SCSI core by adding
> blk_queue_enter() / blk_queue_exit() calls where necessary in the legacy block
> layer. I have not done this because the bug report was against scsi-mq and not
> against the legacy SCSI core. Additionally, since the legacy block layer and
> SCSI core are on their way out I did not want to spend time on modifying 
> these.

Let me show you the legacy report and verification:

https://www.spinics.net/lists/linux-block/msg17237.html

If you have a transport_spi device at hand, the issue can be reproduced
in several minutes in the following way:

- set nr_requests of this disk to 4
- in a loop, trigger a revalidate once every 5 seconds
- meanwhile, run heavy/background concurrent I/O

-- 
Ming


Re: dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure

2017-09-26 Thread Ming Lei
On Mon, Sep 25, 2017 at 12:17:03PM -0400, Mike Snitzer wrote:
> On Mon, Sep 25 2017 at 12:10pm -0400,
> Ming Lei  wrote:
> 
> > On Mon, Sep 25, 2017 at 03:23:16PM +, Bart Van Assche wrote:
> > > On Mon, 2017-09-25 at 11:06 +0800, Ming Lei wrote:
> > > > On Fri, Sep 22, 2017 at 05:54:48PM +, Bart Van Assche wrote:
> > > > > On Sat, 2017-09-23 at 01:44 +0800, Ming Lei wrote:
> > > > > > On Fri, Sep 22, 2017 at 03:06:16PM +, Bart Van Assche wrote:
> > > > > > > On Fri, 2017-09-22 at 09:35 +0800, Ming Lei wrote:
> > > > > > > > +		/*
> > > > > > > > +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> > > > > > > > +		 * we needn't to deal with it by DELAY_REQUEUE. More
> > > > > > > > +		 * importantly, we have to return DM_MAPIO_REQUEUE
> > > > > > > > +		 * so that blk-mq can get the queue busy feedback,
> > > > > > > > +		 * otherwise I/O merge can be hurt.
> > > > > > > > +		 */
> > > > > > > > +		if (q->mq_ops)
> > > > > > > > +			return DM_MAPIO_REQUEUE;
> > > > > > > > +		else
> > > > > > > > +			return DM_MAPIO_DELAY_REQUEUE;
> > > > > > > > 	}
> > > > > > > 
> > > > > > > This patch is inferior to what I posted because this patch does 
> > > > > > > not avoid
> > > > > > > the delay if multiple LUNs are associated with the same SCSI 
> > > > > > > host. Consider
> > > > > > > e.g. the following configuration:
> > > > > > > * A single SCSI host with two SCSI LUNs associated to that host, 
> > > > > > > e.g. /dev/sda
> > > > > > >   and /dev/sdb.
> > > > > > > * A dm-mpath instance has been created on top of /dev/sda.
> > > > > > > If all tags are in use by requests queued to /dev/sdb, no dm 
> > > > > > > requests are in
> > > > > > > progress and a request is submitted against the dm-mpath device 
> > > > > > > then the
> > > > > > > blk_get_request(q, GFP_ATOMIC) call will fail. The request will 
> > > > > > > be requeued
> > > > > > > and the queue will be rerun after a delay.
> > > > > > > 
> > > > > > > My patch does not introduce a delay in this case.
> > > > > > 
> > > > > > That delay may not matter because SCHED_RESTART will run queue just
> > > > > > after one request is completed.
> > > > > 
> > > > > Did you understand what I wrote? SCHED_RESTART will be set for 
> > > > > /dev/sdb but not
> > > > > for the dm queue. That's what I was trying to explain to you in my 
> > > > > previous e-mail.
> > > > 
> > > > The patch I posted in this thread will set SCHED_RESTART for dm queue.
> > > 
> > > This is not how communication on an open source mailing list is assumed 
> > > to work.
> > > If you know that you are wrong you are assumed either to shut up or to 
> > > admit it.
> 
> Code speaks much better than unnecessarily caustic exchanges.  Not sure
> why Bart persists with that.. but that's for him to sort out.
>  
> > You just mentioned 'This patch is inferior' and never explained my patch
> > is wrong, so please go ahead and show me why this patch(the post in this
> > thread, also the following link) is wrong.
> > 
> > https://marc.info/?l=linux-block=150604412910113=2
> > 
> > I admit both are two ways for the issue, but I don't think my patch
> > is wrong. Your approach can be a very big change because .queue_rq()
> > will block, and I also mentioned it might cause AIO regression.
> 
> I have no interest in changing DM multipath to block in .queue_rq()
> So please consider that approach a dead end.
> 
> Ming, just iterate on your revised patchset, test and post when you're
> happy with it.

Hi Mike,

If you don't plan to change dm-mpath to block in .queue_rq() now,
please take this patch first. I am sure this approach is correct, and
it can even be considered a fix.


Thanks,
Ming


Re: false positive lockdep splat with loop device

2017-09-26 Thread Amir Goldstein
On Tue, Sep 26, 2017 at 7:24 AM, Dave Chinner  wrote:
> On Thu, Sep 21, 2017 at 09:43:41AM +0300, Amir Goldstein wrote:
>> On Thu, Sep 21, 2017 at 1:22 AM, Dave Chinner  wrote:
>> > [cc lkml, PeterZ and Byungchul]
>> ...
>> > The thing is, this IO completion has nothing to do with the lower
>> > filesystem - it's the IO completion for the filesystem on the loop
>> > device (the upper filesystem) and is not in any way related to the
>> > IO completion from the dax device the lower filesystem is waiting
>> > on.
>> >
>> > IOWs, this is a false positive.
>> >
>> > Peter, this is the sort of false positive I mentioned were likely to
>> > occur without some serious work to annotate the IO stack to prevent
>> > them.  We can nest multiple layers of IO completions and locking in
>> > the IO stack via things like loop and RAID devices.  They can be
>> > nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
>> > dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
>> > checking is going to be a source of false positives until there is
>> > an effective (and simple!) way of providing context based completion
>> > annotations to avoid them...
>> >
>>
>> IMO, the way to handle this is to add 'nesting_depth' information
>> on blockdev (or bdi?). 'nesting' in the sense of blockdev->fs->blockdev->fs.
>> AFAIK, the only blockdev drivers that need to bump nesting_depth
>> are loop and nbd??
>
> You're assumming that this sort of "completion inversion" can only
> happen with bdev->fs->bdev, and that submit_bio_wait() is the only
> place where completions are used in stackable block devices.
>
> AFAICT, this could happen on with any block device that can be
> stacked multiple times that uses completions. e.g. MD has a function
> sync_page_io() that calls submit_bio_wait(), and that is called from
> places in the raid 5, raid 10, raid 1 and bitmap layers (plus others
> in DM). These can get stacked anywhere - even on top of loop devices
> - and so I think the issue has a much wider scope than just loop and
> nbd devices.

True. We can probably duplicate the struct file_system_type pattern,
something like:

struct block_device_type loop_blkdev_type = {
	.owner	= THIS_MODULE,
	.name	= "loop",
	.devt	= MKDEV(LOOP_MAJOR, 0),
	.probe	= loop_probe,
	.lock	= NULL,
};

...
blk_register_region(&loop_blkdev_type, range, NULL);

And then we have a static placeholder for the lock_class_key address
to be used when annotating bio completions.
I realize the same block device types can be nested via raid/dm/loop,
but, as we can see in the splat, so can the same filesystem types via
loop/nbd, so we can think of a similar solution for both cases.

>
>> Not sure if the kernel should limit loop blockdev nesting depth??
>
> There's no way we should do that just because new lockdep
> functionality is unable to express such constructs.
>

Of course not. I was just wondering if there should be a limit to
blockdev nesting regardless (e.g. too much queuing in the io stack).

But even if there is no limit to blockdev nesting, we can define
CONFIG_LOCKDEP_MAX_NESTING and:

struct lock_class_nested_key {
struct lock_class_key keys[CONFIG_LOCKDEP_MAX_NESTING];
};

Then struct file_system_type and struct block_device_type keys
could be of type struct lock_class_nested_key.

nested_depth should be carried in struct gendisk, and it is the
responsibility of the blockdev driver to bump nested_depth if it
is a nested blockdev.

Callers of lockdep_set_class(), like lockdep_annotate_inode_mutex_key(),
should use a variant lockdep_set_class_nested(lock, key, nested_depth)
if appropriate.

For any depth greater than CONFIG_LOCKDEP_MAX_NESTING,
lockdep_set_class_nested() will overload the last key in the
array, so we don't solve false positives for infinite nesting depth,
but we solve them for a defined max nesting depth.
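
A rough sketch of what such a helper could look like (struct
lock_class_nested_key is repeated from above; lockdep_set_class_nested()
is hypothetical and CONFIG_LOCKDEP_MAX_NESTING is hard-coded here only
for illustration):

#include <linux/kernel.h>
#include <linux/lockdep.h>

#define CONFIG_LOCKDEP_MAX_NESTING 8	/* normally from Kconfig */

struct lock_class_nested_key {
	struct lock_class_key keys[CONFIG_LOCKDEP_MAX_NESTING];
};

/*
 * Hypothetical helper: pick a per-depth key so that, say, a loop device
 * completion at nesting depth 1 gets a different lock class than the
 * same code at depth 0.  Depths beyond the array reuse the last key, so
 * deeper stacks still work, just without per-depth separation (hence
 * the residual false positives mentioned above).
 */
#define lockdep_set_class_nested(lock, nkey, depth)			\
	lockdep_set_class((lock),					\
		&(nkey)->keys[min_t(unsigned int, (depth),		\
				    CONFIG_LOCKDEP_MAX_NESTING - 1)])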

Alas, even if this is workable, it will not be anytime soon that I can
back up this scribble with tested patches.

Amir.


Re: [PATCH v4 2/7] block: Make q_usage_counter also track legacy requests

2017-09-26 Thread Ming Lei
On Mon, Sep 25, 2017 at 01:29:19PM -0700, Bart Van Assche wrote:
> From: Ming Lei 
> 
> This patch makes it possible to pause request allocation for
> the legacy block layer by calling blk_mq_freeze_queue() and
> blk_mq_unfreeze_queue().
> 
> Signed-off-by: Ming Lei 
> [ bvanassche: Combined two patches into one, edited a comment and made sure
>   REQ_NOWAIT is handled properly in blk_old_get_request() ]
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---

Thanks for considering the block legacy path in this version!

-- 
Ming


Re: [PATCH v4 7/7] block: Make SCSI device suspend and resume work reliably

2017-09-26 Thread Ming Lei
On Mon, Sep 25, 2017 at 11:13:47PM +, Bart Van Assche wrote:
> On Tue, 2017-09-26 at 06:59 +0800, Ming Lei wrote:
> > On Mon, Sep 25, 2017 at 01:29:24PM -0700, Bart Van Assche wrote:
> > > +int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
> > >  {
> > >   while (true) {
> > >   int ret;
> > >  
> > > -	if (percpu_ref_tryget_live(&q->q_usage_counter))
> > > -		return 0;
> > > +	if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > > + /*
> > > +  * Since setting the PREEMPT_ONLY flag is followed
> > > +  * by a switch of q_usage_counter from per-cpu to
> > > +  * atomic mode and back to per-cpu and since the
> > > +  * switch to atomic mode uses call_rcu_sched(), it
> > > +  * is not necessary to call smp_rmb() here.
> > > +  */
> > 
> > rcu_read_lock is held only inside percpu_ref_tryget_live().
> > 
> > Without one explicit barrier(smp_mb) between getting the refcounter
> > and reading the preempt only flag, the two operations(writing to
> > refcounter and reading the flag) can be reordered, so
> > unfreeze/unfreeze may be completed before this IO is completed.
> 
> Sorry but I disagree. I'm using RCU to achieve the same effect as a barrier
> and to move the cost of the barrier from the reader to the updater. See also
> Paul E. McKenney, Mathieu Desnoyers, Lai Jiangshan, and Josh Triplett,
> The RCU-barrier menagerie, LWN.net, November 12, 2013
> (https://lwn.net/Articles/573497/).

Let me explain it in a bit more detail:

1) in the SCSI quiesce path:

We can think of it as there being one synchronize_rcu() in freeze/unfreeze.

Let's look at the RCU documentation:

Documentation/RCU/whatisRCU.txt:

void synchronize_rcu(void);

Marks the end of updater code and the beginning of reclaimer
code.  It does this by blocking until all pre-existing RCU
read-side critical sections on all CPUs have completed.
Note that synchronize_rcu() will -not- necessarily wait for
any subsequent RCU read-side critical sections to complete.
For example, consider the following sequence of events:

So synchronize_rcu() in the SCSI quiesce path only waits for the completion
of pre-existing read-side critical sections; subsequent RCU read-side
critical sections are not waited for.

2) in the normal I/O path of blk_queue_enter():

- the RCU read lock is held only inside percpu_ref_tryget_live(), and it
is released when this helper returns.

- there is no explicit barrier (smp_mb()) between percpu_ref_tryget_live()
and checking the preempt-only flag, so writing to the percpu_ref counter
and reading the preempt flag can be reordered as follows:

-- check the preempt-only flag

	the current process is preempted right here, and at exactly that
	time SCSI quiesce runs from another process; the freeze/unfreeze
	completes because there are no pre-existing read-side critical
	sections, and the percpu_ref isn't held either.

	finally this process is scheduled back in, grabs one ref and
	submits I/O after SCSI has been quiesced (which shouldn't be
	allowed)

-- percpu_ref_tryget_live()
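
Put differently, without relying on the RCU argument, the entry path would
need an explicit full barrier between elevating the reference and reading
the flag. A rough sketch (illustrative only; blk_queue_preempt_only() and
BLK_REQ_PREEMPT are names from the patches under discussion, and the
wait/slow path is left out):

#include <linux/blkdev.h>
#include <linux/percpu-refcount.h>

int blk_queue_enter_sketch(struct request_queue *q, unsigned int flags)
{
	if (!percpu_ref_tryget_live(&q->q_usage_counter))
		return -EBUSY;		/* slow/wait path omitted */

	/*
	 * Full barrier so the write to the percpu-ref counter cannot be
	 * reordered after the read of the preempt-only flag; otherwise
	 * the quiesce path may neither see our reference nor wait for
	 * the I/O we are about to submit.
	 */
	smp_mb();

	if (blk_queue_preempt_only(q) && !(flags & BLK_REQ_PREEMPT)) {
		percpu_ref_put(&q->q_usage_counter);
		return -EBUSY;
	}

	return 0;
}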


-- 
Ming


RE: [PATCH] bcache: use llist_for_each_entry_safe() in __closure_wake_up()

2017-09-26 Thread Byungchul Park
> -Original Message-
> From: Coly Li [mailto:i...@coly.li]
> Sent: Tuesday, September 26, 2017 5:00 PM
> To: Byungchul Park
> Subject: Fwd: [PATCH] bcache: use llist_for_each_entry_safe() in
> __closure_wake_up()
> 
> Hi Byungchul,
> 
> I posted the fix on linux-bcache mailing list, could you please to
> review my fix ?
> 
> Sorry for the confusion and thanks for your review in advance.

All right :)

Reviewed-by: Byungchul Park 

> 
> Coly Li
> 
> 
>  Forwarded Message 
> From: Coly Li 
> To: linux-bca...@vger.kernel.org, linux-block@vger.kernel.org
> Cc: Coly Li 
> Subject: [PATCH] bcache: use llist_for_each_entry_safe() in
> __closure_wake_up()
> Date: Tue, 26 Sep 2017 15:37:02 +0800
> Message-Id: <20170926073702.71606-1-col...@suse.de>
> 
> Commit 09b3efec ("bcache: Don't reinvent the wheel but use existing llist
> API") replaces the following while loop by llist_for_each_entry(),
> 
> -
> - while (reverse) {
> - cl = container_of(reverse, struct closure, list);
> - reverse = llist_next(reverse);
> -
> + llist_for_each_entry(cl, reverse, list) {
>   closure_set_waiting(cl, 0);
>   closure_sub(cl, CLOSURE_WAITING + 1);
>   }
> 
> This modification introduces a potential race by iterating a corrupted
> list. Here is how it happens.
> 
> In the above modification, closure_sub() may wake up a process which is
> waiting on reverse list. If this process decides to wait again by calling
> closure_wait(), its cl->list will be added to another wait list. Then
> when llist_for_each_entry() continues to iterate next node, it will travel
> on another new wait list which is added in closure_wait(), not the
> original reverse list in __closure_wake_up(). It is more probably to
> happen on UP machine because the waked up process may preempt the process
> which wakes up it.
> 
> Use llist_for_each_entry_safe() will fix the issue, the safe version fetch
> next node before waking up a process. Then the copy of next node will make
> sure list iteration stays on original reverse list.
> 
> Signed-off-by: Coly Li 
> Reported-by: Michael Lyle 
> ---
>  drivers/md/bcache/closure.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
> index 7d5286b05036..1841d0359bac 100644
> --- a/drivers/md/bcache/closure.c
> +++ b/drivers/md/bcache/closure.c
> @@ -64,7 +64,7 @@ EXPORT_SYMBOL(closure_put);
>  void __closure_wake_up(struct closure_waitlist *wait_list)
>  {
>   struct llist_node *list;
> - struct closure *cl;
> + struct closure *cl, *t;
>   struct llist_node *reverse = NULL;
> 	list = llist_del_all(&wait_list->list);
> @@ -73,7 +73,7 @@ void __closure_wake_up(struct closure_waitlist *wait_list)
>   reverse = llist_reverse_order(list);
>   /* Then do the wakeups */
> - llist_for_each_entry(cl, reverse, list) {
> + llist_for_each_entry_safe(cl, t, reverse, list) {
>   closure_set_waiting(cl, 0);
>   closure_sub(cl, CLOSURE_WAITING + 1);
>   }
> --
> 2.13.5
> 


Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Coly Li
On 2017/9/26 12:38 PM, Michael Lyle wrote:
> I believe this introduces a critical bug.
> 
> cl->list is used to link together the llists for both things waiting,
> and for things that are being woken.
> 
> If a closure that is woken decides to wait again, it will corrupt the
> llist that __closure_wake_up is using.
> 
> The previous iteration structure gets the next element of the list
> before waking and is therefore safe.


Hi Michael and Byungchul,

It is my fault for suggesting that Byungchul change his correct patch into
a wrong one. But it's good to learn about such an implicit race behind the
bcache code. I just posted a patch that explains how this race may happen
and corrupt the reverse list iteration. Could you please review the fix?

And thanks again to Michael for catching this bug.

-- 
Coly Li


Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Coly Li
On 2017/9/26 3:16 PM, Byungchul Park (byungchul.p...@lge.com) wrote:
>> -Original Message-
>> From: Coly Li [mailto:i...@coly.li]
>> Sent: Tuesday, September 26, 2017 4:09 PM
>> To: Michael Lyle; Coly Li
>> Cc: linux-bca...@vger.kernel.org; linux-block@vger.kernel.org;
>> ax...@kernel.dk; Eric Wheeler; Byungchul Park; Kent Overstreet
>> Subject: Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing
>> llist API
>>
>> On 2017/9/26 12:38 PM, Michael Lyle wrote:
>>> I believe this introduces a critical bug.
>>>
>>> cl->list is used to link together the llists for both things waiting,
>>> and for things that are being woken.
>>>
>>> If a closure that is woken decides to wait again, it will corrupt the
>>> llist that __closure_wake_up is using.
>>>
>>> The previous iteration structure gets the next element of the list
>>> before waking and is therefore safe.
>>>
>>
>> Hi Mike,
>>
>> Good catch! I see llist_del_all() but forget cl->list can be modified in
>> closure_wait(). Yes there is potential chance to mislead
>> llist_for_each_entry() to iterate wrong list.
>> llist_for_each_entry_safe() should be used here. I will send a fix to
>> Jens, hope to catch up 4.14 still.
> 
> I see. You have a plan to do it. Please fix it.

Oh, I only saw this email just now, while replying to your last one.
Sure, let me fix my mistake.

-- 
Coly Li


Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Coly Li
On 2017/9/26 3:15 PM, Byungchul Park (byungchul.p...@lge.com) wrote:
>> -Original Message-
>> From: Coly Li [mailto:i...@coly.li]
>> Sent: Tuesday, September 26, 2017 4:09 PM
>> To: Byungchul Park (byungchul.p...@lge.com); Michael Lyle; Coly Li
>> Cc: linux-bca...@vger.kernel.org; linux-block@vger.kernel.org;
>> ax...@kernel.dk; Eric Wheeler; Kent Overstreet; kernel-t...@lge.com
>> Subject: Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing
>> llist API
>>
>> On 2017/9/26 2:39 PM, Byungchul Park (byungchul.p...@lge.com) wrote:
 -Original Message-
 From: Michael Lyle [mailto:ml...@lyle.org]
 Sent: Tuesday, September 26, 2017 1:38 PM
 To: Coly Li
 Cc: linux-bca...@vger.kernel.org; linux-block@vger.kernel.org;
 ax...@kernel.dk; Eric Wheeler; Byungchul Park; Kent Overstreet
 Subject: Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use 
 existing
 llist API

 I believe this introduces a critical bug.

 cl->list is used to link together the llists for both things waiting,
 and for things that are being woken.

 If a closure that is woken decides to wait again, it will corrupt the
 llist that __closure_wake_up is using.

 The previous iteration structure gets the next element of the list
 before waking and is therefore safe.
>>>
>>> Do you mean we have to use llist_for_each_entry_safe() instead of non-safe
>> version?
>>> Is it ok if we use it instead?
>>
>> Yes, we should use llist_for_each_entry_safe(), there is a quite
>> implicit race here.
> 
> Hi coly,
> 
> As you know, my first patch used the safe version, but you suggested to 
> replace
> It with non-safe one. :(

No doubt, it's my fault. I didn't find the implicit potential race.

> I will change it so it does the same as the original patch did. :)

Yes, please. Because the original patch is already merged into Linux
upstream, post a fix that replaces llist_for_each_entry() with
llist_for_each_entry_safe(). We still have a chance to catch up with 4.14.

Thank you !

-- 
Coly Li


RE: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Byungchul Park
> -Original Message-
> From: Coly Li [mailto:i...@coly.li]
> Sent: Tuesday, September 26, 2017 4:09 PM
> To: Michael Lyle; Coly Li
> Cc: linux-bca...@vger.kernel.org; linux-block@vger.kernel.org;
> ax...@kernel.dk; Eric Wheeler; Byungchul Park; Kent Overstreet
> Subject: Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing
> llist API
> 
> On 2017/9/26 12:38 PM, Michael Lyle wrote:
> > I believe this introduces a critical bug.
> >
> > cl->list is used to link together the llists for both things waiting,
> > and for things that are being woken.
> >
> > If a closure that is woken decides to wait again, it will corrupt the
> > llist that __closure_wake_up is using.
> >
> > The previous iteration structure gets the next element of the list
> > before waking and is therefore safe.
> >
> 
> Hi Mike,
> 
> Good catch! I see llist_del_all() but forget cl->list can be modified in
> closure_wait(). Yes there is potential chance to mislead
> llist_for_each_entry() to iterate wrong list.
> llist_for_each_entry_safe() should be used here. I will send a fix to
> Jens, hope to catch up 4.14 still.

I see. You have a plan to do it. Please fix it.

Thank you.

> Thanks!
> --
> Coly Li


RE: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Byungchul Park
> -Original Message-
> From: Coly Li [mailto:i...@coly.li]
> Sent: Tuesday, September 26, 2017 4:09 PM
> To: Byungchul Park (byungchul.p...@lge.com); Michael Lyle; Coly Li
> Cc: linux-bca...@vger.kernel.org; linux-block@vger.kernel.org;
> ax...@kernel.dk; Eric Wheeler; Kent Overstreet; kernel-t...@lge.com
> Subject: Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing
> llist API
> 
> On 2017/9/26 2:39 PM, Byungchul Park (byungchul.p...@lge.com) wrote:
> >> -Original Message-
> >> From: Michael Lyle [mailto:ml...@lyle.org]
> >> Sent: Tuesday, September 26, 2017 1:38 PM
> >> To: Coly Li
> >> Cc: linux-bca...@vger.kernel.org; linux-block@vger.kernel.org;
> >> ax...@kernel.dk; Eric Wheeler; Byungchul Park; Kent Overstreet
> >> Subject: Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use 
> >> existing
> >> llist API
> >>
> >> I believe this introduces a critical bug.
> >>
> >> cl->list is used to link together the llists for both things waiting,
> >> and for things that are being woken.
> >>
> >> If a closure that is woken decides to wait again, it will corrupt the
> >> llist that __closure_wake_up is using.
> >>
> >> The previous iteration structure gets the next element of the list
> >> before waking and is therefore safe.
> >
> > Do you mean we have to use llist_for_each_entry_safe() instead of non-safe
> version?
> > Is it ok if we use it instead?
> 
> Yes, we should use llist_for_each_entry_safe(), there is a quite
> implicit race here.

Hi Coly,

As you know, my first patch used the safe version, but you suggested
replacing it with the non-safe one. :(

I will change it so it does the same as the original patch did. :)

Thank you very much,
Byungchul

> --
> Coly Li


Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Coly Li
On 2017/9/26 2:39 PM, Byungchul Park (byungchul.p...@lge.com) wrote:
>> -Original Message-
>> From: Michael Lyle [mailto:ml...@lyle.org]
>> Sent: Tuesday, September 26, 2017 1:38 PM
>> To: Coly Li
>> Cc: linux-bca...@vger.kernel.org; linux-block@vger.kernel.org;
>> ax...@kernel.dk; Eric Wheeler; Byungchul Park; Kent Overstreet
>> Subject: Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing
>> llist API
>>
>> I believe this introduces a critical bug.
>>
>> cl->list is used to link together the llists for both things waiting,
>> and for things that are being woken.
>>
>> If a closure that is woken decides to wait again, it will corrupt the
>> llist that __closure_wake_up is using.
>>
>> The previous iteration structure gets the next element of the list
>> before waking and is therefore safe.
> 
> Do you mean we have to use llist_for_each_entry_safe() instead of non-safe 
> version?
> Is it ok if we use it instead?

Yes, we should use llist_for_each_entry_safe(); there is a rather
subtle race here.

-- 
Coly Li


Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Coly Li
On 2017/9/26 12:38 PM, Michael Lyle wrote:
> I believe this introduces a critical bug.
> 
> cl->list is used to link together the llists for both things waiting,
> and for things that are being woken.
> 
> If a closure that is woken decides to wait again, it will corrupt the
> llist that __closure_wake_up is using.
> 
> The previous iteration structure gets the next element of the list
> before waking and is therefore safe.
>

Hi Mike,

Good catch! I saw llist_del_all() but forgot that cl->list can be modified
in closure_wait(). Yes, there is a potential chance of misleading
llist_for_each_entry() into iterating the wrong list.
llist_for_each_entry_safe() should be used here. I will send a fix to
Jens and hope it can still catch up with 4.14.

Thanks!
-- 
Coly Li


RE: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-26 Thread Byungchul Park
> -Original Message-
> From: Michael Lyle [mailto:ml...@lyle.org]
> Sent: Tuesday, September 26, 2017 1:38 PM
> To: Coly Li
> Cc: linux-bca...@vger.kernel.org; linux-block@vger.kernel.org;
> ax...@kernel.dk; Eric Wheeler; Byungchul Park; Kent Overstreet
> Subject: Re: [PATCH 04/12] bcache: Don't reinvent the wheel but use existing
> llist API
> 
> I believe this introduces a critical bug.
> 
> cl->list is used to link together the llists for both things waiting,
> and for things that are being woken.
> 
> If a closure that is woken decides to wait again, it will corrupt the
> llist that __closure_wake_up is using.
> 
> The previous iteration structure gets the next element of the list
> before waking and is therefore safe.

Do you mean we have to use llist_for_each_entry_safe() instead of the
non-safe version?
Is it OK if we use it instead?

> Mike


Re: [PATCH v4 1/7] md: Make md resync and reshape threads freezable

2017-09-26 Thread Hannes Reinecke
On 09/25/2017 10:29 PM, Bart Van Assche wrote:
> Some people use the md driver on laptops and use the suspend and
> resume functionality. Since it is essential that submitting of
> new I/O requests stops before device quiescing starts, make the
> md resync and reshape threads freezable.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  drivers/md/md.c | 21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v4 2/7] block: Make q_usage_counter also track legacy requests

2017-09-26 Thread Hannes Reinecke
On 09/25/2017 10:29 PM, Bart Van Assche wrote:
> From: Ming Lei 
> 
> This patch makes it possible to pause request allocation for
> the legacy block layer by calling blk_mq_freeze_queue() and
> blk_mq_unfreeze_queue().
> 
> Signed-off-by: Ming Lei 
> [ bvanassche: Combined two patches into one, edited a comment and made sure
>   REQ_NOWAIT is handled properly in blk_old_get_request() ]
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-core.c | 12 
>  block/blk-mq.c   | 10 ++
>  2 files changed, 14 insertions(+), 8 deletions(-)
> 
I actually had a customer report hitting a q_usage_counter underflow,
so this is a real bugfix. And I'd advocate pushing it independently of
the overall patchset, should the latter be delayed even further.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)