[PATCH RFC] Block/blk-wbt: do not let background writes block sync writes

2017-08-25 Thread Liu Bo
While using blk-wbt, sync writes are sometimes blocked by background
writes, for example:

a) a background write reaches the (background) limit returned by
get_limit(), so it is added to rqw->wait (the non-kswapd rqw in this
case) and goes to sleep.

b) a sync write is then queued and also goes to sleep, because it
finds that waitqueue_active() returns true and someone else is
already on the waiting list.

Thus, the sync write only gets its request after the background
write has gotten one.

With this patch, only background writes check the waitqueue's status;
sync writes are throttled only by the (max) limit returned by
get_limit().

Signed-off-by: Liu Bo 
---

- Besides the above problem, it seems waitqueue_active() also requires
  a smp_mb().
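
  For reference, a minimal sketch (not part of this patch) of the
  waker-side ordering that note refers to, assuming the completion path
  decrements rqw->inflight before waking waiters:

	/*
	 * Hedged sketch only: the usual waitqueue_active() pattern needs a
	 * full barrier between making progress visible and checking for
	 * waiters, otherwise a concurrent prepare_to_wait() can miss the
	 * wakeup.
	 */
	atomic_dec(&rqw->inflight);	/* publish the freed slot ... */
	smp_mb();			/* ... before checking for sleepers */
	if (waitqueue_active(&rqw->wait))
		wake_up_all(&rqw->wait);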

 block/blk-wbt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 6a9a0f0..698d9f7 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -519,7 +519,8 @@ static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
 	 * If the waitqueue is already active and we are not the next
 	 * in line to be woken up, wait for our turn.
 	 */
-	if (waitqueue_active(&rqw->wait) &&
+	if ((rw & REQ_BACKGROUND) &&
+	    waitqueue_active(&rqw->wait) &&
 	    rqw->wait.head.next != &wait->entry)
 		return false;
 
-- 
2.9.4



kernel panic in generic_make_request() in block/blk-core.c

2017-08-25 Thread sathyanarayanan kuppuswamy

Hi All,

I am trying to use the latest (4.13-rc6) kernel on my Android device 
(Intel APL SoC, running Android O). Sometimes during the boot process, 
while the MMC partition is being mounted, I hit the following kernel 
panic. It's not 100% reproducible, but I hit it twice in 10 cycles. 
I've copied the dmesg log below for your reference.


After my initial analysis I found that the panic happens in the 
generic_make_request(struct bio *bio) function in block/blk-core.c. To 
be exact, it happens when the q->make_request_fn pointer becomes NULL.


2194 ret = q->make_request_fn(q, bio);

I am wondering whether any of you have come across such an issue. Please 
let me know your comments.
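
In case it helps with debugging, here is a purely hypothetical
diagnostic (not a fix for the root cause) that could be dropped into
generic_make_request() to turn the NULL call into a warning plus a
failed bio instead of an oops:

	/* Hypothetical debugging aid only: catch a request queue whose
	 * make_request_fn was never set or was cleared during teardown. */
	if (WARN_ONCE(!q->make_request_fn,
		      "request_queue %p has no make_request_fn\n", q)) {
		bio_io_error(bio);
		ret = BLK_QC_T_NONE;
	} else {
		ret = q->make_request_fn(q, bio);
	}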


jmp 0x0010 (setup @0x14430910)
[0.262030] ACPI Error: Method parse/execution failed \_SB.PCI0.I2C4.HDAC._CRS, AE_AML_NO_R)
[0.262030] ACPI Error: Method execution failed \_SB.PCI0.I2C4.HDAC._CRS, AE_AML_NO_RESOURC)
[0.291237] ACPI Error: Method parse/execution failed \_SB.PCI0.I2C4.HDAC._CRS, AE_AML_NO_R)
[0.291618] ACPI Error: Method execution failed \_SB.PCI0.I2C4.HDAC._CRS, AE_AML_NO_RESOURC)
[0.299845] dmi: Firmware registration failed.
[0.355993] intel_punit_ipc intel_punit_ipc: can't request region for resource [mem 0x0]
[0.356507] intel_punit_ipc intel_punit_ipc: can't request region for resource [mem 0x0]
[0.356782] intel_punit_ipc intel_punit_ipc: can't request region for resource [mem 0x0]
[0.357092] intel_punit_ipc intel_punit_ipc: can't request region for resource [mem 0x0]
[0.376350] i915 :00:02.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x07c
[0.784294] intel_powerclamp: CPU does not support MWAIT
[0.792623] BUG: sleeping function called from invalid context at /var/work/CodeBase/androi8
[0.807057] in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: swapper/0
[0.814466] Preemption disabled at:
[0.822490] genirq: Setting trigger mode 3 for irq 348 failed (intel_gpio_irq_type+0x0/0x13)
[0.833719] dmi-sysfs: dmi entry is absent.
[0.841197] sth 0-sth: stm_register_device failed
[0.858540] device-mapper: table: 253:0: verity: Data device lookup failed
[0.866429] BUG: unable to handle kernel NULL pointer dereference at (null)
[0.875181] IP:   (null)
[0.878784] PGD 0
[0.878785] P4D 0
[0.881027]
[0.884925] Oops: 0010 [#1] PREEMPT SMP
[0.889206] Modules linked in:
[0.892621] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G U  W   4.13.0-rc6-quilt-2e5dc01
[0.902446] task: 8c1676a06040 task.stack: ad718001
[0.909054] RIP: 0010:  (null)
[0.913239] RSP: :ad7180013a30 EFLAGS: 00010246
[0.915596] mmc1: new HS400 MMC card at address 0001
[0.915787] mmcblk1: mmc1:0001 R1J56L 14.7 GiB
[0.915858] mmcblk1boot0: mmc1:0001 R1J56L partition 1 4.00 MiB
[0.915925] mmcblk1boot1: mmc1:0001 R1J56L partition 2 4.00 MiB
[0.915995] mmcblk1rpmb: mmc1:0001 R1J56L partition 3 4.00 MiB
[0.918195]  mmcblk1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 p19
[0.958453] RAX:  RBX: 8c16749f1cc0 RCX: 
[0.966429] RDX:  RSI: 8c16749f1cc0 RDI: 8c16749c3370
[0.974404] RBP: ad7180013a88 R08: 0001 R09: 8c16749f1d30
[0.982372] R10: fc51c9d34d40 R11: 8c16754c72c0 R12: 8c16749c3370
[0.990334] R13:  R14:  R15: 
[0.998308] FS:  () GS:8c167fc8() knlGS:
[1.007354] CS:  0010 DS:  ES:  CR0: 80050033
[1.013771] CR2:  CR3: 00019a212000 CR4: 003406e0
[1.021757] Call Trace:
[1.024499]  ? generic_make_request+0x122/0x320
[1.029565]  submit_bio+0x73/0x160
[1.033363]  submit_bh_wbc.isra.44+0x113/0x140
[1.038323]  __bread_gfp+0x67/0x120
[1.042207]  ext4_fill_super+0x184/0x3880
[1.046684]  ? vsnprintf+0x201/0x490
[1.050676]  ? set_bdev_super+0x30/0x30
[1.054958]  ? snprintf+0x43/0x60
[1.058656]  mount_bdev+0x17d/0x1b0
[1.062548]  ? ext4_calculate_overhead+0x430/0x430
[1.067905]  ext4_mount+0x15/0x20
[1.071603]  mount_fs+0x153/0x180
[1.075302]  vfs_kern_mount+0x90/0x180
[1.079488]  do_mount+0x1e0/0xd00
[1.083189]  ? _copy_from_user+0x60/0xb0
[1.087570]  ? memdup_user+0x53/0x80
[1.091562]  SyS_mount+0x94/0xd0
[1.095166]  mount_block_root+0x105/0x2c4
[1.099641]  mount_root+0x6d/0x71
[1.103340]  prepare_namespace+0x172/0x19f
[1.107911]  kernel_init_freeable+0x21f/0x243
[1.112774]  ? rest_init+0xd0/0xd0
[1.116569]  kernel_init+0xe/0x100
[1.120364]  ret_from_fork+0x27/0x40
[1.124347] Code:  Bad RIP value.
[1.128051] RIP:   (null) RSP: ad7180013a30
[1.133879] CR2: 
[1.137579] ---[ end trace b3d4f4f1eb2bcb37 ]---

Re: [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH

2017-08-25 Thread Bart Van Assche
On Thu, 2017-08-24 at 14:52 +0800, Ming Lei wrote:
> On Tue, Aug 22, 2017 at 09:55:57PM +, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > + /*
> > > +  * if there is q->queue_depth, all hw queues share
> > > +  * this queue depth limit
> > > +  */
> > > + if (q->queue_depth) {
> > > + queue_for_each_hw_ctx(q, hctx, i)
> > > + hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
> > > + }
> > > +
> > > + if (!q->elevator)
> > > + goto exit;
> > 
> > Hello Ming,
> > 
> > It seems very fragile to me to set BLK_MQ_F_SHARED_DEPTH if and only if
> > q->queue_depth != 0. Wouldn't it be better to let the block driver tell the
> > block layer whether or not there is a queue depth limit across hardware
> > queues, e.g. through a tag set flag?
> 
> One reason for not doing it that way is that q->queue_depth can be
> changed via the sysfs interface.

Only the SCSI core allows the queue depth to be changed through sysfs. The
other block drivers I am familiar with set the queue depth when the block
layer queue is created and do not allow the queue depth to be changed later
on.

> Another reason is that it is better not to expose this flag to drivers
> since it isn't necessary; it really is an internal flag.

As far as I know only the SCSI core can create request queues that have a
queue depth that is lower than the number of tags in the tag set. So for all
block drivers except the SCSI core it is OK to dispatch all requests at once
to which a hardware tag has been assigned. My proposal is either to let
drivers like the SCSI core set BLK_MQ_F_SHARED_DEPTH or to modify
blk_mq_sched_dispatch_requests() such that it flushes all requests if the
request queue depth is not lower than the hardware tag set size instead of if
q->queue_depth == 0.
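
To make the first alternative concrete, a rough sketch (not a real
patch; BLK_MQ_F_TAG_SHARED_DEPTH is a made-up tag set flag name) of
letting the driver declare the limit could look like this:

	/* Hypothetical: a driver such as the SCSI core sets this flag in its
	 * tag set when the per-queue depth can be lower than the tag count. */
	if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED_DEPTH) {
		queue_for_each_hw_ctx(q, hctx, i)
			hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
	}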

Bart.

Re: [PATCH V2 03/20] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()

2017-08-25 Thread Bart Van Assche
On Thu, 2017-08-24 at 12:52 +0800, Ming Lei wrote:
> On Tue, Aug 22, 2017 at 06:45:46PM +, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > More importantly, for some SCSI devices, driver
> > > tags are host wide, and the number is quite big,
> > > but each lun has very limited queue depth.
> > 
> > This may be the case but is not always the case. Another important use-case
> > is one LUN per host and where the queue depth per LUN is identical to the
> > number of host tags.
> 
> This patchset won't hurt this case because the BUSY info is returned
> from the driver.  In this case, BLK_STS_RESOURCE should seldom be
> returned from .queue_rq.
> 
> Also one important fact is that once q->queue_depth is set, there is a
> per-request_queue limit on pending I/Os, and the single-LUN case is
> just the special case which is covered by this whole patchset. We
> don't need to pay special attention to this special case at all.

The purpose of my comment was not to ask for further clarification but to
report that the description of this patch is not correct.

> > 
> > > +struct request *blk_mq_dispatch_rq_from_ctx(struct blk_mq_hw_ctx *hctx,
> > > + struct blk_mq_ctx *start)
> > > +{
> > > + unsigned off = start ? start->index_hw : 0;
> > 
> > Please consider to rename this function into 
> > blk_mq_dispatch_rq_from_next_ctx()
> > and to start from start->index_hw + 1 instead of start->index_hw. I think 
> > that
> > will not only result in simpler but also in faster code.
> 
> I believe this helper together with blk_mq_next_ctx(hctx, rq->mq_ctx)
> will be much simpler and easier to implement, and the code can be much
> more readable too.
> 
> blk_mq_dispatch_rq_from_next_ctx() is ugly and mixes two things
> together.

Sorry but I disagree with both of the above statements.

Bart.

Re: [PATCH V2 02/20] sbitmap: introduce __sbitmap_for_each_set()

2017-08-25 Thread Bart Van Assche
On Thu, 2017-08-24 at 11:57 +0800, Ming Lei wrote:
> On Tue, Aug 22, 2017 at 06:28:54PM +, Bart Van Assche wrote:
> > * Whether or not index >= sb->map_nr. I propose to start iterating from the
> >   start of @sb in this case.
> 
> It has been checked at the end of the loop.

That's not sufficient to avoid an out-of-bounds access if the start index is
large. If __sbitmap_for_each_set() would accept values for the start index
argument that result in index >= sb->map_nr then that will simplify code that
accesses an sbitmap in a round-robin fashion.

> > }
> > 
> > while (true) {
> > struct sbitmap_word *word = &sb->map[i];
> > unsigned int off;
> 
> Looks like you removed the check on 'word->word'.

Yes, and I did that on purpose. If the start index refers to a word that is
zero then the "if (word->word) continue;" code will cause the end-of-loop
check to be skipped and hence will cause an infinite loop.
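
For what it's worth, a rough illustration (not a patch) of a loop that
addresses both concerns (clamping an out-of-range start index and
bounding the scan by the number of words, so a zero word cannot loop
forever) could look like this; the names mirror the sbitmap internals,
but the snippet is only a sketch:

	unsigned int index, scanned;

	if (start >= sb->map_nr)	/* tolerate an out-of-range start */
		start = 0;
	index = start;
	for (scanned = 0; scanned < sb->map_nr; scanned++) {
		struct sbitmap_word *word = &sb->map[index];

		if (word->word) {
			/* ... visit the set bits in this word ... */
		}
		if (++index >= sb->map_nr)	/* wrap around */
			index = 0;
	}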

Bart.

Re: [PATCH 0/4] Four more patches for the sTec s1120 driver (skd)

2017-08-25 Thread Jens Axboe
On 08/25/2017 03:24 PM, Bart Van Assche wrote:
> Hello Jens,
> 
> The previous series of patches for the sTec s1120 driver was followed by two
> requests from Christoph to reorganize the code slightly and one report from
> Dan Carpenter with a (false positive) Smatch report. This patch series
> addresses that feedback and includes a fourth patch I came up with
> myself. Please consider these patches for kernel v4.14.

LGTM, applied for 4.14, thanks.

-- 
Jens Axboe



[PATCH 4/4] skd: Remove SKD_ID_INCR

2017-08-25 Thread Bart Van Assche
The SKD_ID_INCR flag in skd_request_context.id duplicates information
that is already available otherwise, e.g. through the block layer
request state and through skd_request_context.state. Hence remove
the code that manipulates this flag and also the flag itself.
Since skd_isr_completion_posted() only uses the lower bits of
skd_request_context.id as hardware tag, this patch does not change
the behavior of the skd driver. I'm referring to the following code:

tag = req_id & SKD_ID_SLOT_AND_TABLE_MASK;

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/block/skd_main.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index 34188a600bfa..00a86252b3c5 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -89,7 +89,6 @@ MODULE_DESCRIPTION("STEC s1120 PCIe SSD block driver");
  sizeof(struct fit_comp_error_info)) * SKD_N_COMPLETION_ENTRY)
 
 /* 5 bits of uniqifier, 0xF800 */
-#define SKD_ID_INCR (0x400)
 #define SKD_ID_TABLE_MASK   (3u << 8u)
 #define  SKD_ID_RW_REQUEST  (0u << 8u)
 #define  SKD_ID_INTERNAL(1u << 8u)
@@ -921,9 +920,7 @@ static void skd_send_internal_skspcl(struct skd_device *skdev,
 */
return;
 
-   SKD_ASSERT((skspcl->req.id & SKD_ID_INCR) == 0);
skspcl->req.state = SKD_REQ_STATE_BUSY;
-   skspcl->req.id += SKD_ID_INCR;
 
	scsi = &skspcl->msg_buf->scsi[0];
scsi->hdr.tag = skspcl->req.id;
@@ -1044,7 +1041,6 @@ static void skd_complete_internal(struct skd_device *skdev,
 
skspcl->req.completion = *skcomp;
skspcl->req.state = SKD_REQ_STATE_IDLE;
-   skspcl->req.id += SKD_ID_INCR;
 
status = skspcl->req.completion.status;
 
@@ -1451,7 +1447,6 @@ static void skd_release_skreq(struct skd_device *skdev,
 * Reclaim the skd_request_context
 */
skreq->state = SKD_REQ_STATE_IDLE;
-   skreq->id += SKD_ID_INCR;
 }
 
 static int skd_isr_completion_posted(struct skd_device *skdev,
-- 
2.14.1



[PATCH 2/4] skd: Inline skd_end_request()

2017-08-25 Thread Bart Van Assche
It is not worth keeping the debug statements in skd_end_request().
Without the debug statements that function consists of only two
statements. Hence inline skd_end_request().

Suggested-by: Christoph Hellwig 
Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/block/skd_main.c | 45 +
 1 file changed, 13 insertions(+), 32 deletions(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index a55c8ef1a21d..8ae0320f02b5 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -360,8 +360,6 @@ static void skd_send_fitmsg(struct skd_device *skdev,
struct skd_fitmsg_context *skmsg);
 static void skd_send_special_fitmsg(struct skd_device *skdev,
struct skd_special_context *skspcl);
-static void skd_end_request(struct skd_device *skdev, struct request *req,
-   blk_status_t status);
 static bool skd_preop_sg_list(struct skd_device *skdev,
 struct skd_request_context *skreq);
 static void skd_postop_sg_list(struct skd_device *skdev,
@@ -520,8 +518,8 @@ static blk_status_t skd_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 
if (req->bio && !skd_preop_sg_list(skdev, skreq)) {
		dev_dbg(&skdev->pdev->dev, "error Out\n");
-   skd_end_request(skdev, blk_mq_rq_from_pdu(skreq),
-   BLK_STS_RESOURCE);
+   skreq->status = BLK_STS_RESOURCE;
+   blk_mq_complete_request(req);
return BLK_STS_OK;
}
 
@@ -608,27 +606,6 @@ static enum blk_eh_timer_return skd_timed_out(struct request *req,
return BLK_EH_RESET_TIMER;
 }
 
-static void skd_end_request(struct skd_device *skdev, struct request *req,
-   blk_status_t error)
-{
-   struct skd_request_context *skreq = blk_mq_rq_to_pdu(req);
-
-   if (unlikely(error)) {
-   char *cmd = (rq_data_dir(req) == READ) ? "read" : "write";
-   u32 lba = (u32)blk_rq_pos(req);
-   u32 count = blk_rq_sectors(req);
-
-   dev_err(&skdev->pdev->dev,
-   "Error cmd=%s sect=%u count=%u id=0x%x\n", cmd, lba,
-   count, req->tag);
-   } else
-   dev_dbg(&skdev->pdev->dev, "id=0x%x error=%d\n", req->tag,
-   error);
-
-   skreq->status = error;
-   blk_mq_complete_request(req);
-}
-
 static void skd_complete_rq(struct request *req)
 {
struct skd_request_context *skreq = blk_mq_rq_to_pdu(req);
@@ -1438,7 +1415,8 @@ static void skd_resolve_req_exception(struct skd_device *skdev,
 	switch (skd_check_status(skdev, cmp_status, &skreq->err_info)) {
case SKD_CHECK_STATUS_REPORT_GOOD:
case SKD_CHECK_STATUS_REPORT_SMART_ALERT:
-   skd_end_request(skdev, req, BLK_STS_OK);
+   skreq->status = BLK_STS_OK;
+   blk_mq_complete_request(req);
break;
 
case SKD_CHECK_STATUS_BUSY_IMMINENT:
@@ -1460,7 +1438,8 @@ static void skd_resolve_req_exception(struct skd_device *skdev,
 
case SKD_CHECK_STATUS_REPORT_ERROR:
default:
-   skd_end_request(skdev, req, BLK_STS_IOERR);
+   skreq->status = BLK_STS_IOERR;
+   blk_mq_complete_request(req);
break;
}
 }
@@ -1579,10 +1558,12 @@ static int skd_isr_completion_posted(struct skd_device *skdev,
/*
 * Capture the outcome and post it back to the native request.
 */
-   if (likely(cmp_status == SAM_STAT_GOOD))
-   skd_end_request(skdev, rq, BLK_STS_OK);
-   else
+   if (likely(cmp_status == SAM_STAT_GOOD)) {
+   skreq->status = BLK_STS_OK;
+   blk_mq_complete_request(rq);
+   } else {
skd_resolve_req_exception(skdev, skreq, rq);
+   }
 
/* skd_isr_comp_limit equal zero means no limit */
if (limit) {
@@ -1926,8 +1907,8 @@ static void skd_recover_request(struct request *req, void *data, bool reserved)
skd_postop_sg_list(skdev, skreq);
 
skreq->state = SKD_REQ_STATE_IDLE;
-
-   skd_end_request(skdev, req, BLK_STS_IOERR);
+   skreq->status = BLK_STS_IOERR;
+   blk_mq_complete_request(req);
 }
 
 static void skd_recover_requests(struct skd_device *skdev)
-- 
2.14.1



[PATCH 3/4] skd: Make it easier for static analyzers to analyze skd_free_disk()

2017-08-25 Thread Bart Van Assche
Although it is easy to see that skdev->disk != NULL if skdev->queue
!= NULL, add a test for skdev->disk to keep smatch from reporting the
following warning:

drivers/block/skd_main.c:3080 skd_free_disk()
 error: we previously assumed 'disk' could be null (see line 3074)

Reported-by: Dan Carpenter 
Signed-off-by: Bart Van Assche 
Cc: Dan Carpenter 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/block/skd_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index 8ae0320f02b5..34188a600bfa 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -3041,7 +3041,8 @@ static void skd_free_disk(struct skd_device *skdev)
if (skdev->queue) {
blk_cleanup_queue(skdev->queue);
skdev->queue = NULL;
-   disk->queue = NULL;
+   if (disk)
+   disk->queue = NULL;
}
 
if (skdev->tag_set.tags)
-- 
2.14.1



[PATCH 1/4] skd: Rename skd_softirq_done() into skd_complete_rq()

2017-08-25 Thread Bart Van Assche
The latter name more closely follows the function names used in
other blk-mq drivers.

Suggested-by: Christoph Hellwig 
Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/block/skd_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index 577618c57975..a55c8ef1a21d 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -629,7 +629,7 @@ static void skd_end_request(struct skd_device *skdev, struct request *req,
blk_mq_complete_request(req);
 }
 
-static void skd_softirq_done(struct request *req)
+static void skd_complete_rq(struct request *req)
 {
struct skd_request_context *skreq = blk_mq_rq_to_pdu(req);
 
@@ -2821,7 +2821,7 @@ static int skd_cons_sksb(struct skd_device *skdev)
 
 static const struct blk_mq_ops skd_mq_ops = {
.queue_rq   = skd_mq_queue_rq,
-   .complete   = skd_softirq_done,
+   .complete   = skd_complete_rq,
.timeout= skd_timed_out,
.init_request   = skd_init_request,
.exit_request   = skd_exit_request,
-- 
2.14.1



[PATCH 0/4] Four more patches for the sTec s1120 driver (skd)

2017-08-25 Thread Bart Van Assche
Hello Jens,

The previous series of patches for the sTec s1120 driver was followed by two
requests from Christoph to reorganize the code slightly and one report from
Dan Carpenter with a (false positive) Smatch report. This patch series
addresses that feedback and includes a fourth patch I came up with
myself. Please consider these patches for kernel v4.14.

Thanks,

Bart.

Bart Van Assche (4):
  skd: Rename skd_softirq_done() into skd_complete_rq()
  skd: Inline skd_end_request()
  skd: Make it easier for static analyzers to analyze skd_free_disk()
  skd: Remove SKD_ID_INCR

 drivers/block/skd_main.c | 57 +++-
 1 file changed, 17 insertions(+), 40 deletions(-)

-- 
2.14.1



Re: [PATCH] block/nullb: fix NULL deference

2017-08-25 Thread Jens Axboe
On 08/25/2017 02:46 PM, Shaohua Li wrote:
> Dan reported this:
> 
> The patch 2984c8684f96: "nullb: factor disk parameters" from Aug 14,
> 2017, leads to the following Smatch complaint:
> 
> drivers/block/null_blk.c:1759 null_init_tag_set()
>error: we previously assumed 'nullb' could be null (see line
> 1750)
> 
>   1755set->cmd_size   = sizeof(struct nullb_cmd);
>   1756set->flags = BLK_MQ_F_SHOULD_MERGE;
>   1757set->driver_data = NULL;
>   1758
>   1759if (nullb->dev->blocking)
> 
> And an unchecked dereference.
> 
> nullb could be NULL here.

Applied, thanks. Fixed up your subject line typo.

-- 
Jens Axboe



[PATCH] block/nullb: fix NULL deference

2017-08-25 Thread Shaohua Li
Dan reported this:

The patch 2984c8684f96: "nullb: factor disk parameters" from Aug 14,
2017, leads to the following Smatch complaint:

drivers/block/null_blk.c:1759 null_init_tag_set()
 error: we previously assumed 'nullb' could be null (see line
1750)

  1755  set->cmd_size   = sizeof(struct nullb_cmd);
  1756  set->flags = BLK_MQ_F_SHOULD_MERGE;
  1757  set->driver_data = NULL;
  1758
  1759  if (nullb->dev->blocking)

And an unchecked dereference.

nullb could be NULL here.

Reported-by: Dan Carpenter 
Signed-off-by: Shaohua Li 
---
 drivers/block/null_blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 2032360..4d328e3 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1756,7 +1756,7 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
set->flags = BLK_MQ_F_SHOULD_MERGE;
set->driver_data = NULL;
 
-   if (nullb->dev->blocking)
+   if ((nullb && nullb->dev->blocking) || g_blocking)
set->flags |= BLK_MQ_F_BLOCKING;
 
return blk_mq_alloc_tag_set(set);
-- 
2.9.5



Re: [bug report] nullb: factor disk parameters

2017-08-25 Thread Jens Axboe
On 08/25/2017 02:24 PM, Dan Carpenter wrote:
> Hello Shaohua Li,
> 
> This is a semi-automatic email about new static checker warnings.
> 
> The patch 2984c8684f96: "nullb: factor disk parameters" from Aug 14, 
> 2017, leads to the following Smatch complaint:
> 
> drivers/block/null_blk.c:1759 null_init_tag_set()
>error: we previously assumed 'nullb' could be null (see line 1750)

That's a bug, for shared tags we passed in nullb == NULL.

-- 
Jens Axboe



[bug report] nullb: factor disk parameters

2017-08-25 Thread Dan Carpenter
Hello Shaohua Li,

This is a semi-automatic email about new static checker warnings.

The patch 2984c8684f96: "nullb: factor disk parameters" from Aug 14, 
2017, leads to the following Smatch complaint:

drivers/block/null_blk.c:1759 null_init_tag_set()
 error: we previously assumed 'nullb' could be null (see line 1750)

drivers/block/null_blk.c
  1749  set->ops = &null_mq_ops;
  1750  set->nr_hw_queues = nullb ? nullb->dev->submit_queues :
  1751  g_submit_queues;
  1752  set->queue_depth = nullb ? nullb->dev->hw_queue_depth :
  1753  g_hw_queue_depth;
  1754  set->numa_node = nullb ? nullb->dev->home_node : g_home_node;
 ^
The patch introduces a series of new NULL checks

  1755  set->cmd_size   = sizeof(struct nullb_cmd);
  1756  set->flags = BLK_MQ_F_SHOULD_MERGE;
  1757  set->driver_data = NULL;
  1758  
  1759  if (nullb->dev->blocking)

And an unchecked dereference.

  1760  set->flags |= BLK_MQ_F_BLOCKING;
  1761  

regards,
dan carpenter


Re: [bug report] skd: Avoid that module unloading triggers a use-after-free

2017-08-25 Thread Bart Van Assche
On Thu, 2017-08-24 at 21:36 +0300, Dan Carpenter wrote:
> On Thu, Aug 24, 2017 at 03:04:12PM +, Bart Van Assche wrote:
> > If you have a look at skd_cons_disk() you will see that skdev->queue != NULL
> > implies that skdev->disk != NULL. So I think the above report is a false
> > positive.
> 
> Oh, yeah.  You're right.  Thanks for taking a look at this.

Thank you for all the work you have done and are still doing on smatch and on
verifying all new code!

Bart.

Re: [PATCH] blkcg: avoid free blkcg_root when failed to alloc blkcg policy

2017-08-25 Thread Jens Axboe
On 08/25/2017 09:49 AM, weiping zhang wrote:
> This patch fixes two errors: first, avoid kfree()ing blkcg_root; second,
> do not free blkcg if the blkcg allocation failed (blkcg == NULL), just
> unlock the mutex.

Looks good, applied.

-- 
Jens Axboe



[PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-25 Thread Bart Van Assche
Since .initialize_rq_fn() is called from inside blk_get_request(),
that callback is only invoked for pass-through requests and not for
filesystem requests. There is agreement in the SCSI community that
the jiffies_at_alloc and retries members of struct scsi_cmnd
should be initialized at request allocation time instead of when
preparing a request. This will allow preserving the value of these
members when requeuing a request. Since the block layer does not
yet allow block drivers to initialize filesystem requests without
overriding request_queue.make_request_fn, move the
.initialize_rq_fn() calls from blk_get_request() into

See also https://www.spinics.net/lists/linux-scsi/msg108934.html.
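
For context, the kind of SCSI-side initializer this change is meant to
enable would look roughly like the sketch below (illustrative only; it
simply shows allocation-time state being set once so that it survives a
requeue):

	static void scsi_initialize_rq(struct request *rq)
	{
		struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);

		/* set at allocation time so a requeue does not reset them */
		cmd->jiffies_at_alloc = jiffies;
		cmd->retries = 0;
	}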

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Martin K. Petersen 
Cc: Hannes Reinecke 
---
 block/blk-core.c | 20 +++-
 block/blk-mq.c   |  4 
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 78a46794053e..463aa0ee36ce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,9 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
+
+   if (q && q->initialize_rq_fn)
+   q->initialize_rq_fn(rq);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1420,21 +1423,12 @@ static struct request *blk_old_get_request(struct request_queue *q,
 struct request *blk_get_request(struct request_queue *q, unsigned int op,
gfp_t gfp_mask)
 {
-   struct request *req;
-
-   if (q->mq_ops) {
-   req = blk_mq_alloc_request(q, op,
+   if (q->mq_ops)
+   return blk_mq_alloc_request(q, op,
(gfp_mask & __GFP_DIRECT_RECLAIM) ?
0 : BLK_MQ_REQ_NOWAIT);
-   if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
-   q->mq_ops->initialize_rq_fn(req);
-   } else {
-   req = blk_old_get_request(q, op, gfp_mask);
-   if (!IS_ERR(req) && q->initialize_rq_fn)
-   q->initialize_rq_fn(req);
-   }
-
-   return req;
+   else
+   return blk_old_get_request(q, op, gfp_mask);
 }
 EXPORT_SYMBOL(blk_get_request);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 21f2cff217ce..fdb33f8a2860 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -273,6 +273,7 @@ EXPORT_SYMBOL(blk_mq_can_queue);
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
unsigned int tag, unsigned int op)
 {
+   struct request_queue *q = data->q;
struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
struct request *rq = tags->static_rqs[tag];
 
@@ -325,6 +326,9 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->end_io_data = NULL;
rq->next_rq = NULL;
 
+   if (q->mq_ops->initialize_rq_fn)
+   q->mq_ops->initialize_rq_fn(rq);
+
data->ctx->rq_dispatched[op_is_sync(op)]++;
return rq;
 }
-- 
2.14.1



Re: dm-rq: do not update rq partially in each ending bio

2017-08-25 Thread Mike Snitzer
On Fri, Aug 25 2017 at  1:07pm -0400,
Ming Lei  wrote:

> On Fri, Aug 25, 2017 at 12:32:33PM -0400, Mike Snitzer wrote:
> > On Fri, Aug 25 2017 at 12:08pm -0400,
> > Ming Lei  wrote:
> > 
> > > On Fri, Aug 25, 2017 at 11:48:39AM -0400, Mike Snitzer wrote:
> > > > On Fri, Aug 25 2017 at 11:27am -0400,
> > > > Ming Lei  wrote:
> > > > 
> > > > > We don't need to update orignal dm request partially
> > > > > when ending each cloned bio, and this patch just
> > > > > updates orignal dm request once when the whole
> > > > > cloned request is finished.
> > > > > 
> > > > > Partial request update can be a bit expensive, so
> > > > > we should try to avoid it, especially it is run
> > > > > in softirq context.
> > > > > 
> > > > > After this patch is applied, both hard lockup and
> > > > > soft lockup aren't reproduced any more in one hour
> > > > > of running Laurence's test[1] on IB/SRP. Without
> > > > > this patch, the lockup can be reproduced in several
> > > > > minutes.
> > > > > 
> > > > > BTW, after d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list()
> > > > > rerun the queue at a quiet time), we need to make the
> > > > > test more aggressive for reproducing the lockup:
> > > > > 
> > > > >   1) run hammer_write.sh 32 or 64 concurrently.
> > > > >   2) write 8M each time
> > > > > 
> > > > > [1] https://marc.info/?l=linux-block=150220185510245=2
> > > > 
> > > > Bart said he cannot reproduce the lockups with his patchset applied.
> > > > Have you tested using Bart's patchset?
> > > 
> > > d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list() rerun the
> > > queue at a quiet time) has been in linus tree.
> > > 
> > > For other patches, I didn't test it yet. Because every time
> > > when the lockup is triggered, it is always in blk_recalc_rq_segments(),
> > > and not see any patch is dealing with that.
> > 
> > Please test with all of Bart's patches applied!
> 
> I've just done the test with Bart's patches, and I can still see the
> soft lockup when running the test described in the commit log for a
> couple of minutes. BTW, my test is much more aggressive than
> Laurence's: I write 8M each time and run 64 hammer_write.sh instances
> concurrently.

OK, thanks for verifying as much.
 
> I don't think the two are contradictory. Anyway, this patch will
> decrease the CPU utilization of softirq, and it is an improvement.

OK, looking closer I can see you accomplish the same (retaining partial
completion) in a more optimal way.

I'll include this in my review/staging of dm-multipath patches for 4.14.

Thanks,
Mike


Re: dm-rq: do not update rq partially in each ending bio

2017-08-25 Thread Ming Lei
On Sat, Aug 26, 2017 at 01:07:53AM +0800, Ming Lei wrote:
> On Fri, Aug 25, 2017 at 12:32:33PM -0400, Mike Snitzer wrote:
> > On Fri, Aug 25 2017 at 12:08pm -0400,
> > Ming Lei  wrote:
> > 
> > > On Fri, Aug 25, 2017 at 11:48:39AM -0400, Mike Snitzer wrote:
> > > > On Fri, Aug 25 2017 at 11:27am -0400,
> > > > Ming Lei  wrote:
> > > > 
> > > > > We don't need to update orignal dm request partially
> > > > > when ending each cloned bio, and this patch just
> > > > > updates orignal dm request once when the whole
> > > > > cloned request is finished.
> > > > > 
> > > > > Partial request update can be a bit expensive, so
> > > > > we should try to avoid it, especially it is run
> > > > > in softirq context.
> > > > > 
> > > > > After this patch is applied, both hard lockup and
> > > > > soft lockup aren't reproduced any more in one hour
> > > > > of running Laurence's test[1] on IB/SRP. Without
> > > > > this patch, the lockup can be reproduced in several
> > > > > minutes.
> > > > > 
> > > > > BTW, after d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list()
> > > > > rerun the queue at a quiet time), we need to make the
> > > > > test more aggressive for reproducing the lockup:
> > > > > 
> > > > >   1) run hammer_write.sh 32 or 64 concurrently.
> > > > >   2) write 8M each time
> > > > > 
> > > > > [1] https://marc.info/?l=linux-block=150220185510245=2
> > > > 
> > > > Bart said he cannot reproduce the lockups with his patchset applied.
> > > > Have you tested using Bart's patchset?
> > > 
> > > d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list() rerun the
> > > queue at a quiet time) has been in linus tree.
> > > 
> > > For other patches, I didn't test it yet. Because every time
> > > when the lockup is triggered, it is always in blk_recalc_rq_segments(),
> > > and not see any patch is dealing with that.
> > 
> > Please test with all of Bart's patches applied!
> 
> Just done the test with Bart's patch, still can
> see soft lockup when running the test described

Looks like there is no difference: with Bart's patches a hard lockup can
also be observed following the soft lockup after a while.

-- 
Ming


Re: dm-rq: do not update rq partially in each ending bio

2017-08-25 Thread Ming Lei
On Fri, Aug 25, 2017 at 12:32:33PM -0400, Mike Snitzer wrote:
> On Fri, Aug 25 2017 at 12:08pm -0400,
> Ming Lei  wrote:
> 
> > On Fri, Aug 25, 2017 at 11:48:39AM -0400, Mike Snitzer wrote:
> > > On Fri, Aug 25 2017 at 11:27am -0400,
> > > Ming Lei  wrote:
> > > 
> > > > We don't need to update orignal dm request partially
> > > > when ending each cloned bio, and this patch just
> > > > updates orignal dm request once when the whole
> > > > cloned request is finished.
> > > > 
> > > > Partial request update can be a bit expensive, so
> > > > we should try to avoid it, especially it is run
> > > > in softirq context.
> > > > 
> > > > After this patch is applied, both hard lockup and
> > > > soft lockup aren't reproduced any more in one hour
> > > > of running Laurence's test[1] on IB/SRP. Without
> > > > this patch, the lockup can be reproduced in several
> > > > minutes.
> > > > 
> > > > BTW, after d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list()
> > > > rerun the queue at a quiet time), we need to make the
> > > > test more aggressive for reproducing the lockup:
> > > > 
> > > > 1) run hammer_write.sh 32 or 64 concurrently.
> > > > 2) write 8M each time
> > > > 
> > > > [1] https://marc.info/?l=linux-block=150220185510245=2
> > > 
> > > Bart said he cannot reproduce the lockups with his patchset applied.
> > > Have you tested using Bart's patchset?
> > 
> > d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list() rerun the
> > queue at a quiet time) has been in linus tree.
> > 
> > For other patches, I didn't test it yet. Because every time
> > when the lockup is triggered, it is always in blk_recalc_rq_segments(),
> > and not see any patch is dealing with that.
> 
> Please test with all of Bart's patches applied!

I've just done the test with Bart's patches, and I can still see the
soft lockup when running the test described in the commit log for a
couple of minutes. BTW, my test is much more aggressive than
Laurence's: I write 8M each time and run 64 hammer_write.sh instances
concurrently.

I don't think the two are contradictory. Anyway, this patch will
decrease the CPU utilization of softirq, and it is an improvement.

> 
> > > Comments inlined below.
> > > 
> > > > ---
> > > >  drivers/md/dm-rq.c | 15 +--
> > > >  drivers/md/dm-rq.h |  1 +
> > > >  2 files changed, 6 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > > > index c6ebc5b1e00e..50cd96c7de45 100644
> > > > --- a/drivers/md/dm-rq.c
> > > > +++ b/drivers/md/dm-rq.c
> > > > @@ -117,9 +117,9 @@ static void end_clone_bio(struct bio *clone)
> > > > struct dm_rq_clone_bio_info *info =
> > > > container_of(clone, struct dm_rq_clone_bio_info, clone);
> > > > struct dm_rq_target_io *tio = info->tio;
> > > > -   struct bio *bio = info->orig;
> > > > unsigned int nr_bytes = info->orig->bi_iter.bi_size;
> > > > blk_status_t error = clone->bi_status;
> > > > +   bool is_last = !clone->bi_next;
> > > >  
> > > > bio_put(clone);
> > > >  
> > > > @@ -144,21 +144,15 @@ static void end_clone_bio(struct bio *clone)
> > > >  * I/O for the bio successfully completed.
> > > >  * Notice the data completion to the upper layer.
> > > >  */
> > > > -
> > > > -   /*
> > > > -* bios are processed from the head of the list.
> > > > -* So the completing bio should always be rq->bio.
> > > > -* If it's not, something wrong is happening.
> > > > -*/
> > > > -   if (tio->orig->bio != bio)
> > > > -   DMERR("bio completion is going in the middle of the request");
> > > 
> > > Why did you remove this check?
> > 
> > The check isn't valid any more because this patch only updates the
> > original dm rq in the .end_io() of the last cloned bio.
> 
> Fair enough, so just a side-effect of your all or nothing completion handling.

It is not all or nothing; it can still do a partial update, just as the
current code does.

> 
> > > 
> > > > +   tio->completed += nr_bytes;
> > > >  
> > > > /*
> > > >  * Update the original request.
> > > >  * Do not use blk_end_request() here, because it may complete
> > > >  * the original request before the clone, and break the ordering.
> > > >  */
> > > > -   blk_update_request(tio->orig, BLK_STS_OK, nr_bytes);
> > > > +   if (is_last)
> > > > +   blk_update_request(tio->orig, BLK_STS_OK, tio->completed);
> > > >  }
> > > 
> > > Partial completion support is important given the potential for path
> > > failures interrupting requests.  Why do you think it is OK to avoid it
> > > by switching to an all or nothing approach?
> > 
> > If the cloned rq is partially completed, this dm rq is still partially
> > completed. This patch only updates the dm rq in the last cloned bio's
> > .end_io().
> 
> Which is exactly what we do _not_ want because it removed partial completion.

Re: [PATCH] block: update comments to reflect REQ_FLUSH -> REQ_PREFLUSH rename

2017-08-25 Thread Jens Axboe
On 08/24/2017 12:09 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Normally I wouldn't bother with this, but in my opinion the comments are
> the most important part of this whole file since without them no one
> would have any clue how this insanity works.

What are you talking about, that code is clear as day. But, uhm, applied
for 4.14.

-- 
Jens Axboe



Re: dm-rq: do not update rq partially in each ending bio

2017-08-25 Thread Mike Snitzer
On Fri, Aug 25 2017 at 12:08pm -0400,
Ming Lei  wrote:

> On Fri, Aug 25, 2017 at 11:48:39AM -0400, Mike Snitzer wrote:
> > On Fri, Aug 25 2017 at 11:27am -0400,
> > Ming Lei  wrote:
> > 
> > > We don't need to update orignal dm request partially
> > > when ending each cloned bio, and this patch just
> > > updates orignal dm request once when the whole
> > > cloned request is finished.
> > > 
> > > Partial request update can be a bit expensive, so
> > > we should try to avoid it, especially it is run
> > > in softirq context.
> > > 
> > > After this patch is applied, both hard lockup and
> > > soft lockup aren't reproduced any more in one hour
> > > of running Laurence's test[1] on IB/SRP. Without
> > > this patch, the lockup can be reproduced in several
> > > minutes.
> > > 
> > > BTW, after d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list()
> > > rerun the queue at a quiet time), we need to make the
> > > test more aggressive for reproducing the lockup:
> > > 
> > >   1) run hammer_write.sh 32 or 64 concurrently.
> > >   2) write 8M each time
> > > 
> > > [1] https://marc.info/?l=linux-block=150220185510245=2
> > 
> > Bart said he cannot reproduce the lockups with his patchset applied.
> > Have you tested using Bart's patchset?
> 
> d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list() rerun the
> queue at a quiet time) has been in linus tree.
> 
> For other patches, I didn't test it yet. Because every time
> when the lockup is triggered, it is always in blk_recalc_rq_segments(),
> and not see any patch is dealing with that.

Please test with all of Bart's patches applied!

> > Comments inlined below.
> > 
> > > ---
> > >  drivers/md/dm-rq.c | 15 +--
> > >  drivers/md/dm-rq.h |  1 +
> > >  2 files changed, 6 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > > index c6ebc5b1e00e..50cd96c7de45 100644
> > > --- a/drivers/md/dm-rq.c
> > > +++ b/drivers/md/dm-rq.c
> > > @@ -117,9 +117,9 @@ static void end_clone_bio(struct bio *clone)
> > >   struct dm_rq_clone_bio_info *info =
> > >   container_of(clone, struct dm_rq_clone_bio_info, clone);
> > >   struct dm_rq_target_io *tio = info->tio;
> > > - struct bio *bio = info->orig;
> > >   unsigned int nr_bytes = info->orig->bi_iter.bi_size;
> > >   blk_status_t error = clone->bi_status;
> > > + bool is_last = !clone->bi_next;
> > >  
> > >   bio_put(clone);
> > >  
> > > @@ -144,21 +144,15 @@ static void end_clone_bio(struct bio *clone)
> > >* I/O for the bio successfully completed.
> > >* Notice the data completion to the upper layer.
> > >*/
> > > -
> > > - /*
> > > -  * bios are processed from the head of the list.
> > > -  * So the completing bio should always be rq->bio.
> > > -  * If it's not, something wrong is happening.
> > > -  */
> > > - if (tio->orig->bio != bio)
> > > - DMERR("bio completion is going in the middle of the request");
> > 
> > Why did you remove this check?
> 
> The check isn't valid any more because this patch only updates the
> original dm rq in the .end_io() of the last cloned bio.

Fair enough, so just a side-effect of your all or nothing completion handling.

> > 
> > > + tio->completed += nr_bytes;
> > >  
> > >   /*
> > >* Update the original request.
> > >* Do not use blk_end_request() here, because it may complete
> > >* the original request before the clone, and break the ordering.
> > >*/
> > > - blk_update_request(tio->orig, BLK_STS_OK, nr_bytes);
> > > + if (is_last)
> > > + blk_update_request(tio->orig, BLK_STS_OK, tio->completed);
> > >  }
> > 
> > Partial completion support is important given the potential for path
> > failures interrupting requests.  Why do you think it is OK to avoid it
> > by switching to an all or nothing approach?
> 
> If the cloned rq is partially completed, this dm rq is still partially
> completed. This patch only updates the dm rq in the last cloned bio's
> .end_io().

Which is exactly what we do _not_ want because it removed partial
completion.  Which is an important feature of DM multipath.  Christoph
echoed its importance some time ago:
https://www.redhat.com/archives/dm-devel/2015-May/msg00228.html

> Also if one middle cloned bio is completed with an error, the current
> implementation doesn't update the dm rq any more from that bio, so it
> looks like the following patch is consistent with the current
> implementation; what do you think of it?

Not seeing how.  Yes, the current implementation will not account for a
part of the request that failed.  That is fine: the bio failed, so
there is nothing to update!

So no, I don't understand why you've added the 'exit' goto to update the
request on error.  If the request is to be failed upwards, it'll get
updated as a side-effect of that completion due to error (if multipath
is configured to fail_if_no_path or whatever).

Mike

> From 2b5ef7e64ba8595c0c58790d14fd77b10a28afee Mon Sep 17 00:00:00 2001
> From: Ming Lei 

Re: dm-rq: do not update rq partially in each ending bio

2017-08-25 Thread Ming Lei
On Fri, Aug 25, 2017 at 11:48:39AM -0400, Mike Snitzer wrote:
> On Fri, Aug 25 2017 at 11:27am -0400,
> Ming Lei  wrote:
> 
> > We don't need to update orignal dm request partially
> > when ending each cloned bio, and this patch just
> > updates orignal dm request once when the whole
> > cloned request is finished.
> > 
> > Partial request update can be a bit expensive, so
> > we should try to avoid it, especially it is run
> > in softirq context.
> > 
> > After this patch is applied, both hard lockup and
> > soft lockup aren't reproduced any more in one hour
> > of running Laurence's test[1] on IB/SRP. Without
> > this patch, the lockup can be reproduced in several
> > minutes.
> > 
> > BTW, after d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list()
> > rerun the queue at a quiet time), we need to make the
> > test more aggressive for reproducing the lockup:
> > 
> > 1) run hammer_write.sh 32 or 64 concurrently.
> > 2) write 8M each time
> > 
> > [1] https://marc.info/?l=linux-block=150220185510245=2
> 
> Bart said he cannot reproduce the lockups with his patchset applied.
> Have you tested using Bart's patchset?

d4acf3650c7c (block: Make blk_mq_delay_kick_requeue_list() rerun the
queue at a quiet time) is already in Linus' tree.

I haven't tested the other patches yet, because every time the lockup
is triggered it is always in blk_recalc_rq_segments(), and I don't see
any patch dealing with that.

> 
> Comments inlined below.
> 
> > ---
> >  drivers/md/dm-rq.c | 15 +--
> >  drivers/md/dm-rq.h |  1 +
> >  2 files changed, 6 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index c6ebc5b1e00e..50cd96c7de45 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> > @@ -117,9 +117,9 @@ static void end_clone_bio(struct bio *clone)
> > struct dm_rq_clone_bio_info *info =
> > container_of(clone, struct dm_rq_clone_bio_info, clone);
> > struct dm_rq_target_io *tio = info->tio;
> > -   struct bio *bio = info->orig;
> > unsigned int nr_bytes = info->orig->bi_iter.bi_size;
> > blk_status_t error = clone->bi_status;
> > +   bool is_last = !clone->bi_next;
> >  
> > bio_put(clone);
> >  
> > @@ -144,21 +144,15 @@ static void end_clone_bio(struct bio *clone)
> >  * I/O for the bio successfully completed.
> >  * Notice the data completion to the upper layer.
> >  */
> > -
> > -   /*
> > -* bios are processed from the head of the list.
> > -* So the completing bio should always be rq->bio.
> > -* If it's not, something wrong is happening.
> > -*/
> > -   if (tio->orig->bio != bio)
> > -   DMERR("bio completion is going in the middle of the request");
> 
> Why did you remove this check?

The check isn't valid any more because this patch only updates the
original dm rq in the .end_io() of the last cloned bio.

> 
> > +   tio->completed += nr_bytes;
> >  
> > /*
> >  * Update the original request.
> >  * Do not use blk_end_request() here, because it may complete
> >  * the original request before the clone, and break the ordering.
> >  */
> > -   blk_update_request(tio->orig, BLK_STS_OK, nr_bytes);
> > +   if (is_last)
> > +   blk_update_request(tio->orig, BLK_STS_OK, tio->completed);
> >  }
> 
> Partial completion support is important given the potential for path
> failures interrupting requests.  Why do you think it is OK to avoid it
> by switching to an all or nothing approach?

If the cloned rq is partially completed, this dm rq is still partially
completed. This patch only updates the dm rq in the last cloned bio's
.end_io().

Also if one middle cloned bio is completed with an error, the current
implementation doesn't update the dm rq any more from that bio, so it
looks like the following patch is consistent with the current
implementation; what do you think of it?

From 2b5ef7e64ba8595c0c58790d14fd77b10a28afee Mon Sep 17 00:00:00 2001
From: Ming Lei 
Date: Thu, 24 Aug 2017 20:19:52 +0800
Subject: [PATCH] dm-rq: do not update rq partially in each ending bio

We don't need to update the original dm request partially
when ending each cloned bio; this patch just updates the
original dm request once, when the whole cloned request
is finished.

A partial request update can be a bit expensive, so we
should try to avoid it, especially since it runs in
softirq context.

After this patch is applied, both hard lockup and
soft lockup aren't reproduced any more in one hour
of running Laurence's test[1] on IB/SRP. Without
this patch, the lockup can be reproduced in several
minutes.

BTW, after d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list()
rerun the queue at a quiet time), we need to make the
test more aggressive for reproducing the lockup:

1) run hammer_write.sh 32 or 64 concurrently.
2) write 8M each time

[1] https://marc.info/?l=linux-block&m=150220185510245&w=2

Cc: Laurence Oberman 
Cc: Bart Van Assche 

[GIT PULL] Block fixes for 4.13-rc

2017-08-25 Thread Jens Axboe
Hi Linus,

Small batch of fixes that should be included for the 4.13 release. This
pull request contains:

- Revert of the 4k loop blocksize support. Even with a recent batch of 4
  fixes, we're still not really happy with it. Rather than be stuck with
  an API issue, let's revert it and get it right for 4.14.

- Trivial patch from Bart, adding a few flags to the blk-mq debugfs
  exports that were added in this release, but not to the debugfs parts.

- Regression fix for bsg, fixing a potential kernel panic. From
  Benjamin.

- Tweak for the blk throttling, improving how we account discards. From
  Shaohua.

Please pull!


  git://git.kernel.dk/linux-block.git for-linus



Bart Van Assche (1):
  blk-mq-debugfs: Add names for recently added flags

Benjamin Block (1):
  bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer

Omar Sandoval (1):
  Revert "loop: support 4k physical blocksize"

Shaohua Li (1):
  blk-throttle: cap discard request size

 block/blk-mq-debugfs.c|  3 ++
 block/blk-throttle.c  | 18 +---
 block/bsg-lib.c   | 74 ---
 drivers/block/loop.c  | 42 ---
 drivers/block/loop.h  |  1 -
 include/linux/blkdev.h|  1 -
 include/linux/bsg-lib.h   |  2 ++
 include/uapi/linux/loop.h |  3 --
 8 files changed, 69 insertions(+), 75 deletions(-)

-- 
Jens Axboe



[PATCH] blkcg: avoid free blkcg_root when failed to alloc blkcg policy

2017-08-25 Thread weiping zhang
This patch fixes two errors: first, avoid kfree()ing blkcg_root; second,
do not free blkcg if the blkcg allocation failed (blkcg == NULL), just
unlock the mutex.

Signed-off-by: weiping zhang 
---
 block/blk-cgroup.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0480892..d3f56ba 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1067,7 +1067,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
if (!blkcg) {
ret = ERR_PTR(-ENOMEM);
-   goto free_blkcg;
+   goto unlock;
}
}
 
@@ -,8 +,10 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
for (i--; i >= 0; i--)
if (blkcg->cpd[i])
blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-free_blkcg:
-   kfree(blkcg);
+
+   if (blkcg != &blkcg_root)
+   kfree(blkcg);
+unlock:
	mutex_unlock(&blkcg_pol_mutex);
return ret;
 }
-- 
2.9.4



Re: dm-rq: do not update rq partially in each ending bio

2017-08-25 Thread Mike Snitzer
On Fri, Aug 25 2017 at 11:27am -0400,
Ming Lei  wrote:

> We don't need to update orignal dm request partially
> when ending each cloned bio, and this patch just
> updates orignal dm request once when the whole
> cloned request is finished.
> 
> Partial request update can be a bit expensive, so
> we should try to avoid it, especially it is run
> in softirq context.
> 
> After this patch is applied, both hard lockup and
> soft lockup aren't reproduced any more in one hour
> of running Laurence's test[1] on IB/SRP. Without
> this patch, the lockup can be reproduced in several
> minutes.
> 
> BTW, after d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list()
> rerun the queue at a quiet time), we need to make the
> test more aggressive for reproducing the lockup:
> 
>   1) run hammer_write.sh 32 or 64 concurrently.
>   2) write 8M each time
> 
> [1] https://marc.info/?l=linux-block=150220185510245=2

Bart said he cannot reproduce the lockups with his patchset applied.
Have you tested using Bart's patchset?

Comments inlined below.

> ---
>  drivers/md/dm-rq.c | 15 +--
>  drivers/md/dm-rq.h |  1 +
>  2 files changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index c6ebc5b1e00e..50cd96c7de45 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -117,9 +117,9 @@ static void end_clone_bio(struct bio *clone)
>   struct dm_rq_clone_bio_info *info =
>   container_of(clone, struct dm_rq_clone_bio_info, clone);
>   struct dm_rq_target_io *tio = info->tio;
> - struct bio *bio = info->orig;
>   unsigned int nr_bytes = info->orig->bi_iter.bi_size;
>   blk_status_t error = clone->bi_status;
> + bool is_last = !clone->bi_next;
>  
>   bio_put(clone);
>  
> @@ -144,21 +144,15 @@ static void end_clone_bio(struct bio *clone)
>* I/O for the bio successfully completed.
>* Notice the data completion to the upper layer.
>*/
> -
> - /*
> -  * bios are processed from the head of the list.
> -  * So the completing bio should always be rq->bio.
> -  * If it's not, something wrong is happening.
> -  */
> - if (tio->orig->bio != bio)
> - DMERR("bio completion is going in the middle of the request");

Why did you remove this check?

> + tio->completed += nr_bytes;
>  
>   /*
>* Update the original request.
>* Do not use blk_end_request() here, because it may complete
>* the original request before the clone, and break the ordering.
>*/
> - blk_update_request(tio->orig, BLK_STS_OK, nr_bytes);
> + if (is_last)
> + blk_update_request(tio->orig, BLK_STS_OK, tio->completed);
>  }

Partial completion support is important given the potential for path
failures interrupting requests.  Why do you think it is OK to avoid it
by switching to an all or nothing approach?

Mike


[PATCH] dm-rq: do not update rq partially in each ending bio

2017-08-25 Thread Ming Lei
We don't need to update the original dm request partially
when ending each cloned bio; this patch just updates the
original dm request once, when the whole cloned request
is finished.

A partial request update can be a bit expensive, so we
should try to avoid it, especially since it runs in
softirq context.

After this patch is applied, both hard lockup and
soft lockup aren't reproduced any more in one hour
of running Laurence's test[1] on IB/SRP. Without
this patch, the lockup can be reproduced in several
minutes.

BTW, after d4acf3650c7c(block: Make blk_mq_delay_kick_requeue_list()
rerun the queue at a quiet time), we need to make the
test more aggressive for reproducing the lockup:

1) run hammer_write.sh 32 or 64 concurrently.
2) write 8M each time

[1] https://marc.info/?l=linux-block&m=150220185510245&w=2

Cc: Laurence Oberman 
Cc: Bart Van Assche 
Cc: "dm-de...@redhat.com" 
Cc: Mike Snitzer 
Cc: Alasdair Kergon 
Signed-off-by: Ming Lei 
---
 drivers/md/dm-rq.c | 15 +--
 drivers/md/dm-rq.h |  1 +
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index c6ebc5b1e00e..50cd96c7de45 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -117,9 +117,9 @@ static void end_clone_bio(struct bio *clone)
struct dm_rq_clone_bio_info *info =
container_of(clone, struct dm_rq_clone_bio_info, clone);
struct dm_rq_target_io *tio = info->tio;
-   struct bio *bio = info->orig;
unsigned int nr_bytes = info->orig->bi_iter.bi_size;
blk_status_t error = clone->bi_status;
+   bool is_last = !clone->bi_next;
 
bio_put(clone);
 
@@ -144,21 +144,15 @@ static void end_clone_bio(struct bio *clone)
 * I/O for the bio successfully completed.
 * Notice the data completion to the upper layer.
 */
-
-   /*
-* bios are processed from the head of the list.
-* So the completing bio should always be rq->bio.
-* If it's not, something wrong is happening.
-*/
-   if (tio->orig->bio != bio)
-   DMERR("bio completion is going in the middle of the request");
+   tio->completed += nr_bytes;
 
/*
 * Update the original request.
 * Do not use blk_end_request() here, because it may complete
 * the original request before the clone, and break the ordering.
 */
-   blk_update_request(tio->orig, BLK_STS_OK, nr_bytes);
+   if (is_last)
+   blk_update_request(tio->orig, BLK_STS_OK, tio->completed);
 }
 
 static struct dm_rq_target_io *tio_from_request(struct request *rq)
@@ -455,6 +449,7 @@ static void init_tio(struct dm_rq_target_io *tio, struct request *rq,
tio->clone = NULL;
tio->orig = rq;
tio->error = 0;
+   tio->completed = 0;
/*
 * Avoid initializing info for blk-mq; it passes
 * target-specific data through info.ptr
diff --git a/drivers/md/dm-rq.h b/drivers/md/dm-rq.h
index 9813922e4fe5..f43c45460aac 100644
--- a/drivers/md/dm-rq.h
+++ b/drivers/md/dm-rq.h
@@ -29,6 +29,7 @@ struct dm_rq_target_io {
struct dm_stats_aux stats_aux;
unsigned long duration_jiffies;
unsigned n_sectors;
+   unsigned completed;
 };
 
 /*
-- 
2.9.5



Re: [PATCH v2] blk-mq-debugfs: Add names for recently added flags

2017-08-25 Thread Jens Axboe
On 08/18/2017 04:52 PM, Bart Van Assche wrote:
> The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED
> and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add these to
> blk-mq-debugfs.c such that these appear as names in debugfs instead of
> as numbers.

Added for 4.13, thanks.

-- 
Jens Axboe



[PATCH V6 10/12] mmc: block: Add CQE support

2017-08-25 Thread Adrian Hunter
Add CQE support to the block driver, including:
- optionally using DCMD for flush requests
- "manually" issuing discard requests
- issuing read / write requests to the CQE
- supporting block-layer timeouts
- handling recovery
- supporting re-tuning

CQE offers 25% - 50% better random multi-threaded I/O.  There is a slight
(e.g. 2%) drop in sequential read speed but no observable change to sequential
write.

CQE automatically sends the commands to complete requests.  However it only
supports reads / writes and so-called "direct commands" (DCMD).  Furthermore
DCMD is limited to one command at a time, but discards require 3 commands.
That makes issuing discards through CQE very awkward, and some CQEs don't
support DCMD anyway.  So for discards, the existing non-CQE approach is
taken, where the mmc core code issues the 3 commands one at a time via
mmc_erase().  DCMD, where it is used at all, is used for issuing flushes.
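
As a rough illustration of that routing (a hedged sketch only, not the patch's
code; classify_request() and the enum are invented names, and the real
classification lives in mmc_cqe_issue_type() mentioned below), the decision
boils down to:

#include <stdbool.h>

/* Illustrative issue classes for this sketch. */
enum issue_class { ISSUE_LEGACY, ISSUE_DCMD, ISSUE_CQE };

/*
 * Discards keep the legacy mmc_erase() path; flushes use DCMD when the
 * CQE supports it (falling back to a non-DCMD path is an assumption made
 * for this sketch); ordinary reads/writes go to the CQE.
 */
static enum issue_class classify_request(bool is_discard, bool is_flush,
					 bool cqe_has_dcmd)
{
	if (is_discard)
		return ISSUE_LEGACY;	/* needs 3 commands, awkward for DCMD */
	if (is_flush)
		return cqe_has_dcmd ? ISSUE_DCMD : ISSUE_LEGACY;
	return ISSUE_CQE;
}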

We need to start with the legacy block API because people want to backport
CQ to earlier kernels (we really need to get features upstream more
quickly), but blk-mq has been evolving a lot (e.g. elevator support), so
backporters face having either something quite different from upstream or
trying to backport great chunks of the block layer.

We also don't know how blk-mq will perform so it would be prudent to start
with support for both the legacy API and blk-mq (as scsi does) so that we
can find out first.

This implementation of CQE support continues to use a thread to fetch
requests from the block layer and issue them. This is slightly complicated
by the different ways that requests can be issued (refer to
mmc_cqe_issue_type() and mmc_blk_cqe_issue_rq()).  Allowance must also be
made for runtime PM and re-tuning, which cannot be performed while CQE is
active.

Block layer timeouts are useful for CQE. While CQE will detect a timeout
for a task, the block layer timeouts ensure that the driver will notice if
CQE itself gets stuck or somehow does not process a task.
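
A minimal way to picture that two-level timeout (purely illustrative; the
verdict names and helper below are invented for this sketch and are not the
driver's API):

#include <stdbool.h>

/* Invented verdicts standing in for the block layer's timeout outcomes. */
enum timeout_verdict { GIVE_MORE_TIME, COMPLETE_REQUEST, RUN_RECOVERY };

/*
 * The block-layer timeout fires, then we ask whether the CQE has in fact
 * already finished the task (the timeout was stale) or whether the CQE
 * itself is stuck and recovery is needed.
 */
static enum timeout_verdict blk_timeout_sketch(bool task_still_in_cqe,
					       bool cqe_needs_recovery)
{
	if (!task_still_in_cqe)
		return COMPLETE_REQUEST;	/* CQE already completed it */
	if (cqe_needs_recovery)
		return RUN_RECOVERY;		/* CQE stuck: recover, requeue */
	return GIVE_MORE_TIME;			/* task alive, reset the timer */
}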

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 197 +-
 drivers/mmc/core/block.h |   7 ++
 drivers/mmc/core/queue.c | 270 ++-
 drivers/mmc/core/queue.h |  43 +++-
 4 files changed, 510 insertions(+), 7 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index f9ead0701e8d..94dea83d51aa 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -109,6 +109,7 @@ struct mmc_blk_data {
 #define MMC_BLK_WRITE  BIT(1)
 #define MMC_BLK_DISCARDBIT(2)
 #define MMC_BLK_SECDISCARD BIT(3)
+#define MMC_BLK_CQE_RECOVERY   BIT(4)
 
/*
 * Only set in main mmc_blk_data associated
@@ -1659,6 +1660,200 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
*do_data_tag_p = do_data_tag;
 }
 
+#define MMC_CQE_RETRIES 2
+
+void mmc_blk_cqe_complete_rq(struct request *req)
+{
+   struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+   struct mmc_request *mrq = &mqrq->brq.mrq;
+   struct request_queue *q = req->q;
+   struct mmc_queue *mq = q->queuedata;
+   struct mmc_host *host = mq->card->host;
+   unsigned long flags;
+   bool put_card;
+   int err;
+
+   mmc_cqe_post_req(host, mrq);
+
+   spin_lock_irqsave(q->queue_lock, flags);
+
+   mq->cqe_in_flight[mmc_cqe_issue_type(host, req)] -= 1;
+
+   put_card = mmc_cqe_tot_in_flight(mq) == 0;
+
+   if (mrq->cmd && mrq->cmd->error)
+   err = mrq->cmd->error;
+   else if (mrq->data && mrq->data->error)
+   err = mrq->data->error;
+   else
+   err = 0;
+
+   if (err) {
+   if (mqrq->retries++ < MMC_CQE_RETRIES)
+   blk_requeue_request(q, req);
+   else
+   __blk_end_request_all(req, BLK_STS_IOERR);
+   } else if (mrq->data) {
+   if (__blk_end_request(req, BLK_STS_OK, mrq->data->bytes_xfered))
+   blk_requeue_request(q, req);
+   } else {
+   __blk_end_request_all(req, BLK_STS_OK);
+   }
+
+   mmc_cqe_kick_queue(mq);
+
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   if (put_card)
+   mmc_put_card(mq->card);
+}
+
+void mmc_blk_cqe_recovery(struct mmc_queue *mq)
+{
+   struct mmc_card *card = mq->card;
+   struct mmc_host *host = card->host;
+   int err;
+
+   mmc_get_card(card);
+
+   pr_debug("%s: CQE recovery start\n", mmc_hostname(host));
+
+   mq->cqe_in_recovery = true;
+
+   err = mmc_cqe_recovery(host);
+   if (err)
+   mmc_blk_reset(mq->blkdata, host, MMC_BLK_CQE_RECOVERY);
+   else
+   mmc_blk_reset_success(mq->blkdata, MMC_BLK_CQE_RECOVERY);
+
+   mq->cqe_in_recovery = false;
+
+   pr_debug("%s: CQE recovery done\n", 

[PATCH V6 11/12] mmc: cqhci: support for command queue enabled host

2017-08-25 Thread Adrian Hunter
From: Venkat Gopalakrishnan 

This patch adds CMDQ support for command-queue compatible
hosts.

Command queueing was added in the eMMC 5.1 specification. It
enables the controller to process up to 32 requests at
a time.

Adrian Hunter contributed renaming to cqhci, recovery, suspend
and resume, cqhci_off, cqhci_wait_for_idle, and external timeout
handling.

Signed-off-by: Asutosh Das 
Signed-off-by: Sujit Reddy Thumma 
Signed-off-by: Konstantin Dorfman 
Signed-off-by: Venkat Gopalakrishnan 
Signed-off-by: Subhash Jadavani 
Signed-off-by: Ritesh Harjani 
Signed-off-by: Adrian Hunter 
---
 drivers/mmc/host/Kconfig  |   13 +
 drivers/mmc/host/Makefile |1 +
 drivers/mmc/host/cqhci.c  | 1154 +
 drivers/mmc/host/cqhci.h  |  240 ++
 4 files changed, 1408 insertions(+)
 create mode 100644 drivers/mmc/host/cqhci.c
 create mode 100644 drivers/mmc/host/cqhci.h

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 60f90d49e7a9..ccf7dab4a9f8 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -843,6 +843,19 @@ config MMC_SUNXI
  This selects support for the SD/MMC Host Controller on
  Allwinner sunxi SoCs.
 
+config MMC_CQHCI
+   tristate "Command Queue Host Controller Interface support"
+   depends on HAS_DMA
+   help
+ This selects the Command Queue Host Controller Interface (CQHCI)
+ support present in host controllers of Qualcomm Technologies, Inc
+ amongst others.
+ This controller supports eMMC devices with command queue support.
+
+ If you have a controller with this interface, say Y or M here.
+
+ If unsure, say N.
+
 config MMC_TOSHIBA_PCI
tristate "Toshiba Type A SD/MMC Card Interface Driver"
depends on PCI
diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile
index 8c46766c000c..3ae71e006890 100644
--- a/drivers/mmc/host/Makefile
+++ b/drivers/mmc/host/Makefile
@@ -88,6 +88,7 @@ obj-$(CONFIG_MMC_SDHCI_MSM)   += sdhci-msm.o
 obj-$(CONFIG_MMC_SDHCI_ST) += sdhci-st.o
 obj-$(CONFIG_MMC_SDHCI_MICROCHIP_PIC32)+= sdhci-pic32.o
 obj-$(CONFIG_MMC_SDHCI_BRCMSTB)+= sdhci-brcmstb.o
+obj-$(CONFIG_MMC_CQHCI)+= cqhci.o
 
 ifeq ($(CONFIG_CB710_DEBUG),y)
CFLAGS-cb710-mmc+= -DDEBUG
diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
new file mode 100644
index ..eb3c1695b0c7
--- /dev/null
+++ b/drivers/mmc/host/cqhci.c
@@ -0,0 +1,1154 @@
+/* Copyright (c) 2015, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "cqhci.h"
+
+#define DCMD_SLOT 31
+#define NUM_SLOTS 32
+
+struct cqhci_slot {
+   struct mmc_request *mrq;
+   unsigned int flags;
+#define CQHCI_EXTERNAL_TIMEOUT BIT(0)
+#define CQHCI_COMPLETEDBIT(1)
+#define CQHCI_HOST_CRC BIT(2)
+#define CQHCI_HOST_TIMEOUT BIT(3)
+#define CQHCI_HOST_OTHER   BIT(4)
+};
+
+static inline u8 *get_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->desc_base + (tag * cq_host->slot_sz);
+}
+
+static inline u8 *get_link_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   u8 *desc = get_desc(cq_host, tag);
+
+   return desc + cq_host->task_desc_len;
+}
+
+static inline dma_addr_t get_trans_desc_dma(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->trans_desc_dma_base +
+   (cq_host->mmc->max_segs * tag *
+cq_host->trans_desc_len);
+}
+
+static inline u8 *get_trans_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->trans_desc_base +
+   (cq_host->trans_desc_len * cq_host->mmc->max_segs * tag);
+}
+
+static void setup_trans_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   u8 *link_temp;
+   dma_addr_t trans_temp;
+
+   link_temp = get_link_desc(cq_host, tag);
+   trans_temp = get_trans_desc_dma(cq_host, tag);
+
+   memset(link_temp, 0, cq_host->link_desc_len);
+   if (cq_host->link_desc_len > 8)
+   *(link_temp + 8) = 0;
+
+   if (tag == DCMD_SLOT && (cq_host->mmc->caps2 & MMC_CAP2_CQE_DCMD)) {
+   *link_temp = CQHCI_VALID(0) 

[PATCH V6 12/12] mmc: sdhci-pci: Add CQHCI support for Intel GLK

2017-08-25 Thread Adrian Hunter
Add CQHCI initialization and implement CQHCI operations for Intel GLK.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/host/Kconfig  |   1 +
 drivers/mmc/host/sdhci-pci-core.c | 154 +-
 2 files changed, 154 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index ccf7dab4a9f8..b8c9ea5cb8cd 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -81,6 +81,7 @@ config MMC_SDHCI_BIG_ENDIAN_32BIT_BYTE_SWAPPER
 config MMC_SDHCI_PCI
tristate "SDHCI support on PCI bus"
depends on MMC_SDHCI && PCI
+   select MMC_CQHCI
help
  This selects the PCI Secure Digital Host Controller Interface.
  Most controllers found today are PCI devices.
diff --git a/drivers/mmc/host/sdhci-pci-core.c 
b/drivers/mmc/host/sdhci-pci-core.c
index bbaddf18a1b3..e22075f99707 100644
--- a/drivers/mmc/host/sdhci-pci-core.c
+++ b/drivers/mmc/host/sdhci-pci-core.c
@@ -30,6 +30,8 @@
 #include 
 #include 
 
+#include "cqhci.h"
+
 #include "sdhci.h"
 #include "sdhci-pci.h"
 #include "sdhci-pci-o2micro.h"
@@ -117,6 +119,28 @@ int sdhci_pci_resume_host(struct sdhci_pci_chip *chip)
 
return 0;
 }
+
+static int sdhci_cqhci_suspend(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = cqhci_suspend(chip->slots[0]->host->mmc);
+   if (ret)
+   return ret;
+
+   return sdhci_pci_suspend_host(chip);
+}
+
+static int sdhci_cqhci_resume(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = sdhci_pci_resume_host(chip);
+   if (ret)
+   return ret;
+
+   return cqhci_resume(chip->slots[0]->host->mmc);
+}
 #endif
 
 #ifdef CONFIG_PM
@@ -167,8 +191,48 @@ static int sdhci_pci_runtime_resume_host(struct 
sdhci_pci_chip *chip)
 
return 0;
 }
+
+static int sdhci_cqhci_runtime_suspend(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = cqhci_suspend(chip->slots[0]->host->mmc);
+   if (ret)
+   return ret;
+
+   return sdhci_pci_runtime_suspend_host(chip);
+}
+
+static int sdhci_cqhci_runtime_resume(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = sdhci_pci_runtime_resume_host(chip);
+   if (ret)
+   return ret;
+
+   return cqhci_resume(chip->slots[0]->host->mmc);
+}
 #endif
 
+static u32 sdhci_cqhci_irq(struct sdhci_host *host, u32 intmask)
+{
+   int cmd_error = 0;
+   int data_error = 0;
+
+   if (!sdhci_cqe_irq(host, intmask, &cmd_error, &data_error))
+   return intmask;
+
+   cqhci_irq(host->mmc, intmask, cmd_error, data_error);
+
+   return 0;
+}
+
+static void sdhci_pci_dumpregs(struct mmc_host *mmc)
+{
+   sdhci_dumpregs(mmc_priv(mmc));
+}
+
 /*\
  *   *
  * Hardware specific quirk handling  *
@@ -567,6 +631,17 @@ static void intel_hs400_enhanced_strobe(struct mmc_host 
*mmc,
.hw_reset   = sdhci_pci_hw_reset,
 };
 
+static const struct sdhci_ops sdhci_intel_glk_ops = {
+   .set_clock  = sdhci_set_clock,
+   .set_power  = sdhci_intel_set_power,
+   .enable_dma = sdhci_pci_enable_dma,
+   .set_bus_width  = sdhci_set_bus_width,
+   .reset  = sdhci_reset,
+   .set_uhs_signaling  = sdhci_set_uhs_signaling,
+   .hw_reset   = sdhci_pci_hw_reset,
+   .irq= sdhci_cqhci_irq,
+};
+
 static void byt_read_dsm(struct sdhci_pci_slot *slot)
 {
struct intel_host *intel_host = sdhci_pci_priv(slot);
@@ -596,15 +671,83 @@ static int glk_emmc_probe_slot(struct sdhci_pci_slot 
*slot)
 {
int ret = byt_emmc_probe_slot(slot);
 
+   slot->host->mmc->caps2 |= MMC_CAP2_CQE;
+
if (slot->chip->pdev->device != PCI_DEVICE_ID_INTEL_GLK_EMMC) {
slot->host->mmc->caps2 |= MMC_CAP2_HS400_ES,
slot->host->mmc_host_ops.hs400_enhanced_strobe =
intel_hs400_enhanced_strobe;
+   slot->host->mmc->caps2 |= MMC_CAP2_CQE_DCMD;
}
 
return ret;
 }
 
+static void glk_cqe_enable(struct mmc_host *mmc)
+{
+   struct sdhci_host *host = mmc_priv(mmc);
+   u32 reg;
+
+   /*
+* CQE gets stuck if it sees Buffer Read Enable bit set, which can be
+* the case after tuning, so ensure the buffer is drained.
+*/
+   reg = sdhci_readl(host, SDHCI_PRESENT_STATE);
+   while (reg & SDHCI_DATA_AVAILABLE) {
+   sdhci_readl(host, SDHCI_BUFFER);
+   reg = sdhci_readl(host, SDHCI_PRESENT_STATE);
+   }
+
+   sdhci_cqe_enable(mmc);
+}
+
+static const struct cqhci_host_ops glk_cqhci_ops = {
+   .enable = glk_cqe_enable,
+   .disable

[PATCH V6 09/12] mmc: block: Prepare CQE data

2017-08-25 Thread Adrian Hunter
Enhance mmc_blk_data_prep() to support CQE requests. That means adding
some things that for non-CQE requests would be encoded into the command
arguments - such as the block address, reliable-write flag, and data tag
flag. Also the request tag is needed to provide the command queue task id,
and a comment is added to explain the future possibility of defining a
priority.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index cc950cb1cfd7..f9ead0701e8d 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1554,6 +1554,7 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
memset(brq, 0, sizeof(struct mmc_blk_request));
 
brq->mrq.data = &brq->data;
+   brq->mrq.tag = req->tag;
 
brq->stop.opcode = MMC_STOP_TRANSMISSION;
brq->stop.arg = 0;
@@ -1568,6 +1569,14 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
 
brq->data.blksz = 512;
brq->data.blocks = blk_rq_sectors(req);
+   brq->data.blk_addr = blk_rq_pos(req);
+
+   /*
+* The command queue supports 2 priorities: "high" (1) and "simple" (0).
+* The eMMC will give "high" priority tasks priority over "simple"
+* priority tasks. Here we always set "simple" priority by not setting
+* MMC_DATA_PRIO.
+*/
 
/*
 * The block layer doesn't support all sector count
@@ -1597,8 +1606,10 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
brq->data.blocks);
}
 
-   if (do_rel_wr)
+   if (do_rel_wr) {
mmc_apply_rel_rw(brq, card, req);
+   brq->data.flags |= MMC_DATA_REL_WR;
+   }
 
/*
 * Data tag is used only during writing meta data to speed
@@ -1610,6 +1621,9 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
  ((brq->data.blocks * brq->data.blksz) >=
   card->ext_csd.data_tag_unit_size);
 
+   if (do_data_tag)
+   brq->data.flags |= MMC_DATA_DAT_TAG;
+
mmc_set_data_timeout(&brq->data, card);
 
brq->data.sg = mqrq->sg;
-- 
1.9.1



[PATCH V6 08/12] mmc: block: Use local variables in mmc_blk_data_prep()

2017-08-25 Thread Adrian Hunter
Use local variables in mmc_blk_data_prep() in preparation for adding CQE
support which doesn't use the output variables.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 29 ++---
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 2f00d11adfa9..cc950cb1cfd7 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1534,21 +1534,22 @@ static enum mmc_blk_status mmc_blk_err_check(struct 
mmc_card *card,
 }
 
 static void mmc_blk_data_prep(struct mmc_queue *mq, struct mmc_queue_req *mqrq,
- int disable_multi, bool *do_rel_wr,
- bool *do_data_tag)
+ int disable_multi, bool *do_rel_wr_p,
+ bool *do_data_tag_p)
 {
struct mmc_blk_data *md = mq->blkdata;
struct mmc_card *card = md->queue.card;
struct mmc_blk_request *brq = &mqrq->brq;
struct request *req = mmc_queue_req_to_req(mqrq);
+   bool do_rel_wr, do_data_tag;
 
/*
 * Reliable writes are used to implement Forced Unit Access and
 * are supported only on MMCs.
 */
-   *do_rel_wr = (req->cmd_flags & REQ_FUA) &&
-rq_data_dir(req) == WRITE &&
-(md->flags & MMC_BLK_REL_WR);
+   do_rel_wr = (req->cmd_flags & REQ_FUA) &&
+   rq_data_dir(req) == WRITE &&
+   (md->flags & MMC_BLK_REL_WR);
 
memset(brq, 0, sizeof(struct mmc_blk_request));
 
@@ -1596,18 +1597,18 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
brq->data.blocks);
}
 
-   if (*do_rel_wr)
+   if (do_rel_wr)
mmc_apply_rel_rw(brq, card, req);
 
/*
 * Data tag is used only during writing meta data to speed
 * up write and any subsequent read of this meta data
 */
-   *do_data_tag = card->ext_csd.data_tag_unit_size &&
-  (req->cmd_flags & REQ_META) &&
-  (rq_data_dir(req) == WRITE) &&
-  ((brq->data.blocks * brq->data.blksz) >=
-   card->ext_csd.data_tag_unit_size);
+   do_data_tag = card->ext_csd.data_tag_unit_size &&
+ (req->cmd_flags & REQ_META) &&
+ (rq_data_dir(req) == WRITE) &&
+ ((brq->data.blocks * brq->data.blksz) >=
+  card->ext_csd.data_tag_unit_size);
 
mmc_set_data_timeout(&brq->data, card);
 
@@ -1636,6 +1637,12 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
mqrq->areq.mrq = &brq->mrq;
 
mmc_queue_bounce_pre(mqrq);
+
+   if (do_rel_wr_p)
+   *do_rel_wr_p = do_rel_wr;
+
+   if (do_data_tag_p)
+   *do_data_tag_p = do_data_tag;
 }
 
 static void mmc_blk_rw_rq_prep(struct mmc_queue_req *mqrq,
-- 
1.9.1



[PATCH V6 07/12] mmc: mmc: Enable CQE's

2017-08-25 Thread Adrian Hunter
Enable or disable CQE when a card is added or removed respectively.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/bus.c |  7 +++
 drivers/mmc/core/mmc.c | 12 
 2 files changed, 19 insertions(+)

diff --git a/drivers/mmc/core/bus.c b/drivers/mmc/core/bus.c
index 301246513a37..a4b49e25fe96 100644
--- a/drivers/mmc/core/bus.c
+++ b/drivers/mmc/core/bus.c
@@ -369,10 +369,17 @@ int mmc_add_card(struct mmc_card *card)
  */
 void mmc_remove_card(struct mmc_card *card)
 {
+   struct mmc_host *host = card->host;
+
 #ifdef CONFIG_DEBUG_FS
mmc_remove_card_debugfs(card);
 #endif
 
+   if (host->cqe_enabled) {
+   host->cqe_ops->cqe_disable(host);
+   host->cqe_enabled = false;
+   }
+
if (mmc_card_present(card)) {
if (mmc_host_is_spi(card->host)) {
pr_info("%s: SPI card removed\n",
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index 194cf2082e73..0dcbd25126a1 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1807,6 +1807,18 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
 */
card->reenable_cmdq = card->ext_csd.cmdq_en;
 
+   if (card->ext_csd.cmdq_en && !host->cqe_enabled) {
+   err = host->cqe_ops->cqe_enable(host, card);
+   if (err) {
+   pr_err("%s: Failed to enable CQE, error %d\n",
+   mmc_hostname(host), err);
+   } else {
+   host->cqe_enabled = true;
+   pr_info("%s: Command Queue Engine enabled\n",
+   mmc_hostname(host));
+   }
+   }
+
if (!oldcard)
host->card = card;
 
-- 
1.9.1



[PATCH V6 06/12] mmc: mmc: Enable Command Queuing

2017-08-25 Thread Adrian Hunter
Enable the Command Queue if the host controller supports a command queue
engine. It is not compatible with Packed Commands, so make a note of that in the
comment.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/mmc.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index a7eb623f8daa..194cf2082e73 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1784,6 +1784,23 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
}
 
/*
+* Enable Command Queue if supported. Note that Packed Commands cannot
+* be used with Command Queue.
+*/
+   card->ext_csd.cmdq_en = false;
+   if (card->ext_csd.cmdq_support && host->caps2 & MMC_CAP2_CQE) {
+   err = mmc_cmdq_enable(card);
+   if (err && err != -EBADMSG)
+   goto free_card;
+   if (err) {
+   pr_warn("%s: Enabling CMDQ failed\n",
+   mmc_hostname(card->host));
+   card->ext_csd.cmdq_support = false;
+   card->ext_csd.cmdq_depth = 0;
+   err = 0;
+   }
+   }
+   /*
 * In some cases (e.g. RPMB or mmc_test), the Command Queue must be
 * disabled for a time, so a flag is needed to indicate to re-enable the
 * Command Queue.
-- 
1.9.1



[PATCH V6 05/12] mmc: core: Add support for handling CQE requests

2017-08-25 Thread Adrian Hunter
Add core support for handling CQE requests, including starting, completing
and recovering.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/core.c  | 163 +--
 drivers/mmc/core/core.h  |   4 ++
 include/linux/mmc/host.h |   2 +
 3 files changed, 164 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 66c9cf49ad2f..cc146e1dd2e3 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -266,7 +266,8 @@ static void __mmc_start_request(struct mmc_host *host, 
struct mmc_request *mrq)
host->ops->request(host, mrq);
 }
 
-static void mmc_mrq_pr_debug(struct mmc_host *host, struct mmc_request *mrq)
+static void mmc_mrq_pr_debug(struct mmc_host *host, struct mmc_request *mrq,
+bool cqe)
 {
if (mrq->sbc) {
pr_debug("<%s: starting CMD%u arg %08x flags %08x>\n",
@@ -275,9 +276,12 @@ static void mmc_mrq_pr_debug(struct mmc_host *host, struct 
mmc_request *mrq)
}
 
if (mrq->cmd) {
-   pr_debug("%s: starting CMD%u arg %08x flags %08x\n",
-mmc_hostname(host), mrq->cmd->opcode, mrq->cmd->arg,
-mrq->cmd->flags);
+   pr_debug("%s: starting %sCMD%u arg %08x flags %08x\n",
+mmc_hostname(host), cqe ? "CQE direct " : "",
+mrq->cmd->opcode, mrq->cmd->arg, mrq->cmd->flags);
+   } else if (cqe) {
+   pr_debug("%s: starting CQE transfer for tag %d blkaddr %u\n",
+mmc_hostname(host), mrq->tag, mrq->data->blk_addr);
}
 
if (mrq->data) {
@@ -342,7 +346,7 @@ static int mmc_start_request(struct mmc_host *host, struct 
mmc_request *mrq)
if (mmc_card_removed(host->card))
return -ENOMEDIUM;
 
-   mmc_mrq_pr_debug(host, mrq);
+   mmc_mrq_pr_debug(host, mrq, false);
 
WARN_ON(!host->claimed);
 
@@ -482,6 +486,155 @@ void mmc_wait_for_req_done(struct mmc_host *host, struct 
mmc_request *mrq)
 }
 EXPORT_SYMBOL(mmc_wait_for_req_done);
 
+/*
+ * mmc_cqe_start_req - Start a CQE request.
+ * @host: MMC host to start the request
+ * @mrq: request to start
+ *
+ * Start the request, re-tuning if needed and it is possible. Returns an error
+ * code if the request fails to start or -EBUSY if CQE is busy.
+ */
+int mmc_cqe_start_req(struct mmc_host *host, struct mmc_request *mrq)
+{
+   int err;
+
+   /*
+* CQE cannot process re-tuning commands. Caller must hold retuning
+* while CQE is in use.  Re-tuning can happen here only when CQE has no
+* active requests i.e. this is the first.  Note, re-tuning will call
+* ->cqe_off().
+*/
+   err = mmc_retune(host);
+   if (err)
+   goto out_err;
+
+   mrq->host = host;
+
+   mmc_mrq_pr_debug(host, mrq, true);
+
+   err = mmc_mrq_prep(host, mrq);
+   if (err)
+   goto out_err;
+
+   err = host->cqe_ops->cqe_request(host, mrq);
+   if (err)
+   goto out_err;
+
+   trace_mmc_request_start(host, mrq);
+
+   return 0;
+
+out_err:
+   if (mrq->cmd) {
+   pr_debug("%s: failed to start CQE direct CMD%u, error %d\n",
+mmc_hostname(host), mrq->cmd->opcode, err);
+   } else {
+   pr_debug("%s: failed to start CQE transfer for tag %d, error 
%d\n",
+mmc_hostname(host), mrq->tag, err);
+   }
+   return err;
+}
+EXPORT_SYMBOL(mmc_cqe_start_req);
+
+/**
+ * mmc_cqe_request_done - CQE has finished processing an MMC request
+ * @host: MMC host which completed request
+ * @mrq: MMC request which completed
+ *
+ * CQE drivers should call this function when they have completed
+ * their processing of a request.
+ */
+void mmc_cqe_request_done(struct mmc_host *host, struct mmc_request *mrq)
+{
+   mmc_should_fail_request(host, mrq);
+
+   /* Flag re-tuning needed on CRC errors */
+   if ((mrq->cmd && mrq->cmd->error == -EILSEQ) ||
+   (mrq->data && mrq->data->error == -EILSEQ))
+   mmc_retune_needed(host);
+
+   trace_mmc_request_done(host, mrq);
+
+   if (mrq->cmd) {
+   pr_debug("%s: CQE req done (direct CMD%u): %d\n",
+mmc_hostname(host), mrq->cmd->opcode, mrq->cmd->error);
+   } else {
+   pr_debug("%s: CQE transfer done tag %d\n",
+mmc_hostname(host), mrq->tag);
+   }
+
+   if (mrq->data) {
+   pr_debug("%s: %d bytes transferred: %d\n",
+mmc_hostname(host),
+mrq->data->bytes_xfered, mrq->data->error);
+   }
+
+   mrq->done(mrq);
+}
+EXPORT_SYMBOL(mmc_cqe_request_done);
+
+/**
+ * mmc_cqe_post_req - CQE post process of a completed MMC request
+ * @host: MMC host
+ * @mrq: MMC 

[PATCH V6 03/12] mmc: host: Add CQE interface

2017-08-25 Thread Adrian Hunter
Add CQE host operations, capabilities, and host members.

Signed-off-by: Adrian Hunter 
---
 include/linux/mmc/core.h |  6 ++
 include/linux/mmc/host.h | 53 
 2 files changed, 59 insertions(+)

diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
index 178f699ac172..927519385482 100644
--- a/include/linux/mmc/core.h
+++ b/include/linux/mmc/core.h
@@ -156,6 +156,12 @@ struct mmc_request {
struct completion   completion;
struct completion   cmd_completion;
void(*done)(struct mmc_request *);/* completion 
function */
+   /*
+* Notify upper layers (e.g. mmc block driver) that recovery is needed
+* due to an error associated with the mmc_request. Currently used only
+* by CQE.
+*/
+   void(*recovery_notifier)(struct mmc_request *);
struct mmc_host *host;
 
/* Allow other commands during this ongoing data transfer or busy wait 
*/
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index e92629518f68..f3f2d07feb2a 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -162,6 +162,50 @@ struct mmc_host_ops {
  unsigned int direction, int blk_size);
 };
 
+struct mmc_cqe_ops {
+   /* Allocate resources, and make the CQE operational */
+   int (*cqe_enable)(struct mmc_host *host, struct mmc_card *card);
+   /* Free resources, and make the CQE non-operational */
+   void(*cqe_disable)(struct mmc_host *host);
+   /*
+* Issue a read, write or DCMD request to the CQE. Also deal with the
+* effect of ->cqe_off().
+*/
+   int (*cqe_request)(struct mmc_host *host, struct mmc_request *mrq);
+   /* Free resources (e.g. DMA mapping) associated with the request */
+   void(*cqe_post_req)(struct mmc_host *host, struct mmc_request *mrq);
+   /*
+* Prepare the CQE and host controller to accept non-CQ commands. There
+* is no corresponding ->cqe_on(), instead ->cqe_request() is required
+* to deal with that.
+*/
+   void(*cqe_off)(struct mmc_host *host);
+   /*
+* Wait for all CQE tasks to complete. Return an error if recovery
+* becomes necessary.
+*/
+   int (*cqe_wait_for_idle)(struct mmc_host *host);
+   /*
+* Notify CQE that a request has timed out. Return false if the request
+* completed or true if a timeout happened in which case indicate if
+* recovery is needed.
+*/
+   bool(*cqe_timeout)(struct mmc_host *host, struct mmc_request *mrq,
+  bool *recovery_needed);
+   /*
+* Stop all CQE activity and prepare the CQE and host controller to
+* accept recovery commands.
+*/
+   void(*cqe_recovery_start)(struct mmc_host *host);
+   /*
+* Clear the queue and call mmc_cqe_request_done() on all requests.
+* Requests that errored will have the error set on the mmc_request
+* (data->error or cmd->error for DCMD).  Requests that did not error
+* will have zero data bytes transferred.
+*/
+   void(*cqe_recovery_finish)(struct mmc_host *host);
+};
+
 struct mmc_async_req {
/* active mmc request */
struct mmc_request  *mrq;
@@ -303,6 +347,8 @@ struct mmc_host {
 #define MMC_CAP2_HS400_ES  (1 << 20)   /* Host supports enhanced 
strobe */
 #define MMC_CAP2_NO_SD (1 << 21)   /* Do not send SD commands 
during initialization */
 #define MMC_CAP2_NO_MMC(1 << 22)   /* Do not send (e)MMC 
commands during initialization */
+#define MMC_CAP2_CQE   (1 << 23)   /* Has eMMC command queue 
engine */
+#define MMC_CAP2_CQE_DCMD  (1 << 24)   /* CQE can issue a direct 
command */
 
mmc_pm_flag_t   pm_caps;/* supported pm features */
 
@@ -386,6 +432,13 @@ struct mmc_host {
int dsr_req;/* DSR value is valid */
u32 dsr;/* optional driver stage (DSR) value */
 
+   /* Command Queue Engine (CQE) support */
+   const struct mmc_cqe_ops *cqe_ops;
+   void*cqe_private;
+   int cqe_qdepth;
+   boolcqe_enabled;
+   boolcqe_on;
+
unsigned long   private[0] cacheline_aligned;
 };
 
-- 
1.9.1



[PATCH V6 04/12] mmc: core: Turn off CQE before sending commands

2017-08-25 Thread Adrian Hunter
CQE needs to be off for the host controller to accept non-CQ commands. Turn
off the CQE before sending commands, and ensure it is off in any reset or
power management paths, or re-tuning.

Signed-off-by: Adrian Hunter 
Reviewed-by: Linus Walleij 
Signed-off-by: Ulf Hansson 
---
 drivers/mmc/core/core.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 5dd1c00d95f5..66c9cf49ad2f 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -260,6 +260,9 @@ static void __mmc_start_request(struct mmc_host *host, 
struct mmc_request *mrq)
 
trace_mmc_request_start(host, mrq);
 
+   if (host->cqe_on)
+   host->cqe_ops->cqe_off(host);
+
host->ops->request(host, mrq);
 }
 
@@ -979,6 +982,9 @@ int mmc_execute_tuning(struct mmc_card *card)
if (!host->ops->execute_tuning)
return 0;
 
+   if (host->cqe_on)
+   host->cqe_ops->cqe_off(host);
+
if (mmc_card_mmc(card))
opcode = MMC_SEND_TUNING_BLOCK_HS200;
else
@@ -1018,6 +1024,9 @@ void mmc_set_bus_width(struct mmc_host *host, unsigned 
int width)
  */
 void mmc_set_initial_state(struct mmc_host *host)
 {
+   if (host->cqe_on)
+   host->cqe_ops->cqe_off(host);
+
mmc_retune_disable(host);
 
if (mmc_host_is_spi(host))
-- 
1.9.1



[PATCH V6 02/12] mmc: block: Fix block status codes

2017-08-25 Thread Adrian Hunter
Commit 2a842acab109 ("block: introduce new block status code type") changed
the error type, but patches merged through the mmc tree, like
commit 0493f6fe5bde ("mmc: block: Move boot partition locking into a driver
op"), were not converted. Fix one error code that is incorrect and also use
BLK_STS_OK in preference to 0.
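
For readers unfamiliar with the new type, here is a tiny self-contained sketch
of the conversion the hunk below performs; blk_status_t and the BLK_STS_*
values are local stand-ins defined for this sketch, not the kernel's headers:

/* Local stand-ins so this compiles outside the kernel. */
typedef unsigned char blk_status_t;
#define BLK_STS_OK	((blk_status_t)0)
#define BLK_STS_IOERR	((blk_status_t)10)	/* value chosen arbitrarily here */

/* A driver-level errno must be mapped explicitly, never passed through. */
static blk_status_t drv_result_to_blk_status(int ret)
{
	return ret ? BLK_STS_IOERR : BLK_STS_OK;
}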

Fixes: 17ece345a042 ("Merge tag 'mmc-v4.13' of 
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc")
Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 0eebc2f726c3..2f00d11adfa9 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1223,7 +1223,7 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, 
struct request *req)
break;
}
mq_rq->drv_op_result = ret;
-   blk_end_request_all(req, ret);
+   blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
 }
 
 static void mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req)
@@ -1728,9 +1728,9 @@ static bool mmc_blk_rw_cmd_err(struct mmc_blk_data *md, 
struct mmc_card *card,
if (err)
req_pending = old_req_pending;
else
-   req_pending = blk_end_request(req, 0, blocks << 9);
+   req_pending = blk_end_request(req, BLK_STS_OK, blocks 
<< 9);
} else {
-   req_pending = blk_end_request(req, 0, brq->data.bytes_xfered);
+   req_pending = blk_end_request(req, BLK_STS_OK, 
brq->data.bytes_xfered);
}
return req_pending;
 }
-- 
1.9.1



[PATCH V6 01/12] mmc: core: Move mmc_start_areq() declaration

2017-08-25 Thread Adrian Hunter
mmc_start_areq() is an internal mmc core API. Move the declaration
accordingly.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/core.h  | 6 ++
 include/linux/mmc/core.h | 4 
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 55f543fd37c4..ca861091a776 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -107,6 +107,12 @@ static inline void mmc_unregister_pm_notifier(struct 
mmc_host *host) { }
 void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq);
 bool mmc_is_req_done(struct mmc_host *host, struct mmc_request *mrq);
 
+struct mmc_async_req;
+
+struct mmc_async_req *mmc_start_areq(struct mmc_host *host,
+struct mmc_async_req *areq,
+enum mmc_blk_status *ret_stat);
+
 int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr,
unsigned int arg);
 int mmc_can_erase(struct mmc_card *card);
diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
index bf1788a224e6..178f699ac172 100644
--- a/include/linux/mmc/core.h
+++ b/include/linux/mmc/core.h
@@ -165,11 +165,7 @@ struct mmc_request {
 };
 
 struct mmc_card;
-struct mmc_async_req;
 
-struct mmc_async_req *mmc_start_areq(struct mmc_host *host,
-   struct mmc_async_req *areq,
-   enum mmc_blk_status *ret_stat);
 void mmc_wait_for_req(struct mmc_host *host, struct mmc_request *mrq);
 int mmc_wait_for_cmd(struct mmc_host *host, struct mmc_command *cmd,
int retries);
-- 
1.9.1



[PATCH V6 00/12] mmc: Add Command Queue support

2017-08-25 Thread Adrian Hunter
Hi

Here is V6 of the hardware command queue patches without the software
command queue patches.  Patches "mmc: host: Add CQE interface" and
"mmc: core: Turn off CQE before sending commands" have been applied to Ulf's
next branch.  "mmc: host: Add CQE interface" needs to be dropped in favour
of the new version.

HW CMDQ offers 25% - 50% better random multi-threaded I/O.  I see a slight
2% drop in sequential read speed but no change to sequential write.

We need to start with the legacy block API because people want to backport
CQ to earlier kernels (we really need to get features upstream more
quickly), but blk-mq has been evolving a lot (e.g. elevator support), so
backporters face having either something quite different from upstream or
trying to backport great chunks of the block layer.

We also don't know how blk-mq will perform so it would be prudent to start
with support for both the legacy API and blk-mq (as scsi does) so that we
can find out first.

RFC patches to support blk-mq can be found here:
https://marc.info/?l=linux-block&m=150349582124880


Changes since V5:
Re-based
  mmc: core: Add mmc_retune_hold_now()
Dropped because it has been applied
  mmc: core: Add members to mmc_request and mmc_data for CQE's
Dropped because it has been applied
  mmc: core: Move mmc_start_areq() declaration
New patch at Ulf's request
  mmc: block: Fix block status codes
Another un-related patch
  mmc: host: Add CQE interface
Move recovery_notifier() callback to struct mmc_request
  mmc: core: Add support for handling CQE requests
Roll __mmc_cqe_request_done() into mmc_cqe_request_done()
Move function declarations requested by Ulf
  mmc: core: Remove unused MMC_CAP2_PACKED_CMD
Dropped because it has been applied
  mmc: block: Add CQE support
Add explanation to commit message
Adjustment for changed recovery_notifier() callback
  mmc: cqhci: support for command queue enabled host
Adjustment for changed recovery_notifier() callback
  mmc: sdhci-pci: Add CQHCI support for Intel GLK
Add DCMD capability for Intel controllers except GLK

Changes since V4:
  mmc: core: Add mmc_retune_hold_now()
Add explanation to commit message.
  mmc: host: Add CQE interface
Add comments to callback declarations.
  mmc: core: Turn off CQE before sending commands
Add explanation to commit message.
  mmc: core: Add support for handling CQE requests
Add comments as requested by Ulf.
  mmc: core: Remove unused MMC_CAP2_PACKED_CMD
New patch.
  mmc: mmc: Enable Command Queuing
Adjust for removal of MMC_CAP2_PACKED_CMD.
Add a comment about Packed Commands.
  mmc: mmc: Enable CQE's
Remove un-necessary check for MMC_CAP2_CQE
  mmc: block: Use local variables in mmc_blk_data_prep()
New patch.
  mmc: block: Prepare CQE data
Adjust due to "mmc: block: Use local variables in mmc_blk_data_prep()"
Remove priority setting.
Add explanation to commit message.
  mmc: cqhci: support for command queue enabled host
Fix transfer descriptor setting in cqhci_set_tran_desc() for 32-bit DMA

Changes since V3:
Adjusted ...blk_end_request...() for new block status codes
Fixed CQHCI transaction descriptor for "no DCMD" case

Changes since V2:
Dropped patches that have been applied.
Re-based
Added "mmc: sdhci-pci: Add CQHCI support for Intel GLK"

Changes since V1:

"Share mmc request array between partitions" is dependent
on changes in "Introduce queue semantics", so added that
and block fixes:

Added "Fix is_waiting_last_req set incorrectly"
Added "Fix cmd error reset failure path"
Added "Use local var for mqrq_cur"
Added "Introduce queue semantics"

Changes since RFC:

Re-based on next.
Added comment about command queue priority.
Added some acks and reviews.


Adrian Hunter (11):
  mmc: core: Move mmc_start_areq() declaration
  mmc: block: Fix block status codes
  mmc: host: Add CQE interface
  mmc: core: Turn off CQE before sending commands
  mmc: core: Add support for handling CQE requests
  mmc: mmc: Enable Command Queuing
  mmc: mmc: Enable CQE's
  mmc: block: Use local variables in mmc_blk_data_prep()
  mmc: block: Prepare CQE data
  mmc: block: Add CQE support
  mmc: sdhci-pci: Add CQHCI support for Intel GLK

Venkat Gopalakrishnan (1):
  mmc: cqhci: support for command queue enabled host

 drivers/mmc/core/block.c  |  246 +++-
 drivers/mmc/core/block.h  |7 +
 drivers/mmc/core/bus.c|7 +
 drivers/mmc/core/core.c   |  172 +-
 drivers/mmc/core/core.h   |   10 +
 drivers/mmc/core/mmc.c|   29 +
 drivers/mmc/core/queue.c  |  270 

Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-08-25 Thread Ming Lei
On Thu, Aug 24, 2017 at 02:38:36PM +0800, Ming Lei wrote:
> On Wed, Aug 23, 2017 at 01:56:50PM -0600, Jens Axboe wrote:
> > On Sat, Aug 05 2017, Ming Lei wrote:
> > > During dispatching, we moved all requests from hctx->dispatch to
> > > one temporary list, then dispatch them one by one from this list.
> > > Unfortunately during this period, run queue from other contexts
> > > may think the queue is idle, then start to dequeue from sw/scheduler
> > > queue and still try to dispatch because ->dispatch is empty. This way
> > > hurts sequential I/O performance because requests are dequeued when
> > > lld queue is busy.
> > > 
> > > This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> > > make sure that request isn't dequeued until ->dispatch is
> > > flushed.
> > 
> > I don't like how this patch introduces a bunch of locked setting of a
> > flag under the hctx lock. Especially since I think we can easily avoid
> > it.
> 
> Actually the lock isn't needed for setting the flag, will move it out
> in V3.

My fault, it looks like we can't move it out of the lock: the newly
added rqs can be flushed, and the bit cleared along with them, in the
window between adding the list to ->dispatch and setting
BLK_MQ_S_DISPATCH_BUSY; the bit set afterwards is then never cleared
and an I/O hang results.

> 
> > 
> > > - } else if (!has_sched_dispatch && !q->queue_depth) {
> > > + blk_mq_dispatch_rq_list(q, _list);
> > > +
> > > + /*
> > > +  * We may clear DISPATCH_BUSY just after it
> > > +  * is set from another context, the only cost
> > > +  * is that one request is dequeued a bit early,
> > > +  * we can survive that. Given the window is
> > > +  * too small, no need to worry about performance
> > > +  * effect.
> > > +  */
> > > + if (list_empty_careful(&hctx->dispatch))
> > > + clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > 
> > This is basically the only place where we modify it without holding the
> > hctx lock. Can we move it into blk_mq_dispatch_rq_list()? The list is
> 
> The problem is that blk_mq_dispatch_rq_list() don't know if it is
> handling requests from hctx->dispatch or sw/scheduler queue. We only
> need to clear the bit after hctx->dispatch is flushed. So the clearing
> can't be moved into blk_mq_dispatch_rq_list().
> 
> > generally empty, unless for the case where we splice residuals back. If
> > we splice them back, we grab the lock anyway.
> > 
> > The other places it's set under the hctx lock, yet we end up using an
> > atomic operation to do it.

In theory it is better to hold the lock while clearing the bit, but
that costs one extra lock acquisition whether or not we move it into
blk_mq_dispatch_rq_list().

We can move the clear_bit() into blk_mq_dispatch_rq_list() and
pass a parameter indicating whether it is handling requests
from ->dispatch; the following code is then needed at
the end of blk_mq_dispatch_rq_list():

	if (list_empty(list)) {
		if (rq_from_dispatch_list) {
			spin_lock(&hctx->lock);
			if (list_empty_careful(&hctx->dispatch))
				clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
			spin_unlock(&hctx->lock);
		}
	}

If we clear the bit locklessly, the BUSY bit may be cleared early and
requests may be dequeued early; that is acceptable because the race
window is so small.
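
A userspace sketch of the ordering being discussed (a pthread mutex and a
plain flag stand in for hctx->lock and BLK_MQ_S_DISPATCH_BUSY; the helper
names are invented for illustration): the flag has to be set in the same
critical section that adds requests to the dispatch list, and cleared only
while the list is observed empty under the lock.

#include <pthread.h>
#include <stdbool.h>

struct hctx_sketch {
	pthread_mutex_t lock;		/* stands in for hctx->lock */
	int dispatch_len;		/* stands in for hctx->dispatch */
	bool dispatch_busy;		/* stands in for BLK_MQ_S_DISPATCH_BUSY */
};

/* Requeue path: add residuals and mark busy in one critical section. */
static void requeue_residuals(struct hctx_sketch *h, int nr)
{
	pthread_mutex_lock(&h->lock);
	h->dispatch_len += nr;
	h->dispatch_busy = true;	/* set before the lock is dropped */
	pthread_mutex_unlock(&h->lock);
}

/* Flush path: clear busy only if the list is still empty under the lock. */
static void dispatch_flushed(struct hctx_sketch *h, int nr_flushed)
{
	pthread_mutex_lock(&h->lock);
	h->dispatch_len -= nr_flushed;
	if (h->dispatch_len == 0)
		h->dispatch_busy = false;
	pthread_mutex_unlock(&h->lock);
}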

-- 
Ming


Re: [PATCH 2/4] fs: support IOCB_NOWAIT in generic_file_buffered_read

2017-08-25 Thread Christoph Hellwig
On Thu, Aug 24, 2017 at 04:31:41PM +0200, Jan Kara wrote:
> And wait_on_page_locked_killable() above does not seem to be handled in
> your patch. I would just check IOCB_NOWAIT in the !PageUptodate(page) branch
> above and bail - which also removes the need for the two checks below...

Yes, that makes sense.
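
A hedged sketch of the control flow being suggested (the flag value and the
function name below are stand-ins for illustration, not the kernel's
definitions): when the page is not uptodate and the caller asked for
IOCB_NOWAIT, bail out early instead of sleeping in
wait_on_page_locked_killable()/lock_page_killable().

#include <stdbool.h>

#define IOCB_NOWAIT_SKETCH	(1u << 7)	/* illustrative bit only */

/*
 * Returns true when the buffered-read path should return -EAGAIN early
 * rather than wait for the page to become uptodate.
 */
static bool nowait_should_bail(unsigned int ki_flags, bool page_uptodate)
{
	return !page_uptodate && (ki_flags & IOCB_NOWAIT_SKETCH);
}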