Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-15 Thread Jens Axboe
On 12/15/2016 12:29 PM, Omar Sandoval wrote:
> Hey, Jens, a couple of minor nits below.
> 
> One bigger note: adding the blk_mq_sched_*() helpers and keeping the
> blk_mq_*() helpers that they replaced seems risky. For example,
> blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
> we kept blk_mq_free_request(). There are definitely some codepaths that
> are still using blk_mq_free_request() that are now wrong
> (__nvme_submit_user_cmd() is the most obvious one I saw).
> 
> Can we get rid of the old, non-sched functions? Or maybe even make old
> interface do the sched stuff instead of adding blk_mq_sched_*()?

I fixed this up. blk_mq_free_request() is now what we still use as an
exported interface, and that just calls blk_mq_sched_put_request().  The
old (__)blk_mq_free_request() are now (__)blk_mq_finish_request() and
internal only.
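
A rough sketch of the shape described above (illustration only, not the actual
follow-up commit):

	void blk_mq_free_request(struct request *rq)
	{
		/* exported name is unchanged; the sched helper picks the real path */
		blk_mq_sched_put_request(rq);
	}
	EXPORT_SYMBOL_GPL(blk_mq_free_request);

	/* the old pair, renamed and now internal to blk-mq */
	void blk_mq_finish_request(struct request *rq);
	void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
				     struct request *rq);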

-- 
Jens Axboe



Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-15 Thread Jens Axboe
On Thu, Dec 15 2016, Omar Sandoval wrote:
> Hey, Jens, a couple of minor nits below.
> 
> One bigger note: adding the blk_mq_sched_*() helpers and keeping the
> blk_mq_*() helpers that they replaced seems risky. For example,
> blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
> we kept blk_mq_free_request(). There are definitely some codepaths that
> are still using blk_mq_free_request() that are now wrong
> (__nvme_submit_user_cmd() is the most obvious one I saw).
> 
> Can we get rid of the old, non-sched functions? Or maybe even make old
> interface do the sched stuff instead of adding blk_mq_sched_*()?

You are right, that is a concern. I'll think of some ways to lessen that
risk, it's much better to have build time breakage than runtime.

> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 8d1cec8e25d1..d10a246a3bc7 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2267,10 +2229,10 @@ static int blk_mq_queue_reinit_dead(unsigned int cpu)
> >   * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
> >   * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
> >   *
> > - * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
> > - * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> > - * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
> > - * is ignored.
> > + * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
> > + * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> > + * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
> > + * ignored.
> >   */
> 
> This belongs in patch 4 where flush_busy_ctxs() got renamed.

Fixed, thanks!

> >  static int blk_mq_queue_reinit_prepare(unsigned int cpu)
> >  {
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index e59f5ca520a2..a5ddc860b220 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -47,6 +47,9 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
> >   */
> >  void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> > bool at_head);
> > +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> > +   struct list_head *list);
> > +void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);
> 
> Looks like the declaration of blk_mq_process_sw_list() survived a rebase
> even though it doesn't exist anymore.

Indeed, now killed.

-- 
Jens Axboe



Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-15 Thread Omar Sandoval
Hey, Jens, a couple of minor nits below.

One bigger note: adding the blk_mq_sched_*() helpers and keeping the
blk_mq_*() helpers that they replaced seems risky. For example,
blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
we kept blk_mq_free_request(). There are definitely some codepaths that
are still using blk_mq_free_request() that are now wrong
(__nvme_submit_user_cmd() is the most obvious one I saw).

Can we get rid of the old, non-sched functions? Or maybe even make old
interface do the sched stuff instead of adding blk_mq_sched_*()?

On Wed, Dec 14, 2016 at 10:26:06PM -0700, Jens Axboe wrote:
> Signed-off-by: Jens Axboe 
> ---
>  block/Makefile   |   2 +-
>  block/blk-core.c |   7 +-
>  block/blk-exec.c |   3 +-
>  block/blk-flush.c|   7 +-
>  block/blk-mq-sched.c | 375 +++
>  block/blk-mq-sched.h | 190 
>  block/blk-mq-tag.c   |   1 +
>  block/blk-mq.c   | 192 ++--
>  block/blk-mq.h   |   3 +
>  block/elevator.c | 186 +--
>  include/linux/blk-mq.h   |   3 +-
>  include/linux/elevator.h |  29 
>  12 files changed, 833 insertions(+), 165 deletions(-)
>  create mode 100644 block/blk-mq-sched.c
>  create mode 100644 block/blk-mq-sched.h
> 

[snip]

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8d1cec8e25d1..d10a246a3bc7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2267,10 +2229,10 @@ static int blk_mq_queue_reinit_dead(unsigned int cpu)
>   * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
>   * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
>   *
> - * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
> - * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> - * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
> - * is ignored.
> + * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
> + * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> + * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
> + * ignored.
>   */

This belongs in patch 4 where flush_busy_ctxs() got renamed.

>  static int blk_mq_queue_reinit_prepare(unsigned int cpu)
>  {
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index e59f5ca520a2..a5ddc860b220 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -47,6 +47,9 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
>   */
>  void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   bool at_head);
> +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> + struct list_head *list);
> +void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);

Looks like the declaration of blk_mq_process_sw_list() survived a rebase
even though it doesn't exist anymore.

[snip]


[PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-14 Thread Jens Axboe
Signed-off-by: Jens Axboe 
---
 block/Makefile   |   2 +-
 block/blk-core.c |   7 +-
 block/blk-exec.c |   3 +-
 block/blk-flush.c|   7 +-
 block/blk-mq-sched.c | 375 +++
 block/blk-mq-sched.h | 190 
 block/blk-mq-tag.c   |   1 +
 block/blk-mq.c   | 192 ++--
 block/blk-mq.h   |   3 +
 block/elevator.c | 186 +--
 include/linux/blk-mq.h   |   3 +-
 include/linux/elevator.h |  29 
 12 files changed, 833 insertions(+), 165 deletions(-)
 create mode 100644 block/blk-mq-sched.c
 create mode 100644 block/blk-mq-sched.h

diff --git a/block/Makefile b/block/Makefile
index a827f988c4e6..2eee9e1bb6db 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
-   blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
+   blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 92baea07acbc..cb1e864cb23d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-mq-sched.h"
 #include "blk-wbt.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@@ -1413,7 +1414,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
return;
 
if (q->mq_ops) {
-   blk_mq_free_request(req);
+   blk_mq_sched_put_request(req);
return;
}
 
@@ -1449,7 +1450,7 @@ void blk_put_request(struct request *req)
struct request_queue *q = req->q;
 
if (q->mq_ops)
-   blk_mq_free_request(req);
+   blk_mq_sched_put_request(req);
else {
unsigned long flags;
 
@@ -2127,7 +2128,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
if (q->mq_ops) {
if (blk_queue_io_stat(q))
blk_account_io_start(rq, true);
-   blk_mq_insert_request(rq, false, true, false);
+   blk_mq_sched_insert_request(rq, false, true, false);
return 0;
}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..86656fdfa637 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -9,6 +9,7 @@
 #include 
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 /*
  * for max sense size
@@ -65,7 +66,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 * be reused after dying flag is set
 */
if (q->mq_ops) {
-   blk_mq_insert_request(rq, at_head, true, false);
+   blk_mq_sched_insert_request(rq, at_head, true, false);
return;
}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 20b7c7a02f1c..6a7c29d2eb3c 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,6 +74,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
 
 /* FLUSH/FUA sequences */
 enum {
@@ -453,9 +454,9 @@ void blk_insert_flush(struct request *rq)
 */
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-   if (q->mq_ops) {
-   blk_mq_insert_request(rq, false, true, false);
-   } else
+   if (q->mq_ops)
+   blk_mq_sched_insert_request(rq, false, true, false);
+   else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
}
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
new file mode 100644
index ..02ad17258666
--- /dev/null
+++ b/block/blk-mq-sched.c
@@ -0,0 +1,375 @@
+#include 
+#include 
+#include 
+
+#include 
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-wbt.h"
+
+/*
+ * Empty set
+ */
+static const struct blk_mq_ops mq_sched_tag_ops = {
+};
+
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
+{
+   blk_mq_free_rq_map(NULL, tags, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
+   unsigned int numa_node)
+{
+   struct blk_mq_tag_set set = {
+   .ops= &mq_sched_tag_ops,
+   .nr_hw_queues   = 1,
+   .queue_depth= depth,
+   .numa_node  = numa_node,
+   };
+


Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-14 Thread Jens Axboe
On 12/14/2016 03:31 AM, Bart Van Assche wrote:
> On 12/13/2016 04:14 PM, Jens Axboe wrote:
>> On 12/13/2016 06:56 AM, Bart Van Assche wrote:
>>> On 12/08/2016 09:13 PM, Jens Axboe wrote:
 +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
 +struct blk_mq_alloc_data *data,
 +struct blk_mq_tags *tags,
 +atomic_t *wait_index)
 +{
>>>
>>> Using the word "shadow" in the function name suggests to me that there 
>>> is a shadow request for every request and a request for every shadow 
>>> request. However, my understanding from the code is that there can be 
>>> requests without shadow requests (for e.g. a flush) and shadow requests 
>>> without requests. Shouldn't the name of this function reflect that, e.g. 
>>> by using "sched" or "elv" in the function name instead of "shadow"?
>>
>> Shadow might not be the best name. Most do have shadows though, it's
>> only the rare exception like the flush, that you mention. I'll see if I
>> can come up with a better name.
> 
> Hello Jens,
> 
> One aspect of this patch series that might turn out to be a maintenance
> burden is the copying between original and shadow requests. It is easy
> to overlook that rq_copy() has to be updated if a field would ever be
> added to struct request. Additionally, having to allocate two requests
> structures per I/O instead of one will have a runtime overhead. Do you
> think the following approach would work?
> - Instead of using two request structures per I/O, only use a single
>   request structure.
> - Instead of storing one tag in the request structure, store two tags
>   in that structure. One tag comes from the I/O scheduler tag set
>   (size: nr_requests) and the other from the tag set associated with
>   the block driver (size: HBA queue depth).
> - Only add a request to the hctx dispatch list after a block driver tag
>   has been assigned. This means that an I/O scheduler must keep a
>   request structure on a list it manages itself as long as no block
>   driver tag has been assigned.
> - sysfs_list_show() is modified such that it shows both tags.

I have considered doing exactly that, decided to go down the other path.
I may still revisit, it's not that I'm a huge fan of the shadow requests
and the necessary copying. We don't update the request that often, so I
don't think it's going to be a big maintenance burden. But it'd be hard
to claim that it's super pretty...

I'll play with the idea.

-- 
Jens Axboe



Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-14 Thread Bart Van Assche
On 12/13/2016 04:14 PM, Jens Axboe wrote:
> On 12/13/2016 06:56 AM, Bart Van Assche wrote:
>> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>>> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
>>> + struct blk_mq_alloc_data *data,
>>> + struct blk_mq_tags *tags,
>>> + atomic_t *wait_index)
>>> +{
>>
>> Using the word "shadow" in the function name suggests to me that there 
>> is a shadow request for every request and a request for every shadow 
>> request. However, my understanding from the code is that there can be 
>> requests without shadow requests (for e.g. a flush) and shadow requests 
>> without requests. Shouldn't the name of this function reflect that, e.g. 
>> by using "sched" or "elv" in the function name instead of "shadow"?
> 
> Shadow might not be the best name. Most do have shadows though, it's
> only the rare exception like the flush, that you mention. I'll see if I
> can come up with a better name.

Hello Jens,

One aspect of this patch series that might turn out to be a maintenance
burden is the copying between original and shadow requests. It is easy
to overlook that rq_copy() has to be updated if a field would ever be
added to struct request. Additionally, having to allocate two requests
structures per I/O instead of one will have a runtime overhead. Do you
think the following approach would work?
- Instead of using two request structures per I/O, only use a single
  request structure.
- Instead of storing one tag in the request structure, store two tags
  in that structure. One tag comes from the I/O scheduler tag set
  (size: nr_requests) and the other from the tag set associated with
  the block driver (size: HBA queue depth).
- Only add a request to the hctx dispatch list after a block driver tag
  has been assigned. This means that an I/O scheduler must keep a
  request structure on a list it manages itself as long as no block
  driver tag has been assigned.
- sysfs_list_show() is modified such that it shows both tags.
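
Purely to illustrate that two-tag idea (a standalone sketch; the names below
are made up for this example, they are not from the series):

	/*
	 * One request object, two tags: the scheduler tag comes from the
	 * nr_requests-sized tag set and is always valid while the scheduler
	 * owns the request; the driver tag comes from the HBA-queue-depth
	 * tag set and stays -1 until it is assigned.
	 */
	struct two_tag_request {
		int sched_tag;
		int driver_tag;
	};

	/* only requests that already hold a driver tag may go on the hctx dispatch list */
	static bool ready_for_dispatch(const struct two_tag_request *rq)
	{
		return rq->driver_tag != -1;
	}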

Thanks,

Bart.


Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-13 Thread Jens Axboe
On Tue, Dec 13 2016, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
> >+static inline void blk_mq_sched_put_request(struct request *rq)
> >+{
> >+struct request_queue *q = rq->q;
> >+struct elevator_queue *e = q->elevator;
> >+
> >+if (e && e->type->mq_ops.put_request)
> >+e->type->mq_ops.put_request(rq);
> >+else
> >+blk_mq_free_request(rq);
> >+}
> 
> blk_mq_free_request() always triggers a call of blk_queue_exit().
> dd_put_request() only triggers a call of blk_queue_exit() if it is not a
> shadow request. Is that on purpose?

If the scheduler doesn't define get/put requests, then the lifetime
follows the normal setup. If we do define them, then dd_put_request()
only wants to put the request if it's one where we did setup a shadow.

> >+static inline struct request *
> >+blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
> >+ struct blk_mq_alloc_data *data)
> >+{
> >+struct elevator_queue *e = q->elevator;
> >+struct blk_mq_hw_ctx *hctx;
> >+struct blk_mq_ctx *ctx;
> >+struct request *rq;
> >+
> >+blk_queue_enter_live(q);
> >+ctx = blk_mq_get_ctx(q);
> >+hctx = blk_mq_map_queue(q, ctx->cpu);
> >+
> >+blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> >+
> >+if (e && e->type->mq_ops.get_request)
> >+rq = e->type->mq_ops.get_request(q, op, data);
> >+else
> >+rq = __blk_mq_alloc_request(data, op);
> >+
> >+if (rq)
> >+data->hctx->queued++;
> >+
> >+return rq;
> >+
> >+}
> 
> Some but not all callers of blk_mq_sched_get_request() call blk_queue_exit()
> if this function returns NULL. Please consider to move the blk_queue_exit()
> call from the blk_mq_alloc_request() error path into this function. I think
> that will make it a lot easier to verify whether or not the
> blk_queue_enter() / blk_queue_exit() calls are balanced properly.

Agree, I'll make the change, it'll be easier to read then.
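
Presumably something along these lines (a sketch of the agreed change only, not
an actual follow-up patch): blk_mq_sched_get_request() drops the queue reference
itself when allocation fails, so callers no longer need an error path:

	if (e && e->type->mq_ops.get_request)
		rq = e->type->mq_ops.get_request(q, op, data);
	else
		rq = __blk_mq_alloc_request(data, op);

	if (rq)
		data->hctx->queued++;
	else
		blk_queue_exit(q);	/* balance the blk_queue_enter_live() above */

	return rq;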

> Additionally, since blk_queue_enter() / blk_queue_exit() calls by
> blk_mq_sched_get_request() and blk_mq_sched_put_request() must be balanced
> and since the latter function only calls blk_queue_exit() for non-shadow
> requests, shouldn't blk_mq_sched_get_request() call blk_queue_enter_live()
> only if __blk_mq_alloc_request() is called?

I'll double check that part, there might be a bug or at least a chance
to clean this up a bit. I did verify most of this at some point, and
tested it with the scheduler switching. That part falls apart pretty
quickly, if the references aren't matched exactly.

-- 
Jens Axboe



Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-13 Thread Jens Axboe
On 12/13/2016 06:56 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +/*
>> + * Empty set
>> + */
>> +static struct blk_mq_ops mq_sched_tag_ops = {
>> +.queue_rq   = NULL,
>> +};
> 
> Hello Jens,
> 
> Would "static struct blk_mq_ops mq_sched_tag_ops;" have been sufficient? 
> Can this data structure be declared 'const' if the blk_mq_ops pointers 
> in struct blk_mq_tag_set and struct request_queue are also declared const?

Yes, the static should be enough to ensure that it's all zeroes. I did
have this as const, but then realized I'd have to change a few other
places too. I'll make that change, hopefully it'll just work out.
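
The zero-initialization point, in a minimal userspace form (illustration only,
not kernel code):

	#include <stdio.h>

	struct ops {
		void (*queue_rq)(void);
	};

	/* no initializer needed: a static object is zero-filled, so queue_rq is NULL */
	static struct ops empty_ops;

	int main(void)
	{
		printf("%d\n", empty_ops.queue_rq == NULL);	/* prints 1 */
		return 0;
	}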

>> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
>> +  struct blk_mq_alloc_data *data,
>> +  struct blk_mq_tags *tags,
>> +  atomic_t *wait_index)
>> +{
> 
> Using the word "shadow" in the function name suggests to me that there 
> is a shadow request for every request and a request for every shadow 
> request. However, my understanding from the code is that there can be 
> requests without shadow requests (for e.g. a flush) and shadow requests 
> without requests. Shouldn't the name of this function reflect that, e.g. 
> by using "sched" or "elv" in the function name instead of "shadow"?

Shadow might not be the best name. Most do have shadows though, it's
only the rare exception like the flush, that you mention. I'll see if I
can come up with a better name.

>> +struct request *
>> +blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
>> + struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
> 
> This function dequeues a request from the I/O scheduler queue, allocates 
> a request, copies the relevant request structure members into that 
> request and makes the request refer to the shadow request. Isn't the 
> request dispatching more important than associating the request with the 
> shadow request? If so, how about making the function name reflect that?

Sure, I can update the naming. Will need to anyway, if we get rid of the
shadow naming.

>> +{
>> +struct blk_mq_alloc_data data;
>> +struct request *sched_rq, *rq;
>> +
>> +data.q = hctx->queue;
>> +data.flags = BLK_MQ_REQ_NOWAIT;
>> +data.ctx = blk_mq_get_ctx(hctx->queue);
>> +data.hctx = hctx;
>> +
>> +rq = __blk_mq_alloc_request(&data, 0);
>> +blk_mq_put_ctx(data.ctx);
>> +
>> +if (!rq) {
>> +blk_mq_stop_hw_queue(hctx);
>> +return NULL;
>> +}
>> +
>> +sched_rq = get_sched_rq(hctx);
>> +
>> +if (!sched_rq) {
>> +blk_queue_enter_live(hctx->queue);
>> +__blk_mq_free_request(hctx, data.ctx, rq);
>> +return NULL;
>> +}
> 
> The mq deadline scheduler calls this function with get_sched_rq == 
> __dd_dispatch_request. If __blk_mq_alloc_request() fails, shouldn't the 
> request that was removed from the scheduler queue be pushed back onto 
> that queue? Additionally, are you sure it's necessary to call 
> blk_queue_enter_live() from the error path?

If __blk_mq_alloc_request() fails, we haven't pulled a request from the
scheduler yet. The extra ref is needed because __blk_mq_alloc_request()
doesn't get a ref on the request, however __blk_mq_free_request() does
put one.
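
The same quoted error path as above, with the reasoning spelled out as comments
(the comments are added here for clarity, they are not in the patch):

	sched_rq = get_sched_rq(hctx);

	if (!sched_rq) {
		/*
		 * __blk_mq_alloc_request() above did not take a queue
		 * reference of its own, but __blk_mq_free_request() drops
		 * one, so take an extra reference before freeing to keep
		 * the enter/exit counts balanced.
		 */
		blk_queue_enter_live(hctx->queue);
		__blk_mq_free_request(hctx, data.ctx, rq);
		return NULL;
	}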

-- 
Jens Axboe



Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-13 Thread Bart Van Assche

On 12/08/2016 09:13 PM, Jens Axboe wrote:

+static inline void blk_mq_sched_put_request(struct request *rq)
+{
+   struct request_queue *q = rq->q;
+   struct elevator_queue *e = q->elevator;
+
+   if (e && e->type->mq_ops.put_request)
+   e->type->mq_ops.put_request(rq);
+   else
+   blk_mq_free_request(rq);
+}


blk_mq_free_request() always triggers a call of blk_queue_exit(). 
dd_put_request() only triggers a call of blk_queue_exit() if it is not a 
shadow request. Is that on purpose?



+static inline struct request *
+blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
+struct blk_mq_alloc_data *data)
+{
+   struct elevator_queue *e = q->elevator;
+   struct blk_mq_hw_ctx *hctx;
+   struct blk_mq_ctx *ctx;
+   struct request *rq;
+
+   blk_queue_enter_live(q);
+   ctx = blk_mq_get_ctx(q);
+   hctx = blk_mq_map_queue(q, ctx->cpu);
+
+   blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+
+   if (e && e->type->mq_ops.get_request)
+   rq = e->type->mq_ops.get_request(q, op, data);
+   else
+   rq = __blk_mq_alloc_request(data, op);
+
+   if (rq)
+   data->hctx->queued++;
+
+   return rq;
+
+}


Some but not all callers of blk_mq_sched_get_request() call 
blk_queue_exit() if this function returns NULL. Please consider to move 
the blk_queue_exit() call from the blk_mq_alloc_request() error path 
into this function. I think that will make it a lot easier to verify 
whether or not the blk_queue_enter() / blk_queue_exit() calls are 
balanced properly.


Additionally, since blk_queue_enter() / blk_queue_exit() calls by 
blk_mq_sched_get_request() and blk_mq_sched_put_request() must be 
balanced and since the latter function only calls blk_queue_exit() for 
non-shadow requests, shouldn't blk_mq_sched_get_request() call 
blk_queue_enter_live() only if __blk_mq_alloc_request() is called?


Thanks,

Bart.


Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-13 Thread Bart Van Assche

On 12/08/2016 09:13 PM, Jens Axboe wrote:

+/*
+ * Empty set
+ */
+static struct blk_mq_ops mq_sched_tag_ops = {
+   .queue_rq   = NULL,
+};


Hello Jens,

Would "static struct blk_mq_ops mq_sched_tag_ops;" have been sufficient? 
Can this data structure be declared 'const' if the blk_mq_ops pointers 
in struct blk_mq_tag_set and struct request_queue are also declared const?



+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+ struct blk_mq_alloc_data *data,
+ struct blk_mq_tags *tags,
+ atomic_t *wait_index)
+{


Using the word "shadow" in the function name suggests to me that there 
is a shadow request for every request and a request for every shadow 
request. However, my understanding from the code is that there can be 
requests without shadow requests (for e.g. a flush) and shadow requests 
without requests. Shouldn't the name of this function reflect that, e.g. 
by using "sched" or "elv" in the function name instead of "shadow"?



+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))


This function dequeues a request from the I/O scheduler queue, allocates 
a request, copies the relevant request structure members into that 
request and makes the request refer to the shadow request. Isn't the 
request dispatching more important than associating the request with the 
shadow request? If so, how about making the function name reflect that?



+{
+   struct blk_mq_alloc_data data;
+   struct request *sched_rq, *rq;
+
+   data.q = hctx->queue;
+   data.flags = BLK_MQ_REQ_NOWAIT;
+   data.ctx = blk_mq_get_ctx(hctx->queue);
+   data.hctx = hctx;
+
+   rq = __blk_mq_alloc_request(&data, 0);
+   blk_mq_put_ctx(data.ctx);
+
+   if (!rq) {
+   blk_mq_stop_hw_queue(hctx);
+   return NULL;
+   }
+
+   sched_rq = get_sched_rq(hctx);
+
+   if (!sched_rq) {
+   blk_queue_enter_live(hctx->queue);
+   __blk_mq_free_request(hctx, data.ctx, rq);
+   return NULL;
+   }


The mq deadline scheduler calls this function with get_sched_rq == 
__dd_dispatch_request. If __blk_mq_alloc_request() fails, shouldn't the 
request that was removed from the scheduler queue be pushed back onto 
that queue? Additionally, are you sure it's necessary to call 
blk_queue_enter_live() from the error path?


Bart.


[PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-08 Thread Jens Axboe
Signed-off-by: Jens Axboe 
---
 block/Makefile   |   2 +-
 block/blk-core.c |   9 +-
 block/blk-exec.c |   3 +-
 block/blk-flush.c|   7 +-
 block/blk-merge.c|   3 +
 block/blk-mq-sched.c | 246 +++
 block/blk-mq-sched.h | 187 +++
 block/blk-mq-tag.c   |   1 +
 block/blk-mq.c   | 150 +++--
 block/blk-mq.h   |  34 +++
 block/elevator.c | 181 ++
 include/linux/blk-mq.h   |   2 +-
 include/linux/elevator.h |  29 +-
 13 files changed, 713 insertions(+), 141 deletions(-)
 create mode 100644 block/blk-mq-sched.c
 create mode 100644 block/blk-mq-sched.h

diff --git a/block/Makefile b/block/Makefile
index a827f988c4e6..2eee9e1bb6db 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
-   blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
+   blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b7ec5958055..3f83414d6986 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-mq-sched.h"
 #include "blk-wbt.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@@ -1428,7 +1429,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
return;
 
if (q->mq_ops) {
-   blk_mq_free_request(req);
+   blk_mq_sched_put_request(req);
return;
}
 
@@ -1464,7 +1465,7 @@ void blk_put_request(struct request *req)
struct request_queue *q = req->q;
 
if (q->mq_ops)
-   blk_mq_free_request(req);
+   blk_mq_sched_put_request(req);
else {
unsigned long flags;
 
@@ -1528,6 +1529,7 @@ bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
blk_account_io_start(req, false);
return true;
 }
+EXPORT_SYMBOL_GPL(bio_attempt_back_merge);
 
 bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
 struct bio *bio)
@@ -1552,6 +1554,7 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
blk_account_io_start(req, false);
return true;
 }
+EXPORT_SYMBOL_GPL(bio_attempt_front_merge);
 
 /**
  * blk_attempt_plug_merge - try to merge with %current's plugged list
@@ -2173,7 +2176,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
if (q->mq_ops) {
if (blk_queue_io_stat(q))
blk_account_io_start(rq, true);
-   blk_mq_insert_request(rq, false, true, false);
+   blk_mq_sched_insert_request(rq, false, true, false);
return 0;
}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..86656fdfa637 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -9,6 +9,7 @@
 #include 
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 /*
  * for max sense size
@@ -65,7 +66,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 * be reused after dying flag is set
 */
if (q->mq_ops) {
-   blk_mq_insert_request(rq, at_head, true, false);
+   blk_mq_sched_insert_request(rq, at_head, true, false);
return;
}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 27a42dab5a36..63b91697d167 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,6 +74,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
 
 /* FLUSH/FUA sequences */
 enum {
@@ -425,9 +426,9 @@ void blk_insert_flush(struct request *rq)
 */
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-   if (q->mq_ops) {
-   blk_mq_insert_request(rq, false, true, false);
-   } else
+   if (q->mq_ops)
+   blk_mq_sched_insert_request(rq, false, true, false);
+   else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1002afdfee99..01247812e13f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -766,6 +766,7 @@ int attempt_back_merge(struct request_queue *q, struct request *rq)
 
return 0;
 }

[PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-08 Thread Jens Axboe
Signed-off-by: Jens Axboe 
---
 block/Makefile   |   2 +-
 block/blk-core.c |   9 +-
 block/blk-exec.c |   3 +-
 block/blk-flush.c|   7 +-
 block/blk-merge.c|   3 +
 block/blk-mq-sched.c | 246 +++
 block/blk-mq-sched.h | 187 +++
 block/blk-mq-tag.c   |   1 +
 block/blk-mq.c   | 150 +++--
 block/blk-mq.h   |  34 +++
 block/elevator.c | 181 ++
 include/linux/blk-mq.h   |   2 +-
 include/linux/elevator.h |  29 +-
 13 files changed, 713 insertions(+), 141 deletions(-)
 create mode 100644 block/blk-mq-sched.c
 create mode 100644 block/blk-mq-sched.h

diff --git a/block/Makefile b/block/Makefile
index a827f988c4e6..2eee9e1bb6db 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o 
blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
-   blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
+   blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b7ec5958055..3f83414d6986 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-mq-sched.h"
 #include "blk-wbt.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@@ -1428,7 +1429,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
return;
 
if (q->mq_ops) {
-   blk_mq_free_request(req);
+   blk_mq_sched_put_request(req);
return;
}
 
@@ -1464,7 +1465,7 @@ void blk_put_request(struct request *req)
struct request_queue *q = req->q;
 
if (q->mq_ops)
-   blk_mq_free_request(req);
+   blk_mq_sched_put_request(req);
else {
unsigned long flags;
 
@@ -1528,6 +1529,7 @@ bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
blk_account_io_start(req, false);
return true;
 }
+EXPORT_SYMBOL_GPL(bio_attempt_back_merge);
 
 bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
 struct bio *bio)
@@ -1552,6 +1554,7 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
blk_account_io_start(req, false);
return true;
 }
+EXPORT_SYMBOL_GPL(bio_attempt_front_merge);
 
 /**
  * blk_attempt_plug_merge - try to merge with %current's plugged list
@@ -2173,7 +2176,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
if (q->mq_ops) {
if (blk_queue_io_stat(q))
blk_account_io_start(rq, true);
-   blk_mq_insert_request(rq, false, true, false);
+   blk_mq_sched_insert_request(rq, false, true, false);
return 0;
}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..86656fdfa637 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -9,6 +9,7 @@
 #include 
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 /*
  * for max sense size
@@ -65,7 +66,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 * be reused after dying flag is set
 */
if (q->mq_ops) {
-   blk_mq_insert_request(rq, at_head, true, false);
+   blk_mq_sched_insert_request(rq, at_head, true, false);
return;
}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 27a42dab5a36..63b91697d167 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,6 +74,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
 
 /* FLUSH/FUA sequences */
 enum {
@@ -425,9 +426,9 @@ void blk_insert_flush(struct request *rq)
 */
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-   if (q->mq_ops) {
-   blk_mq_insert_request(rq, false, true, false);
-   } else
+   if (q->mq_ops)
+   blk_mq_sched_insert_request(rq, false, true, false);
+   else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1002afdfee99..01247812e13f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -766,6 +766,7 @@ int attempt_back_merge(struct request_queue *q, struct request *rq)
 
return 0;
 }

[PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-07 Thread Jens Axboe
Signed-off-by: Jens Axboe 
---
 block/blk-mq-sched.c | 243 +++
 block/blk-mq-sched.h | 168 +++
 2 files changed, 411 insertions(+)
 create mode 100644 block/blk-mq-sched.c
 create mode 100644 block/blk-mq-sched.h

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
new file mode 100644
index ..8317b26990f8
--- /dev/null
+++ b/block/blk-mq-sched.c
@@ -0,0 +1,243 @@
+#include 
+#include 
+
+#include 
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-wbt.h"
+
+/*
+ * Empty set
+ */
+static struct blk_mq_ops mq_sched_tag_ops = {
+   .queue_rq   = NULL,
+};
+
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
+{
+   blk_mq_free_rq_map(NULL, tags, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
+   unsigned int numa_node)
+{
+   struct blk_mq_tag_set set = {
+   .ops= &mq_sched_tag_ops,
+   .nr_hw_queues   = 1,
+   .queue_depth= depth,
+   .numa_node  = numa_node,
+   };
+
+   return blk_mq_init_rq_map(&set, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_requests);
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+void (*exit)(struct blk_mq_hw_ctx *))
+{
+   struct blk_mq_hw_ctx *hctx;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (exit)
+   exit(hctx);
+   kfree(hctx->sched_data);
+   hctx->sched_data = NULL;
+   }
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+   void (*init)(struct blk_mq_hw_ctx *))
+{
+   struct blk_mq_hw_ctx *hctx;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
+   if (!hctx->sched_data)
+   goto error;
+
+   if (init)
+   init(hctx);
+   }
+
+   return 0;
+error:
+   blk_mq_sched_free_hctx_data(q, NULL);
+   return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
+
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+ struct blk_mq_alloc_data *data,
+ struct blk_mq_tags *tags,
+ atomic_t *wait_index)
+{
+   struct sbq_wait_state *ws;
+   DEFINE_WAIT(wait);
+   struct request *rq;
+   int tag;
+
+   tag = __sbitmap_queue_get(&tags->bitmap_tags);
+   if (tag != -1)
+   goto done;
+
+   if (data->flags & BLK_MQ_REQ_NOWAIT)
+   return NULL;
+
+   ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+   do {
+   prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+   tag = __sbitmap_queue_get(&tags->bitmap_tags);
+   if (tag != -1)
+   break;
+
+   blk_mq_run_hw_queue(data->hctx, false);
+
+   tag = __sbitmap_queue_get(&tags->bitmap_tags);
+   if (tag != -1)
+   break;
+
+   blk_mq_put_ctx(data->ctx);
+   io_schedule();
+
+   data->ctx = blk_mq_get_ctx(data->q);
+   data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
+   finish_wait(&ws->wait, &wait);
+   ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+   } while (1);
+
+   finish_wait(&ws->wait, &wait);
+done:
+   rq = tags->rqs[tag];
+   rq->tag = tag;
+   return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_shadow_request);
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+ struct request *rq)
+{
+   sbitmap_queue_clear(&tags->bitmap_tags, rq->tag, rq->mq_ctx->cpu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_shadow_request);
+
+static void rq_copy(struct request *rq, struct request *src)
+{
+#define FIELD_COPY(dst, src, name) ((dst)->name = (src)->name)
+   FIELD_COPY(rq, src, cpu);
+   FIELD_COPY(rq, src, cmd_type);
+   FIELD_COPY(rq, src, cmd_flags);
+   rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
+   rq->rq_flags &= ~RQF_IO_STAT;
+   FIELD_COPY(rq, src, __data_len);
+   FIELD_COPY(rq, src, __sector);
+   FIELD_COPY(rq, src, bio);
+   FIELD_COPY(rq, src, biotail);
+   FIELD_COPY(rq, src, rq_disk);
+   FIELD_COPY(rq, src, part);
+   FIELD_COPY(rq, src, nr_phys_segments);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+   FIELD_COPY(rq, src, nr_integrity_segments);
+#endif
+   FIELD_COPY(rq, src, ioprio);
+   
