[PATCHSET v3] blk-mq scheduling framework

2016-12-14 Thread Jens Axboe
This is version 3 of the blk-mq scheduling framework. Version 2
was posted here:

https://marc.info/?l=linux-block&m=148122805026762&w=2

It's fully stable. In fact I'm running it on my laptop [1]. That may
or may not have been part of a dare. In any case, it's been stable
on that too, and has survived lengthy testing on dedicated test
boxes.

[1] $ cat /sys/block/nvme0n1/queue/scheduler
[mq-deadline] none

I'm still mentally debating whether to shift this over to have
duplicate request tags, one for the scheduler and one for the issue
side. We run into various issues if we do that, but we also get
rid of the shadow request field copying. I think both approaches
have their downsides. I originally considered both, and thought that
the shadow request would potentially be the cleanest.

I've rebased this against Linus' master branch, since a bunch of
the prep patches are now in, and the general block changes are in
as well.

The patches can be pulled here:

git://git.kernel.dk/linux-block blk-mq-sched.3

Changes since v2:

- Fix the Kconfig single/multi queue sched entry. Suggested by Bart.

- Move the queue ref put into the failure path of request allocation,
  so the caller doesn't have to know about it. Suggested by Bart.

- Add support for IO context management. Needed for the BFQ port.

- Change the anonymous elevator ops union to a named one, since
  old (looking at you, gcc 4.4) compilers don't support named
  initialization of anon unions.

- Constify the blk_mq_ops structure pointers.

- Add generic merging code, so mq-deadline (and others) don't have to
  handle/duplicate that.

- Switched the dispatch hook to be list based, so we can move more entries
  at a time, if we want/need to. From Omar.

- Add support for schedulers to continue using the software queues.
  From Omar.

- Ensure that it works with blk-wbt.

- Fix a failure case where registering the MQ elevator fails. We'd
  fall back to trying noop, which we'd find, but which doesn't work
  for MQ devices. Fall back to 'none' instead.

- Verified queue ref management.

- Fixed a bunch of bugs, and added a bunch of cleanups.

 block/Kconfig.iosched    |   37 ++
 block/Makefile           |    3
 block/blk-core.c         |   23 -
 block/blk-exec.c         |    3
 block/blk-flush.c        |    7
 block/blk-ioc.c          |    8
 block/blk-merge.c        |    4
 block/blk-mq-sched.c     |  394 +
 block/blk-mq-sched.h     |  192 ++
 block/blk-mq-tag.c       |    1
 block/blk-mq.c           |  226 +++-
 block/blk-mq.h           |   28 ++
 block/blk.h              |   26 +
 block/cfq-iosched.c      |    2
 block/deadline-iosched.c |    2
 block/elevator.c         |  229
 block/mq-deadline.c      |  638 +++
 block/noop-iosched.c     |    2
 drivers/nvme/host/pci.c  |    1
 include/linux/blk-mq.h   |    6
 include/linux/blkdev.h   |    2
 include/linux/elevator.h |   33 ++
 22 files changed, 1635 insertions(+), 232 deletions(-)

-- 
Jens Axboe



[PATCH 1/7] block: move existing elevator ops to union

2016-12-14 Thread Jens Axboe
Prep patch for adding MQ ops as well, since doing anon unions with
named initializers doesn't work on older compilers.
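
For illustration only (the demo struct below is made up, not code from the
patch), the compiler limitation being worked around looks roughly like this:

	struct demo_ops {
		union {				/* anonymous union */
			void (*sq_fn)(void);
			void (*mq_fn)(void);
		};
	};

	struct demo_ops_named {
		union {
			void (*sq_fn)(void);
			void (*mq_fn)(void);
		} sq;				/* named union, as in this series */
	};

	static void demo_fn(void) { }

	/* gcc 4.4 rejects a designated initializer that names a member of an
	 * anonymous union: */
	/* static struct demo_ops broken = { .sq_fn = demo_fn }; */

	/* The named form works on old and new compilers alike, which is why
	 * the initializers below become ".ops.sq = { ... }" style: */
	static struct demo_ops_named ok = { .sq.sq_fn = demo_fn };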

Signed-off-by: Jens Axboe 
---
 block/blk-ioc.c          |  8 +++
 block/blk-merge.c        |  4 ++--
 block/blk.h              | 10
 block/cfq-iosched.c      |  2 +-
 block/deadline-iosched.c |  2 +-
 block/elevator.c         | 60
 block/noop-iosched.c     |  2 +-
 include/linux/elevator.h |  4 +++-
 8 files changed, 47 insertions(+), 45 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 381cb50a673c..ab372092a57d 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -43,8 +43,8 @@ static void ioc_exit_icq(struct io_cq *icq)
if (icq->flags & ICQ_EXITED)
return;
 
-   if (et->ops.elevator_exit_icq_fn)
-   et->ops.elevator_exit_icq_fn(icq);
+   if (et->ops.sq.elevator_exit_icq_fn)
+   et->ops.sq.elevator_exit_icq_fn(icq);
 
icq->flags |= ICQ_EXITED;
 }
@@ -383,8 +383,8 @@ struct io_cq *ioc_create_icq(struct io_context *ioc, struct 
request_queue *q,
	if (likely(!radix_tree_insert(&ioc->icq_tree, q->id, icq))) {
		hlist_add_head(&icq->ioc_node, &ioc->icq_list);
		list_add(&icq->q_node, &q->icq_list);
-   if (et->ops.elevator_init_icq_fn)
-   et->ops.elevator_init_icq_fn(icq);
+   if (et->ops.sq.elevator_init_icq_fn)
+   et->ops.sq.elevator_init_icq_fn(icq);
} else {
kmem_cache_free(et->icq_cache, icq);
icq = ioc_lookup_icq(ioc, q);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 182398cb1524..480570b691dc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -763,8 +763,8 @@ int blk_attempt_req_merge(struct request_queue *q, struct 
request *rq,
 {
struct elevator_queue *e = q->elevator;
 
-   if (e->type->ops.elevator_allow_rq_merge_fn)
-   if (!e->type->ops.elevator_allow_rq_merge_fn(q, rq, next))
+   if (e->type->ops.sq.elevator_allow_rq_merge_fn)
+   if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
return 0;
 
return attempt_merge(q, rq, next);
diff --git a/block/blk.h b/block/blk.h
index 041185e5f129..f46c0ac8ae3d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -167,7 +167,7 @@ static inline struct request *__elv_next_request(struct 
request_queue *q)
return NULL;
}
if (unlikely(blk_queue_bypass(q)) ||
-   !q->elevator->type->ops.elevator_dispatch_fn(q, 0))
+   !q->elevator->type->ops.sq.elevator_dispatch_fn(q, 0))
return NULL;
}
 }
@@ -176,16 +176,16 @@ static inline void elv_activate_rq(struct request_queue 
*q, struct request *rq)
 {
struct elevator_queue *e = q->elevator;
 
-   if (e->type->ops.elevator_activate_req_fn)
-   e->type->ops.elevator_activate_req_fn(q, rq);
+   if (e->type->ops.sq.elevator_activate_req_fn)
+   e->type->ops.sq.elevator_activate_req_fn(q, rq);
 }
 
 static inline void elv_deactivate_rq(struct request_queue *q, struct request 
*rq)
 {
struct elevator_queue *e = q->elevator;
 
-   if (e->type->ops.elevator_deactivate_req_fn)
-   e->type->ops.elevator_deactivate_req_fn(q, rq);
+   if (e->type->ops.sq.elevator_deactivate_req_fn)
+   e->type->ops.sq.elevator_deactivate_req_fn(q, rq);
 }
 
 #ifdef CONFIG_FAIL_IO_TIMEOUT
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c73a6fcaeb9d..37aeb20fa454 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -4837,7 +4837,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 };
 
 static struct elevator_type iosched_cfq = {
-   .ops = {
+   .ops.sq = {
.elevator_merge_fn =cfq_merge,
.elevator_merged_fn =   cfq_merged_request,
.elevator_merge_req_fn =cfq_merged_requests,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 55e0bb6d7da7..05fc0ea25a98 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -439,7 +439,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 };
 
 static struct elevator_type iosched_deadline = {
-   .ops = {
+   .ops.sq = {
.elevator_merge_fn =deadline_merge,
.elevator_merged_fn =   deadline_merged_request,
.elevator_merge_req_fn =deadline_merged_requests,
diff --git a/block/elevator.c b/block/elevator.c
index 40f0c04e5ad3..022a26830297 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -58,8 +58,8 @@ static int elv_iosched_allow_bio_merge(struct request *rq, 
struct bio *bio)
struct request_queue *q = rq->q;
struct elevator_queue *e = q->elevator;
 
-   if (e->type->ops.elevator_allow_bio_merge_fn)

[PATCH 3/7] block: move rq_ioc() to blk.h

2016-12-14 Thread Jens Axboe
We want to use it outside of blk-core.c.

Signed-off-by: Jens Axboe 
---
 block/blk-core.c | 16
 block/blk.h      | 16
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..92baea07acbc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1040,22 +1040,6 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
 }
 
 /**
- * rq_ioc - determine io_context for request allocation
- * @bio: request being allocated is for this bio (can be %NULL)
- *
- * Determine io_context to use for request allocation for @bio.  May return
- * %NULL if %current->io_context doesn't exist.
- */
-static struct io_context *rq_ioc(struct bio *bio)
-{
-#ifdef CONFIG_BLK_CGROUP
-   if (bio && bio->bi_ioc)
-   return bio->bi_ioc;
-#endif
-   return current->io_context;
-}
-
-/**
  * __get_request - get a free request
  * @rl: request list to allocate from
  * @op: operation and flags
diff --git a/block/blk.h b/block/blk.h
index f46c0ac8ae3d..9a716b5925a4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -264,6 +264,22 @@ void ioc_clear_queue(struct request_queue *q);
 int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
 
 /**
+ * rq_ioc - determine io_context for request allocation
+ * @bio: request being allocated is for this bio (can be %NULL)
+ *
+ * Determine io_context to use for request allocation for @bio.  May return
+ * %NULL if %current->io_context doesn't exist.
+ */
+static inline struct io_context *rq_ioc(struct bio *bio)
+{
+#ifdef CONFIG_BLK_CGROUP
+   if (bio && bio->bi_ioc)
+   return bio->bi_ioc;
+#endif
+   return current->io_context;
+}
+
+/**
  * create_io_context - try to create task->io_context
  * @gfp_mask: allocation mask
  * @node: allocation node
-- 
2.7.4



[PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2016-12-14 Thread Jens Axboe
Signed-off-by: Jens Axboe 
---
 block/Kconfig.iosched |   6 +
 block/Makefile        |   1 +
 block/mq-deadline.c   | 638 ++
 3 files changed, 645 insertions(+)
 create mode 100644 block/mq-deadline.c

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9c4c48..490ef2850fae 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,6 +32,12 @@ config IOSCHED_CFQ
 
  This is the default I/O scheduler.
 
+config MQ_IOSCHED_DEADLINE
+   tristate "MQ deadline I/O scheduler"
+   default y
+   ---help---
+ MQ version of the deadline IO scheduler.
+
 config CFQ_GROUP_IOSCHED
bool "CFQ Group Scheduling support"
depends on IOSCHED_CFQ && BLK_CGROUP
diff --git a/block/Makefile b/block/Makefile
index 2eee9e1bb6db..3ee0abd7205a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)  += blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)  += cfq-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_DEADLINE)  += mq-deadline.o
 
 obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)   += cmdline-parser.o
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
new file mode 100644
index ..6114779cdc68
--- /dev/null
+++ b/block/mq-deadline.c
@@ -0,0 +1,638 @@
+/*
+ *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
+ *  for the blk-mq scheduling framework
+ *
+ *  Copyright (C) 2016 Jens Axboe 
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
+
+static inline bool dd_rq_is_shadow(struct request *rq)
+{
+   return rq->rq_flags & RQF_ALLOCED;
+}
+
+/*
+ * See Documentation/block/deadline-iosched.txt
+ */
+static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
+static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+static const int writes_starved = 2;    /* max times reads can starve a write */
+static const int fifo_batch = 16;       /* # of sequential requests treated as one
+					    by the above parameters. For throughput. */
+
+struct deadline_data {
+   /*
+* run time data
+*/
+
+   /*
+* requests (deadline_rq s) are present on both sort_list and fifo_list
+*/
+   struct rb_root sort_list[2];
+   struct list_head fifo_list[2];
+
+   /*
+* next in sort order. read, write or both are NULL
+*/
+   struct request *next_rq[2];
+   unsigned int batching;  /* number of sequential requests made */
+   unsigned int starved;   /* times reads have starved writes */
+
+   /*
+* settings that change how the i/o scheduler behaves
+*/
+   int fifo_expire[2];
+   int fifo_batch;
+   int writes_starved;
+   int front_merges;
+
+   spinlock_t lock;
+   struct list_head dispatch;
+   struct blk_mq_tags *tags;
+   atomic_t wait_index;
+};
+
+static inline struct rb_root *
+deadline_rb_root(struct deadline_data *dd, struct request *rq)
+{
+   return &dd->sort_list[rq_data_dir(rq)];
+}
+
+/*
+ * get the request after `rq' in sector-sorted order
+ */
+static inline struct request *
+deadline_latter_request(struct request *rq)
+{
+   struct rb_node *node = rb_next(&rq->rb_node);
+
+   if (node)
+   return rb_entry_rq(node);
+
+   return NULL;
+}
+
+static void
+deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+   struct rb_root *root = deadline_rb_root(dd, rq);
+
+   elv_rb_add(root, rq);
+}
+
+static inline void
+deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+   const int data_dir = rq_data_dir(rq);
+
+   if (dd->next_rq[data_dir] == rq)
+   dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+   elv_rb_del(deadline_rb_root(dd, rq), rq);
+}
+
+/*
+ * remove rq from rbtree and fifo.
+ */
+static void deadline_remove_request(struct request_queue *q, struct request *rq)
+{
+   struct deadline_data *dd = q->elevator->elevator_data;
+
+   list_del_init(&rq->queuelist);
+   deadline_del_rq_rb(dd, rq);
+
+   elv_rqhash_del(q, rq);
+   if (q->last_merge == rq)
+   q->last_merge = NULL;
+}
+
+static void dd_request_merged(struct request_queue *q, struct request *req,
+ int type)
+{
+   struct deadline_data *dd = q->elevator->elevator_data;
+
+   /*
+* if the merge was a front merge, we need to reposition request
+*/
+   if (type == ELEVATOR_FRONT_MERGE) {
+   elv_rb_del(deadline_rb_root(dd, req), 

[PATCH 7/7] blk-mq-sched: allow setting of default IO scheduler

2016-12-14 Thread Jens Axboe
Signed-off-by: Jens Axboe 
---
 block/Kconfig.iosched   | 43 +--
 block/blk-mq-sched.c    | 19 +++
 block/blk-mq-sched.h    |  2 ++
 block/blk-mq.c          |  3 +++
 block/elevator.c        |  5 -
 drivers/nvme/host/pci.c |  1 +
 include/linux/blk-mq.h  |  1 +
 7 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 490ef2850fae..96216cf18560 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,12 +32,6 @@ config IOSCHED_CFQ
 
  This is the default I/O scheduler.
 
-config MQ_IOSCHED_DEADLINE
-   tristate "MQ deadline I/O scheduler"
-   default y
-   ---help---
- MQ version of the deadline IO scheduler.
-
 config CFQ_GROUP_IOSCHED
bool "CFQ Group Scheduling support"
depends on IOSCHED_CFQ && BLK_CGROUP
@@ -69,6 +63,43 @@ config DEFAULT_IOSCHED
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP
 
+config MQ_IOSCHED_DEADLINE
+   tristate "MQ deadline I/O scheduler"
+   default y
+   ---help---
+ MQ version of the deadline IO scheduler.
+
+config MQ_IOSCHED_NONE
+   bool
+   default y
+
+choice
+   prompt "Default MQ I/O scheduler"
+   default MQ_IOSCHED_NONE
+   help
+ Select the I/O scheduler which will be used by default for all
+ blk-mq managed block devices.
+
+   config DEFAULT_MQ_DEADLINE
+   bool "MQ Deadline" if MQ_IOSCHED_DEADLINE=y
+
+   config DEFAULT_MQ_NONE
+   bool "None"
+
+endchoice
+
+config DEFAULT_MQ_IOSCHED
+   string
+   default "mq-deadline" if DEFAULT_MQ_DEADLINE
+   default "none" if DEFAULT_MQ_NONE
+
+config MQ_IOSCHED_ONLY_SQ
+   bool "Enable blk-mq IO scheduler only for single queue devices"
+   default y
+   help
+ Say Y here, if you only want to enable IO scheduling on block
+ devices that have a single queue registered.
+
 endmenu
 
 endif
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 02ad17258666..606d519b42ee 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -373,3 +373,22 @@ void blk_mq_sched_request_inserted(struct request *rq)
trace_block_rq_insert(rq->q, rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
+
+int blk_mq_sched_init(struct request_queue *q)
+{
+   int ret;
+
+#if defined(CONFIG_DEFAULT_MQ_NONE)
+   return 0;
+#endif
+#if defined(CONFIG_MQ_IOSCHED_ONLY_SQ)
+   if (q->nr_hw_queues > 1)
+   return 0;
+#endif
+
+   mutex_lock(&q->sysfs_lock);
+   ret = elevator_init(q, NULL);
+   mutex_unlock(&q->sysfs_lock);
+
+   return ret;
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index b68dccc0190e..e398412d3fcf 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -28,6 +28,8 @@ void blk_mq_sched_request_inserted(struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 
+int blk_mq_sched_init(struct request_queue *q);
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 
 static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d10a246a3bc7..48c28e1cb42a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2101,6 +2101,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
	INIT_LIST_HEAD(&q->requeue_list);
	spin_lock_init(&q->requeue_lock);
 
+   if (!(set->flags & BLK_MQ_F_NO_SCHED))
+   blk_mq_sched_init(q);
+
if (q->nr_hw_queues > 1)
blk_queue_make_request(q, blk_mq_make_request);
else
diff --git a/block/elevator.c b/block/elevator.c
index 6d39197768c1..7ad906689833 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -219,7 +219,10 @@ int elevator_init(struct request_queue *q, char *name)
}
 
if (!e) {
-   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+   if (q->mq_ops)
+   e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+   else
+   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d6e6bce93d0c..063410d9b3cc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1188,6 +1188,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->admin_tagset.timeout = ADMIN_TIMEOUT;
dev->admin_tagset.numa_node = dev_to_node(dev->dev);
dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+   dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
dev->admin_tagset.driver_data = dev;
 
if 

Re: [RFC] block: check partition alignment

2016-12-14 Thread Damien Le Moal

> sd.c ensures that the logical block size (sector size in sd.c) is a
> power of 2 between 512 and 4096. So you can use:
> 
> if (p.start & (bdev_physical_block_size(bdev) - 1))

Sorry, that was a little too short to be a complete proof:
sd.c ensures that the logical block size (sector size in sd.c) is a
power of 2 between 512 and 4096, and that the physical block size is a
power-of-2 multiple of the logical block size. So the physical block
size is also always a power of 2.

> 
> Or use div_u64_rem to avoid an error on 32 bits builds.
> 
> Best regards.
> 
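
Putting both suggestions together, a quick standalone sketch (simplified, for
illustration only; not the actual patch):

	#include <linux/blkdev.h>
	#include <linux/math64.h>

	/* p_start is the partition start in bytes.  Because bs is a power of
	 * two (see above), the mask test is exact and needs no 64-bit
	 * division. */
	static bool part_start_aligned(struct block_device *bdev, u64 p_start)
	{
		unsigned int bs = bdev_physical_block_size(bdev);

		return (p_start & (bs - 1)) == 0;
	}

	/* The div_u64_rem() variant mentioned above would be:
	 *
	 *	u32 rem;
	 *
	 *	div_u64_rem(p_start, bs, &rem);
	 *	return rem == 0;
	 */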

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com


[GIT PULL] Block IO fixes for 4.10-rc1

2016-12-14 Thread Jens Axboe
Hi Linus,

A few fixes that I collected post-merge. I was going to wait a bit
with sending this out, but the O_DIRECT fix should really go in sooner
rather than later.

Please pull!


  git://git.kernel.dk/linux-block.git for-linus



Andy Lutomirski (1):
  nvme/pci: Log PCI_STATUS when the controller dies

Gabriel Krisman Bertazi (2):
  blk-mq: Avoid memory reclaim when remapping queues
  blk-mq: Fix failed allocation path when mapping queues

NeilBrown (1):
  block_dev: don't test bdev->bd_contains when it is not stable

Shaohua Li (1):
  block_dev: don't update file access position for sync direct IO

 block/blk-mq.c          | 32
 drivers/nvme/host/pci.c | 22 +++---
 fs/block_dev.c          |  7 ++-
 3 files changed, 45 insertions(+), 16 deletions(-)

-- 
Jens Axboe



[GIT PULL] fs meta data unmap

2016-12-14 Thread Jens Axboe
Hi Linus,

A series from Jan Kara, providing a more efficient way of unmapping
metadata from the buffer cache than doing it block-by-block. It provides
a general helper that existing callers can use.

Please pull!


  git://git.kernel.dk/linux-block.git for-4.10/fs-unmap



Jan Kara (6):
  fs: Provide function to unmap metadata for a range of blocks
  direct-io: Use clean_bdev_aliases() instead of handmade iteration
  ext4: Use clean_bdev_aliases() instead of iteration
  ext2: Use clean_bdev_aliases() instead of iteration
  fs: Add helper to clean bdev aliases under a bh and use it
  fs: Remove unmap_underlying_metadata

 fs/buffer.c                 | 104 +++-
 fs/direct-io.c              |  28 +++-
 fs/ext2/inode.c             |   9 ++--
 fs/ext4/extents.c           |  13 +-
 fs/ext4/inode.c             |  18 +++-
 fs/ext4/page-io.c           |   2 +-
 fs/mpage.c                  |   3 +-
 fs/ntfs/aops.c              |   2 +-
 fs/ntfs/file.c              |   5 +--
 fs/ocfs2/aops.c             |   2 +-
 fs/ufs/balloc.c             |   3 +-
 fs/ufs/inode.c              |   3 +-
 include/linux/buffer_head.h |   7 ++-
 13 files changed, 104 insertions(+), 95 deletions(-)

-- 
Jens Axboe



Re: [PATCH v3] blk-mq: Fix failed allocation path when mapping queues

2016-12-14 Thread Jens Axboe
On 12/14/2016 01:48 PM, Gabriel Krisman Bertazi wrote:
> From: Gabriel Krisman Bertazi 
> 
> In blk_mq_map_swqueue, there is a memory optimization that frees the
> tags of a queue that has gone unmapped.  Later, if that hctx is remapped
> after another topology change, the tags need to be reallocated.
> 
> If this allocation fails, a simple WARN_ON triggers, but the block layer
> ends up with an active hctx without any corresponding set of tags.
> Then, any incoming IO to that hctx can trigger an Oops.
> 
> I can reproduce it consistently by running IO, flipping CPUs on and off
> and eventually injecting a memory allocation failure in that path.
> 
> In the fix below, if the system experiences a failed allocation of any
> hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
> should always keep its tags.  There is a minor performance hit, since
> our mapping just got worse after the error path, but this is
> the simplest solution to handle this error path.  The performance hit
> will disappear after another successful remap.
> 
> I considered dropping the memory optimization all together, but it
> seemed a bad trade-off to handle this very specific error case.

Thanks, this looks fine to me now. Both of your patches are queued
up for inclusion in 4.10.

-- 
Jens Axboe



Re: [RFC] block: check partition alignment

2016-12-14 Thread Christoph Hellwig
> To prevent partitions that are not aligned to the physical blocksize
> of a device check for the alignment in the blkpg_ioctl.

We'd also need to reject this when reading partitions from disk, right?

> + /* check if partition is aligned to blocksize */
> + if (p.start % bdev_physical_block_size(bdev) != 0)

And this should be bdev_logical_block_size, as the logical block size
is the only thing that matters for the OS - exposing the physical block
size is just an optional hint to prevent users from doing stupid
things (like creating unaligned partitions :))
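
Concretely, the amended check would be along these lines (a sketch, not a
posted patch):

	/* Reject partition starts not aligned to the logical block size,
	 * which is the granularity the OS actually issues I/O in. */
	if (p.start & (bdev_logical_block_size(bdev) - 1))
		return -EINVAL;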


[RFC] block: check partition alignment

2016-12-14 Thread Stefan Haberland
Partitions that are not aligned to the block size of a device may cause
invalid I/O requests because the block layer only cares about alignment
within the partition when building requests on partitions.

device
|4096|4096|4096|
partition offset 512byte
|-512-|4096|4096|4096|

When reading/writing one 4k block of the partition, this maps to
reading/writing the device at an offset of 512 bytes, leading to
unaligned requests for the device, which in turn may cause unexpected
behavior in the device driver.

For DASD devices we have to translate the block number into a cylinder,
head, record format. Unaligned requests lead to wrong calculations and
therefore to misdirected I/O. In a "good" case this leads to I/O
errors because the underlying hardware detects the wrong addressing.
In a worst-case scenario this might destroy data on the device.

To prevent partitions that are not aligned to the physical blocksize
of a device check for the alignment in the blkpg_ioctl.

Signed-off-by: Stefan Haberland 
---
 block/ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index 755119c..8b83afee 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -45,6 +45,9 @@ static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user
		    || pstart < 0 || plength < 0 || partno > 65535)
return -EINVAL;
}
+   /* check if partition is aligned to blocksize */
+   if (p.start % bdev_physical_block_size(bdev) != 0)
+   return -EINVAL;
 
	mutex_lock(&bdev->bd_mutex);
 
-- 
2.8.4



Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-14 Thread Jerome Glisse
On Wed, Dec 14, 2016 at 03:23:13PM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 08:07:58PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > > > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > > > > > > From kernel point of view such memory is almost like any other, it
> > > > > > > > has a struct page and most of the mm code is none the wiser, nor needs
> > > > > > > > to be about it. CPU access triggers a migration back to a regular
> > > > > > > > CPU-accessible page.
> > > > > > > 
> > > > > > > That sounds ... complex. Page migration on page cache access inside
> > > > > > > the filesystem IO path locking during read()/write() sounds like
> > > > > > > a great way to cause deadlocks
> > > > > > 
> > > > > > There are few restrictions on device pages: no one can do GUP on them and
> > > > > > thus no one can pin them. Hence they can always be migrated back. Yes,
> > > > > > each fs needs modification, but most of it (if not all) is isolated in
> > > > > > common filemap helpers.
> > > > > 
> > > > > Sure, but you haven't answered my question: how do you propose we
> > > > > address the issue of placing all the mm locks required for migration
> > > > > under the filesystem IO path locks?
> > > > 
> > > > Two different plans (which are not exclusive of each other). The first is
> > > > to use a workqueue and have read/write wait on the workqueue to be done
> > > > migrating the page back.
> > > 
> > > Pushing something to a workqueue and then waiting on the workqueue
> > > to complete the work doesn't change lock ordering problems - it
> > > just hides them away and makes them harder to debug.
> > 
> > Migration doesn't need many locks. Below is a list, and I don't see any
> > lock issue in respect to ->read or ->write.
> > 
> >  lock_page(page);
> >  spin_lock_irq(&mapping->tree_lock);
> >  lock_buffer(bh); // if page has buffer_head
> >  i_mmap_lock_read(mapping);
> >  vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > // page table lock for each entry
> >  }
> 
> We can't take the page or mapping tree locks that while we hold
> various filesystem locks.
> 
> e.g. The IO path lock order is, in places:
> 
> inode->i_rwsem
>   get page from page cache
>   lock_page(page)
>   inode->allocation lock
> zero page data
> 
> Filesystems are allowed to do this, because the IO path has
> guaranteed them access to the page cache data on the page that is
> locked. Your ZONE_DEVICE proposal breaks this guarantee - we might
> have a locked page, but we don't have access to it's data.
> 
> Further, in various filesystems once the allocation lock is taken
> (e.g. the i_lock in XFS) we're not allowed to lock pages or the
> mapping tree as that leads to deadlocks with truncate, hole punch,
> etc. Hence if the "zero page data" operation occurs on a ZONE_DEVICE page that
> requires migration before the zeroing can occur, we can't perform
> migration here.
> 
> Why are we even considering migration in situations where we already
> hold the ZONE_DEVICE page locked, hold other filesystem locks inside
> the page lock, and have an open dirty filesystem transaction as well?
> 
> Even if migration is possible and succeeds, the struct page in the
> mapping tree for the file offset we are operating on is going to be
> different after migration. That implies we need to completely
> restart the operation. But given that we've already made changes,
> backing out at this point is ...  complex and may not even be
> possible.

So I skimmed through the xfs code and I still think this is doable. So in the
above sequence:

  inode->i_rwsem
  page = find_get_page();
  if (device_unaddressable(page)) {
 page = migratepage();
  }
  ...

Now there are things like filemap_write_and_wait...(), but those can be
handled by the bio bounce buffer like I said, i.e. at the block layer we
allocate a temporary page; pages are already read-only on the device, as the
device obeys the regular rules like page_mkclean(). So the page content is
stable.

The page migration uses buffer_migrate_page() and I don't see any
deadlock there. So I am not seeing any problem in doing the migration early
on, right after page lookup.


> 
> i.e. we have an architectural assumption that page contents are
> always accessable when we have a locked struct page, and your
> proposal would appear to violate that assumption...

And it is: the data might be in device memory, but you can use a bounce
page to access it, and you can write-protect it on the device so that it
doesn't change.

Looking at xfs, it never does a kmap() directly, only through some
of the generic code, and those are places where we can use bounce 

Re: [PATCH RESEND v2 2/2] blk-mq: Avoid memory reclaim when remapping queues

2016-12-14 Thread Jens Axboe
On 12/06/2016 08:31 AM, Gabriel Krisman Bertazi wrote:
> While stressing memory and IO at the same time we changed SMT settings,
> we were able to consistently trigger deadlocks in the mm system, which
> froze the entire machine.
> 
> I think that under memory stress conditions, the large allocations
> performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
> waiting on the block layer remapping completion, thus deadlocking the
> system.  The trace below was collected after the machine stalled,
> waiting for the hotplug event completion.
> 
> The simplest fix for this is to make allocations in this path
> non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
> issue anymore.
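
In code terms the change boils down to allocating with GFP_NOIO on that path,
roughly like this (a minimal sketch with a hypothetical helper, not the actual
patch):

	#include <linux/slab.h>

	/* GFP_NOIO keeps the allocator from recursing into the I/O path --
	 * the very path whose queues are being remapped here. */
	static void *remap_alloc(size_t size, int node)
	{
		return kzalloc_node(size, GFP_NOIO, node);
	}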

This looks fine.

-- 
Jens Axboe



Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-14 Thread Jens Axboe
On 12/14/2016 03:31 AM, Bart Van Assche wrote:
> On 12/13/2016 04:14 PM, Jens Axboe wrote:
>> On 12/13/2016 06:56 AM, Bart Van Assche wrote:
>>> On 12/08/2016 09:13 PM, Jens Axboe wrote:
 +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
 +struct blk_mq_alloc_data 
 *data,
 +struct blk_mq_tags *tags,
 +atomic_t *wait_index)
 +{
>>>
>>> Using the word "shadow" in the function name suggests to me that there 
>>> is a shadow request for every request and a request for every shadow 
>>> request. However, my understanding from the code is that there can be 
>>> requests without shadow requests (for e.g. a flush) and shadow requests 
>>> without requests. Shouldn't the name of this function reflect that, e.g. 
>>> by using "sched" or "elv" in the function name instead of "shadow"?
>>
>> Shadow might not be the best name. Most do have shadows though, it's
>> only the rare exception like the flush, that you mention. I'll see if I
>> can come up with a better name.
> 
> Hello Jens,
> 
> One aspect of this patch series that might turn out to be a maintenance
> burden is the copying between original and shadow requests. It is easy
> to overlook that rq_copy() has to be updated if a field is ever
> added to struct request. Additionally, having to allocate two request
> structures per I/O instead of one will have a runtime overhead. Do you
> think the following approach would work?
> - Instead of using two request structures per I/O, only use a single
>   request structure.
> - Instead of storing one tag in the request structure, store two tags
>   in that structure. One tag comes from the I/O scheduler tag set
>   (size: nr_requests) and the other from the tag set associated with
>   the block driver (size: HBA queue depth).
> - Only add a request to the hctx dispatch list after a block driver tag
>   has been assigned. This means that an I/O scheduler must keep a
>   request structure on a list it manages itself as long as no block
>   driver tag has been assigned.
> - sysfs_list_show() is modified such that it shows both tags.

I have considered doing exactly that, but decided to go down the other path.
I may still revisit it; it's not that I'm a huge fan of the shadow requests
and the necessary copying. We don't update the request that often, so I
don't think it's going to be a big maintenance burden. But it'd be hard
to claim that it's super pretty...

I'll play with the idea.
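
For the sake of discussion, a rough sketch of the two-tag idea (the names
below are made up for illustration, not the eventual implementation):

	#include <linux/sbitmap.h>

	struct rq_two_tags {
		int sched_tag;	/* from the scheduler tag set (nr_requests) */
		int driver_tag;	/* from the driver tag set (HBA queue depth),
				 * -1 until dispatch time */
		/* ... the rest of struct request ... */
	};

	/* Only move a request to the hctx dispatch list once a driver tag has
	 * been obtained; until then the scheduler keeps the request on its
	 * own internal list. */
	static bool sketch_assign_driver_tag(struct rq_two_tags *rq,
					     struct sbitmap_queue *driver_tags)
	{
		int tag = __sbitmap_queue_get(driver_tags);

		if (tag < 0)
			return false;
		rq->driver_tag = tag;
		return true;
	}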

-- 
Jens Axboe



Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues

2016-12-14 Thread Douglas Miller

On 12/13/2016 06:39 PM, Gabriel Krisman Bertazi wrote:

On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:

In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any incoming IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
should always keep its tags.  There is a minor performance hit, since
our mapping just got worse after the error path, but this is
the simplest solution to handle this error path.  The performance hit
will disappear after another successful remap.

I considered dropping the memory optimization all together, but it
seemed a bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jens' for-next branch.

Hi,

I saw this patchset missed the first PR for 4.10, though I'd really like
to see it merged in this cycle, if possible.  If you see anything
particularly concerning about this that I missed, please let me know,
but I fear this has been around for a while without much feedback.

Thanks,

I want to repeat what Gabriel says. We are now seeing evidence of the
condition often. This really needs to get fixed soon.


Thanks,



Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-14 Thread Jan Kara
On Tue 13-12-16 16:24:33, Jerome Glisse wrote:
> On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > I would like to discuss un-addressable device memory in the context of
> > > > > filesystem and block device. Specifically how to handle write-back, 
> > > > > read,
> > > > > ... when a filesystem page is migrated to device memory that CPU can 
> > > > > not
> > > > > access.
> > > > 
> > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > becomes non-DAX capable?
> > > > 
> > > > If you are not talking about pmem and DAX, then exactly what does
> > > > "when a filesystem page is migrated to device memory that CPU can
> > > > not access" mean? What "filesystem page" are we talking about that
> > > > can get migrated from main RAM to something the CPU can't access?
> > > 
> > > I am talking about GPU, FPGA, ... any PCIE device that has fast on
> > > board memory that can not be exposed transparently to the CPU. I am
> > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > https://lwn.net/Articles/706856/
> > 
> > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> 
> Well not only target, it can be source too. But the device can read
> and write any system memory and dma to/from that memory to its on
> board memory.
> 
> > 
> > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > "regular" filesystem back by a "regular" block device. I want to be
> > > able to migrate mmaped area of such filesystem to device memory while
> > > the device is actively using that memory.
> > 
> > "migrate mmapped area of such filesystem" means what, exactly?
> 
> fd = open("/path/to/some/file")
> ptr = mmap(fd, ...);
> gpu_compute_something(ptr);
> 
> > 
> > Are you talking about file data contents that have been copied into
> > the page cache and mmapped into a user process address space?
> > IOWs, migrating ZONE_NORMAL page cache page content and state
> > to a new ZONE_DEVICE page, and then migrating back again somehow?
> 
> Take any existing application that mmaps a file and allow migrating
> chunks of that mmaped file to device memory without the application
> even knowing about it. So nothing special in respect to that mmaped
> file. It is a regular file on your filesystem.

OK, so I share most of Dave's concerns about this. But let's talk about
what we can do and what you need and we may find something usable. First
let me understand what is doable / what are the costs on your side.

So we have a page cache page that you'd like to migrate to the device.
Fine. You are willing to sacrifice direct IO - even better. We can fall
back to buffered IO in that case (well, except for XFS which does not do it
but that's a minor detail). One thing I'm not sure about: when a page is
migrated to the device, are its contents still available and just possibly
stale, or will something bad happen if we try to access (or even modify) the
page data?

And by migration you really mean page migration? Be aware that migration of
pagecache pages may be a problem on its own for some pages of some
filesystems - e.g. page migration may fail because there is an outstanding
filesystem transaction modifying that page. For userspace these will be
sporadic errors that are really hard to understand, because they come from
filesystem internals. So far page migration has been widely used only for
free space defragmentation, and for that purpose, if a page is not migratable
for a minute, who cares.
So won't it be easier to leave the pagecache page where it is and *copy* it
to the device? Can the device notify us *before* it is going to modify a
page, not just after it has modified it? Possibly by just giving it the
page read-only, so that it has to ask the CPU for write permission? If yes,
then I believe this could work and even fs support should be doable.
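
To make that last idea a bit more concrete, the CPU-side write protection
could look roughly like this (a heavily simplified sketch, not a proposal):

	#include <linux/mm.h>
	#include <linux/rmap.h>
	#include <linux/pagemap.h>

	/* Write-protect all CPU mappings of a pagecache page before handing
	 * it to the device, so its contents stay stable until the device
	 * asks for write access. */
	static void share_page_with_device_ro(struct page *page)
	{
		lock_page(page);
		if (page_mapped(page))
			page_mkclean(page);
		/* ... then set up the device's (read-only) mapping ... */
		unlock_page(page);
	}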

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] block_dev: don't update file access position for sync direct IO

2016-12-14 Thread Christoph Hellwig
On Tue, Dec 13, 2016 at 09:08:18PM -0700, Jens Axboe wrote:
> That's not great... Thanks, added.

Ooops, yeah - we're still going through ->direct_IO for block devices.
I'll take a stab at removing that, as it's just a pointless indirect
call.


Re: [PATCH 6/6] media/cobalt: use pci_irq_allocate_vectors

2016-12-14 Thread Hans Verkuil

On 14/12/16 11:29, Christoph Hellwig wrote:

Hi Hans,

just checked the current Linux tree and cobalt still uses the old
pci_enable_msi_range call.  Did you queue this patch up for 4.10?



Completely forgot this. Is it OK to queue it for 4.11? Or is it blocking
other follow-up work you want to do for 4.10?

Regards,

Hans


Re: [PATCH 6/6] media/cobalt: use pci_irq_allocate_vectors

2016-12-14 Thread Christoph Hellwig
Hi Hans,

just checked the current Linux tree and cobalt still uses the old
pci_enable_msi_range call.  Did you queue this patch up for 4.10?
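
For reference, the conversion in question follows the usual pattern, roughly
(a sketch of the general idiom; the vector counts here are placeholders, not
taken from the cobalt patch):

	/* Old style: */
	ret = pci_enable_msi_range(pdev, 1, 1);

	/* New style: */
	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);
	if (ret < 0)
		return ret;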


Re: [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2016-12-14 Thread Bart Van Assche
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static inline bool dd_rq_is_shadow(struct request *rq)
> +{
> + return rq->rq_flags & RQF_ALLOCED;
> +}

Hello Jens,

Something minor: because req_flags_t has been defined using __bitwise
(typedef __u32 __bitwise req_flags_t) sparse complains for the above
function about converting req_flags_t into bool. How about changing the
body of that function into "return (rq->rq_flags & RQF_ALLOCED) != 0" to
keep sparse happy?

Bart.