[PATCHSET v3] blk-mq scheduling framework
This is version 3 of the blk-mq scheduling framework. Version 2 was posted
here:

https://marc.info/?l=linux-block&m=148122805026762&w=2

It's fully stable. In fact, I'm running it on my laptop [1]. That may or
may not have been part of a dare. In any case, it's been stable on that
too, and has survived lengthy testing on dedicated test boxes.

[1] $ cat /sys/block/nvme0n1/queue/scheduler
    [mq-deadline] none

I'm still mentally debating whether to shift this over to having duplicate
request tags, one for the scheduler and one for the issue side. We run
into various issues if we do that, but we also get rid of the shadow
request field copying. I think both approaches have their downsides. I
originally considered both, and thought that the shadow request would
potentially be the cleanest.

I've rebased this against Linus' master branch, since a bunch of the prep
patches are now in, and the general block changes are in as well. The
patches can be pulled here:

git://git.kernel.dk/linux-block blk-mq-sched.3

Changes since v2:

- Fix the Kconfig single/multi queue sched entry. Suggested by Bart.

- Move the queue ref put into the failure path of the request getting,
  so the caller doesn't have to know about it. Suggested by Bart.

- Add support for IO context management. Needed for the BFQ port.

- Change the anonymous elevator ops union to a named one, since old
  (looking at you, gcc 4.4) compilers don't support named initialization
  of anonymous unions.

- Constify the blk_mq_ops structure pointers.

- Add generic merging code, so mq-deadline (and others) don't have to
  handle/duplicate that.

- Switch the dispatch hook to be list based, so we can move more entries
  at a time, if we want/need to. From Omar.

- Add support for schedulers to continue using the software queues. From
  Omar.

- Ensure that it works with blk-wbt.

- Fix a failure case if we fail registering the MQ elevator. We'd fall
  back to trying noop, which we'd find, but that would not work for MQ
  devices. Fall back to 'none' instead.
- Verified queue ref management.

- Fixed a bunch of bugs, and added a bunch of cleanups.

 block/Kconfig.iosched    |   37 ++
 block/Makefile           |    3
 block/blk-core.c         |   23 -
 block/blk-exec.c         |    3
 block/blk-flush.c        |    7
 block/blk-ioc.c          |    8
 block/blk-merge.c        |    4
 block/blk-mq-sched.c     |  394 +
 block/blk-mq-sched.h     |  192 ++
 block/blk-mq-tag.c       |    1
 block/blk-mq.c           |  226 +++-
 block/blk-mq.h           |   28 ++
 block/blk.h              |   26 +
 block/cfq-iosched.c      |    2
 block/deadline-iosched.c |    2
 block/elevator.c         |  229
 block/mq-deadline.c      |  638 +++
 block/noop-iosched.c     |    2
 drivers/nvme/host/pci.c  |    1
 include/linux/blk-mq.h   |    6
 include/linux/blkdev.h   |    2
 include/linux/elevator.h |   33 ++
 22 files changed, 1635 insertions(+), 232 deletions(-)

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] block: move existing elevator ops to union
Prep patch for adding MQ ops as well, since doing anon unions with named
initializers doesn't work on older compilers.

Signed-off-by: Jens Axboe
---
 block/blk-ioc.c          |  8 +++
 block/blk-merge.c        |  4 ++--
 block/blk.h              | 10
 block/cfq-iosched.c      |  2 +-
 block/deadline-iosched.c |  2 +-
 block/elevator.c         | 60
 block/noop-iosched.c     |  2 +-
 include/linux/elevator.h |  4 +++-
 8 files changed, 47 insertions(+), 45 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 381cb50a673c..ab372092a57d 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -43,8 +43,8 @@ static void ioc_exit_icq(struct io_cq *icq)
 	if (icq->flags & ICQ_EXITED)
 		return;
 
-	if (et->ops.elevator_exit_icq_fn)
-		et->ops.elevator_exit_icq_fn(icq);
+	if (et->ops.sq.elevator_exit_icq_fn)
+		et->ops.sq.elevator_exit_icq_fn(icq);
 
 	icq->flags |= ICQ_EXITED;
 }
@@ -383,8 +383,8 @@ struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
 	if (likely(!radix_tree_insert(&ioc->icq_tree, q->id, icq))) {
 		hlist_add_head(&icq->ioc_node, &ioc->icq_list);
 		list_add(&icq->q_node, &q->icq_list);
-		if (et->ops.elevator_init_icq_fn)
-			et->ops.elevator_init_icq_fn(icq);
+		if (et->ops.sq.elevator_init_icq_fn)
+			et->ops.sq.elevator_init_icq_fn(icq);
 	} else {
 		kmem_cache_free(et->icq_cache, icq);
 		icq = ioc_lookup_icq(ioc, q);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 182398cb1524..480570b691dc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -763,8 +763,8 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_allow_rq_merge_fn)
-		if (!e->type->ops.elevator_allow_rq_merge_fn(q, rq, next))
+	if (e->type->ops.sq.elevator_allow_rq_merge_fn)
+		if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
 			return 0;
 
 	return attempt_merge(q, rq, next);
diff --git a/block/blk.h b/block/blk.h
index 041185e5f129..f46c0ac8ae3d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -167,7 +167,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 			return NULL;
 		}
 		if (unlikely(blk_queue_bypass(q)) ||
-		    !q->elevator->type->ops.elevator_dispatch_fn(q, 0))
+		    !q->elevator->type->ops.sq.elevator_dispatch_fn(q, 0))
 			return NULL;
 	}
 }
@@ -176,16 +176,16 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_activate_req_fn)
-		e->type->ops.elevator_activate_req_fn(q, rq);
+	if (e->type->ops.sq.elevator_activate_req_fn)
+		e->type->ops.sq.elevator_activate_req_fn(q, rq);
 }
 
 static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_deactivate_req_fn)
-		e->type->ops.elevator_deactivate_req_fn(q, rq);
+	if (e->type->ops.sq.elevator_deactivate_req_fn)
+		e->type->ops.sq.elevator_deactivate_req_fn(q, rq);
 }
 
 #ifdef CONFIG_FAIL_IO_TIMEOUT
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c73a6fcaeb9d..37aeb20fa454 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -4837,7 +4837,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 };
 
 static struct elevator_type iosched_cfq = {
-	.ops = {
+	.ops.sq = {
 		.elevator_merge_fn =		cfq_merge,
 		.elevator_merged_fn =		cfq_merged_request,
 		.elevator_merge_req_fn =	cfq_merged_requests,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 55e0bb6d7da7..05fc0ea25a98 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -439,7 +439,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 };
 
 static struct elevator_type iosched_deadline = {
-	.ops = {
+	.ops.sq = {
 		.elevator_merge_fn =		deadline_merge,
 		.elevator_merged_fn =		deadline_merged_request,
 		.elevator_merge_req_fn =	deadline_merged_requests,
diff --git a/block/elevator.c b/block/elevator.c
index 40f0c04e5ad3..022a26830297 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -58,8 +58,8 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 
-	if (e->type->ops.elevator_allow_bio_merge_fn)
[PATCH 3/7] block: move rq_ioc() to blk.h
We want to use it outside of blk-core.c.

Signed-off-by: Jens Axboe
---
 block/blk-core.c | 16
 block/blk.h      | 16
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..92baea07acbc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1040,22 +1040,6 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
 }
 
 /**
- * rq_ioc - determine io_context for request allocation
- * @bio: request being allocated is for this bio (can be %NULL)
- *
- * Determine io_context to use for request allocation for @bio. May return
- * %NULL if %current->io_context doesn't exist.
- */
-static struct io_context *rq_ioc(struct bio *bio)
-{
-#ifdef CONFIG_BLK_CGROUP
-	if (bio && bio->bi_ioc)
-		return bio->bi_ioc;
-#endif
-	return current->io_context;
-}
-
-/**
  * __get_request - get a free request
  * @rl: request list to allocate from
  * @op: operation and flags
diff --git a/block/blk.h b/block/blk.h
index f46c0ac8ae3d..9a716b5925a4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -264,6 +264,22 @@ void ioc_clear_queue(struct request_queue *q);
 int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
 
 /**
+ * rq_ioc - determine io_context for request allocation
+ * @bio: request being allocated is for this bio (can be %NULL)
+ *
+ * Determine io_context to use for request allocation for @bio. May return
+ * %NULL if %current->io_context doesn't exist.
+ */
+static inline struct io_context *rq_ioc(struct bio *bio)
+{
+#ifdef CONFIG_BLK_CGROUP
+	if (bio && bio->bi_ioc)
+		return bio->bi_ioc;
+#endif
+	return current->io_context;
+}
+
+/**
  * create_io_context - try to create task->io_context
  * @gfp_mask: allocation mask
  * @node: allocation node
-- 
2.7.4
[PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
Signed-off-by: Jens Axboe
---
 block/Kconfig.iosched |   6 +
 block/Makefile        |   1 +
 block/mq-deadline.c   | 638 ++
 3 files changed, 645 insertions(+)
 create mode 100644 block/mq-deadline.c

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9c4c48..490ef2850fae 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,6 +32,12 @@ config IOSCHED_CFQ
 
 	  This is the default I/O scheduler.
 
+config MQ_IOSCHED_DEADLINE
+	tristate "MQ deadline I/O scheduler"
+	default y
+	---help---
+	  MQ version of the deadline IO scheduler.
+
 config CFQ_GROUP_IOSCHED
 	bool "CFQ Group Scheduling support"
 	depends on IOSCHED_CFQ && BLK_CGROUP
diff --git a/block/Makefile b/block/Makefile
index 2eee9e1bb6db..3ee0abd7205a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
new file mode 100644
index ..6114779cdc68
--- /dev/null
+++ b/block/mq-deadline.c
@@ -0,0 +1,638 @@
+/*
+ * MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
+ * for the blk-mq scheduling framework
+ *
+ * Copyright (C) 2016 Jens Axboe
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
+
+static inline bool dd_rq_is_shadow(struct request *rq)
+{
+	return rq->rq_flags & RQF_ALLOCED;
+}
+
+/*
+ * See Documentation/block/deadline-iosched.txt
+ */
+static const int read_expire = HZ / 2;	/* max time before a read is submitted. */
+static const int write_expire = 5 * HZ;	/* ditto for writes, these limits are SOFT! */
+static const int writes_starved = 2;	/* max times reads can starve a write */
+static const int fifo_batch = 16;	/* # of sequential requests treated as one
+					   by the above parameters. For throughput. */
+
+struct deadline_data {
+	/*
+	 * run time data
+	 */
+
+	/*
+	 * requests (deadline_rq s) are present on both sort_list and fifo_list
+	 */
+	struct rb_root sort_list[2];
+	struct list_head fifo_list[2];
+
+	/*
+	 * next in sort order. read, write or both are NULL
+	 */
+	struct request *next_rq[2];
+	unsigned int batching;		/* number of sequential requests made */
+	unsigned int starved;		/* times reads have starved writes */
+
+	/*
+	 * settings that change how the i/o scheduler behaves
+	 */
+	int fifo_expire[2];
+	int fifo_batch;
+	int writes_starved;
+	int front_merges;
+
+	spinlock_t lock;
+	struct list_head dispatch;
+	struct blk_mq_tags *tags;
+	atomic_t wait_index;
+};
+
+static inline struct rb_root *
+deadline_rb_root(struct deadline_data *dd, struct request *rq)
+{
+	return &dd->sort_list[rq_data_dir(rq)];
+}
+
+/*
+ * get the request after `rq' in sector-sorted order
+ */
+static inline struct request *
+deadline_latter_request(struct request *rq)
+{
+	struct rb_node *node = rb_next(&rq->rb_node);
+
+	if (node)
+		return rb_entry_rq(node);
+
+	return NULL;
+}
+
+static void
+deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+	struct rb_root *root = deadline_rb_root(dd, rq);
+
+	elv_rb_add(root, rq);
+}
+
+static inline void
+deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+	const int data_dir = rq_data_dir(rq);
+
+	if (dd->next_rq[data_dir] == rq)
+		dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+	elv_rb_del(deadline_rb_root(dd, rq), rq);
+}
+
+/*
+ * remove rq from rbtree and fifo.
+ */
+static void deadline_remove_request(struct request_queue *q, struct request *rq)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	list_del_init(&rq->queuelist);
+	deadline_del_rq_rb(dd, rq);
+
+	elv_rqhash_del(q, rq);
+	if (q->last_merge == rq)
+		q->last_merge = NULL;
+}
+
+static void dd_request_merged(struct request_queue *q, struct request *req,
+			      int type)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	/*
+	 * if the merge was a front merge, we need to reposition request
+	 */
+	if (type == ELEVATOR_FRONT_MERGE) {
+		elv_rb_del(deadline_rb_root(dd, req),
[PATCH 7/7] blk-mq-sched: allow setting of default IO scheduler
Signed-off-by: Jens Axboe
---
 block/Kconfig.iosched   | 43 +--
 block/blk-mq-sched.c    | 19 +++
 block/blk-mq-sched.h    |  2 ++
 block/blk-mq.c          |  3 +++
 block/elevator.c        |  5 -
 drivers/nvme/host/pci.c |  1 +
 include/linux/blk-mq.h  |  1 +
 7 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 490ef2850fae..96216cf18560 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,12 +32,6 @@ config IOSCHED_CFQ
 
 	  This is the default I/O scheduler.
 
-config MQ_IOSCHED_DEADLINE
-	tristate "MQ deadline I/O scheduler"
-	default y
-	---help---
-	  MQ version of the deadline IO scheduler.
-
 config CFQ_GROUP_IOSCHED
 	bool "CFQ Group Scheduling support"
 	depends on IOSCHED_CFQ && BLK_CGROUP
@@ -69,6 +63,43 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config MQ_IOSCHED_DEADLINE
+	tristate "MQ deadline I/O scheduler"
+	default y
+	---help---
+	  MQ version of the deadline IO scheduler.
+
+config MQ_IOSCHED_NONE
+	bool
+	default y
+
+choice
+	prompt "Default MQ I/O scheduler"
+	default MQ_IOSCHED_NONE
+	help
+	  Select the I/O scheduler which will be used by default for all
+	  blk-mq managed block devices.
+
+	config DEFAULT_MQ_DEADLINE
+		bool "MQ Deadline" if MQ_IOSCHED_DEADLINE=y
+
+	config DEFAULT_MQ_NONE
+		bool "None"
+
+endchoice
+
+config DEFAULT_MQ_IOSCHED
+	string
+	default "mq-deadline" if DEFAULT_MQ_DEADLINE
+	default "none" if DEFAULT_MQ_NONE
+
+config MQ_IOSCHED_ONLY_SQ
+	bool "Enable blk-mq IO scheduler only for single queue devices"
+	default y
+	help
+	  Say Y here, if you only want to enable IO scheduling on block
+	  devices that have a single queue registered.
+
 endmenu
 
 endif
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 02ad17258666..606d519b42ee 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -373,3 +373,22 @@ void blk_mq_sched_request_inserted(struct request *rq)
 	trace_block_rq_insert(rq->q, rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
+
+int blk_mq_sched_init(struct request_queue *q)
+{
+	int ret;
+
+#if defined(CONFIG_DEFAULT_MQ_NONE)
+	return 0;
+#endif
+#if defined(CONFIG_MQ_IOSCHED_ONLY_SQ)
+	if (q->nr_hw_queues > 1)
+		return 0;
+#endif
+
+	mutex_lock(&q->sysfs_lock);
+	ret = elevator_init(q, NULL);
+	mutex_unlock(&q->sysfs_lock);
+
+	return ret;
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index b68dccc0190e..e398412d3fcf 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -28,6 +28,8 @@ void blk_mq_sched_request_inserted(struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 
+int blk_mq_sched_init(struct request_queue *q);
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 
 static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d10a246a3bc7..48c28e1cb42a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2101,6 +2101,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&q->requeue_list);
 	spin_lock_init(&q->requeue_lock);
 
+	if (!(set->flags & BLK_MQ_F_NO_SCHED))
+		blk_mq_sched_init(q);
+
 	if (q->nr_hw_queues > 1)
 		blk_queue_make_request(q, blk_mq_make_request);
 	else
diff --git a/block/elevator.c b/block/elevator.c
index 6d39197768c1..7ad906689833 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -219,7 +219,10 @@ int elevator_init(struct request_queue *q, char *name)
 	}
 
 	if (!e) {
-		e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+		if (q->mq_ops)
+			e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+		else
+			e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
 		if (!e) {
 			printk(KERN_ERR "Default I/O scheduler not found. " \
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d6e6bce93d0c..063410d9b3cc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1188,6 +1188,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
 		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
 		dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
 		dev->admin_tagset.driver_data = dev;
 
 		if
Re: [RFC] block: check partition alignment
> sd.c ensures that the logical block size (sector size in sd.c) is a
> power of 2 between 512 and 4096. So you can use:
>
> if (p.start & (bdev_physical_block_size(bdev) - 1))

Sorry, that was a little too short as a complete proof: sd.c ensures that
the logical block size (sector size in sd.c) is a power of 2 between 512
and 4096, and the physical block size is a power of 2 number of logical
blocks. So the physical block size is also always a power of 2.

> Or use div_u64_rem to avoid an error on 32 bits builds.
>
> Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com
[GIT PULL] Block IO fixes for 4.10-rc1
Hi Linus,

A few fixes that I collected as post-merge. I was going to wait a bit with
sending this out, but the O_DIRECT fix should really go in sooner rather
than later. Please pull!

  git://git.kernel.dk/linux-block.git for-linus

Andy Lutomirski (1):
      nvme/pci: Log PCI_STATUS when the controller dies

Gabriel Krisman Bertazi (2):
      blk-mq: Avoid memory reclaim when remapping queues
      blk-mq: Fix failed allocation path when mapping queues

NeilBrown (1):
      block_dev: don't test bdev->bd_contains when it is not stable

Shaohua Li (1):
      block_dev: don't update file access position for sync direct IO

 block/blk-mq.c          | 32
 drivers/nvme/host/pci.c | 22 +++---
 fs/block_dev.c          |  7 ++-
 3 files changed, 45 insertions(+), 16 deletions(-)

-- 
Jens Axboe
[GIT PULL] fs meta data unmap
Hi Linus,

A series from Jan Kara, providing a more efficient way of unmapping meta
data from the buffer cache than doing it block-by-block. It provides a
general helper that existing callers can use. Please pull!

  git://git.kernel.dk/linux-block.git for-4.10/fs-unmap

Jan Kara (6):
      fs: Provide function to unmap metadata for a range of blocks
      direct-io: Use clean_bdev_aliases() instead of handmade iteration
      ext4: Use clean_bdev_aliases() instead of iteration
      ext2: Use clean_bdev_aliases() instead of iteration
      fs: Add helper to clean bdev aliases under a bh and use it
      fs: Remove unmap_underlying_metadata

 fs/buffer.c                 | 104 +++-
 fs/direct-io.c              |  28 +++-
 fs/ext2/inode.c             |   9 ++--
 fs/ext4/extents.c           |  13 +-
 fs/ext4/inode.c             |  18 +++-
 fs/ext4/page-io.c           |   2 +-
 fs/mpage.c                  |   3 +-
 fs/ntfs/aops.c              |   2 +-
 fs/ntfs/file.c              |   5 +--
 fs/ocfs2/aops.c             |   2 +-
 fs/ufs/balloc.c             |   3 +-
 fs/ufs/inode.c              |   3 +-
 include/linux/buffer_head.h |   7 ++-
 13 files changed, 104 insertions(+), 95 deletions(-)

-- 
Jens Axboe
Re: [PATCH v3] blk-mq: Fix failed allocation path when mapping queues
On 12/14/2016 01:48 PM, Gabriel Krisman Bertazi wrote:
> From: Gabriel Krisman Bertazi
>
> In blk_mq_map_swqueue, there is a memory optimization that frees the
> tags of a queue that has gone unmapped. Later, if that hctx is remapped
> after another topology change, the tags need to be reallocated.
>
> If this allocation fails, a simple WARN_ON triggers, but the block layer
> ends up with an active hctx without any corresponding set of tags.
> Then, any incoming IO to that hctx can trigger an Oops.
>
> I can reproduce it consistently by running IO, flipping CPUs on and off
> and eventually injecting a memory allocation failure in that path.
>
> In the fix below, if the system experiences a failed allocation of any
> hctx's tags, we remap all the ctxs of that queue to hctx_0, which
> should always keep its tags. There is a minor performance hit, since
> our mapping just got worse after the error path, but this is
> the simplest solution to handle this error path. The performance hit
> will disappear after another successful remap.
>
> I considered dropping the memory optimization all together, but it
> seemed a bad trade-off to handle this very specific error case.

Thanks, this looks fine to me now. Both of your patches are queued up
for inclusion in 4.10.

-- 
Jens Axboe
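[Editorial aside: the fallback described in the patch can be illustrated with a small standalone sketch. Toy types only — `toy_queue`/`toy_map_swqueue` are hypothetical stand-ins, not the real blk-mq structures — showing the idea that if any hw context fails to reallocate tags, every sw context is pointed back at hctx 0, which always keeps its tags:]

```c
#include <assert.h>

#define NR_CTX  8
#define NR_HCTX 4

/* Toy model of the error path: has_tags[i] records whether hw context
 * i managed to (re)allocate its tag set after a topology change. */
struct toy_queue {
	int has_tags[NR_HCTX];
	int ctx_map[NR_CTX];	/* sw ctx index -> hw ctx index */
};

static void toy_map_swqueue(struct toy_queue *q)
{
	int i, failed = 0;

	for (i = 0; i < NR_HCTX; i++)
		if (!q->has_tags[i])
			failed = 1;

	/* normal case: spread sw ctxs over hw ctxs; failure case: map
	 * everything to hctx 0 so no tag-less hctx ever sees IO */
	for (i = 0; i < NR_CTX; i++)
		q->ctx_map[i] = failed ? 0 : i % NR_HCTX;
}
```

The mapping is worse after a failure (everything funnels through one hctx), but no request can ever reach a context without tags, which is the Oops the patch closes.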
Re: [RFC] block: check partition alignment
> To prevent partitions that are not aligned to the physical blocksize
> of a device check for the alignment in the blkpg_ioctl.

We'd also need to reject this when reading partitions from disk, right?

> +	/* check if partition is aligned to blocksize */
> +	if (p.start % bdev_physical_block_size(bdev) != 0)

And this should be bdev_logical_block_size, as the logical block size is
the only thing that matters for the OS - exposing the physical block size
is just an optional hint to prevent users from doing stupid things (like
creating unaligned partitions :))
[RFC] block: check partition alignment
Partitions that are not aligned to the blocksize of a device may cause
invalid I/O requests, because the block layer cares only about alignment
within the partition when building requests on partitions.

device
|4096|4096|4096|
partition offset 512 byte
|-512-|4096|4096|4096|

When reading/writing one 4k block of the partition, this maps to
reading/writing with an offset of 512 bytes into the device, leading to
unaligned requests for the device, which in turn may cause unexpected
behavior of the device driver.

For DASD devices we have to translate the block number into a cylinder,
head, record format. The unaligned requests lead to a wrong calculation
and therefore to misdirected I/O. In a "good" case this leads to I/O
errors, because the underlying hardware detects the wrong addressing.
In a worst case scenario this might destroy data on the device.

To prevent partitions that are not aligned to the physical blocksize of
a device, check for the alignment in the blkpg_ioctl.

Signed-off-by: Stefan Haberland
---
 block/ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index 755119c..8b83afee 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -45,6 +45,9 @@ static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user
 			    || pstart < 0 || plength < 0 || partno > 65535)
 				return -EINVAL;
 		}
+		/* check if partition is aligned to blocksize */
+		if (p.start % bdev_physical_block_size(bdev) != 0)
+			return -EINVAL;
 
 		mutex_lock(&bdev->bd_mutex);
-- 
2.8.4
Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
On Wed, Dec 14, 2016 at 03:23:13PM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 08:07:58PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > > > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > > > > > > From kernel point of view such memory is almost like any
> > > > > > > > other, it has a struct page and most of the mm code is none
> > > > > > > > the wiser, nor needs to be about it. CPU access triggers a
> > > > > > > > migration back to a regular CPU accessible page.
> > > > > > >
> > > > > > > That sounds ... complex. Page migration on page cache access
> > > > > > > inside the filesystem IO path locking during read()/write()
> > > > > > > sounds like a great way to cause deadlocks...
> > > > > >
> > > > > > There are few restrictions on device pages: no one can do GUP
> > > > > > on them and thus no one can pin them. Hence they can always be
> > > > > > migrated back. Yes, each fs needs modification, but most of it
> > > > > > (if not all) is isolated in common filemap helpers.
> > > > >
> > > > > Sure, but you haven't answered my question: how do you propose we
> > > > > address the issue of placing all the mm locks required for
> > > > > migration under the filesystem IO path locks?
> > > >
> > > > Two different plans (which are non exclusive of each other). First
> > > > is to use a workqueue and have read/write wait on the workqueue to
> > > > be done migrating the page back.
> > >
> > > Pushing something to a workqueue and then waiting on the workqueue
> > > to complete the work doesn't change lock ordering problems - it
> > > just hides them away and makes them harder to debug.
> >
> > Migration doesn't need many locks; below is a list and I don't see any
> > lock issue in respect to ->read or ->write:
> >
> >   lock_page(page);
> >   spin_lock_irq(&mapping->tree_lock);
> >   lock_buffer(bh); // if page has buffer_head
> >   i_mmap_lock_read(mapping);
> >   vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> >       // page table lock for each entry
> >   }
>
> We can't take the page or mapping tree locks while we hold various
> filesystem locks.
>
> e.g. The IO path lock order is, in places:
>
>   inode->i_rwsem
>     get page from page cache
>       lock_page(page)
>         inode->allocation lock
>           zero page data
>
> Filesystems are allowed to do this, because the IO path has
> guaranteed them access to the page cache data on the page that is
> locked. Your ZONE_DEVICE proposal breaks this guarantee - we might
> have a locked page, but we don't have access to its data.
>
> Further, in various filesystems once the allocation lock is taken
> (e.g. the i_lock in XFS) we're not allowed to lock pages or the
> mapping tree as that leads to deadlocks with truncate, hole punch,
> etc. Hence if the "zero page data" operation occurs on a ZONE_DEVICE
> page that requires migration before the zeroing can occur, we can't
> perform migration here.
>
> Why are we even considering migration in situations where we already
> hold the ZONE_DEVICE page locked, hold other filesystem locks inside
> the page lock, and have an open dirty filesystem transaction as well?
>
> Even if migration is possible and succeeds, the struct page in the
> mapping tree for the file offset we are operating on is going to be
> different after migration. That implies we need to completely
> restart the operation. But given that we've already made changes,
> backing out at this point is ... complex and may not even be
> possible.

So I skimmed through the xfs code and I still think this is doable. So in
the above sequence:

  inode->i_rwsem
  page = find_get_page();
  if (device_unaddressable(page)) {
      page = migratepage();
  }
  ...

Now there are things like filemap_write_and_wait...() but those can be
handled by the bio bounce buffer like I said, ie at the block layer we
allocate temporary pages. Pages are already read-only on the device, as
the device obeys regular things like page_mkclean(), so the page content
is stable. The page migration uses buffer_migrate_page() and I don't see
any deadlock there. So I am not seeing any problem in doing the migration
early on, right after page lookup.

> i.e. we have an architectural assumption that page contents are
> always accessible when we have a locked struct page, and your
> proposal would appear to violate that assumption...

And it is: data might be in device memory, but you can use a bounce page
to access it, and you can write-protect it on the device so that it
doesn't change. Looking at xfs, it never does a kmap() directly, only
through some of the generic code, and thus there are places where we can
use bounce
Re: [PATCH RESEND v2 2/2] blk-mq: Avoid memory reclaim when remapping queues
On 12/06/2016 08:31 AM, Gabriel Krisman Bertazi wrote:
> While stressing memory and IO at the same time we changed SMT settings,
> we were able to consistently trigger deadlocks in the mm system, which
> froze the entire machine.
>
> I think that under memory stress conditions, the large allocations
> performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
> waiting on the block layer remapping completion, thus deadlocking the
> system. The trace below was collected after the machine stalled,
> waiting for the hotplug event completion.
>
> The simplest fix for this is to make allocations in this path
> non-reclaimable, with GFP_NOIO. With this patch, we couldn't hit the
> issue anymore.

This looks fine.

-- 
Jens Axboe
Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers
On 12/14/2016 03:31 AM, Bart Van Assche wrote: > On 12/13/2016 04:14 PM, Jens Axboe wrote: >> On 12/13/2016 06:56 AM, Bart Van Assche wrote: >>> On 12/08/2016 09:13 PM, Jens Axboe wrote: +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q, +struct blk_mq_alloc_data *data, +struct blk_mq_tags *tags, +atomic_t *wait_index) +{ >>> >>> Using the word "shadow" in the function name suggests to me that there >>> is a shadow request for every request and a request for every shadow >>> request. However, my understanding from the code is that there can be >>> requests without shadow requests (for e.g. a flush) and shadow requests >>> without requests. Shouldn't the name of this function reflect that, e.g. >>> by using "sched" or "elv" in the function name instead of "shadow"? >> >> Shadow might not be the best name. Most do have shadows though, it's >> only the rare exception like the flush, that you mention. I'll see if I >> can come up with a better name. > > Hello Jens, > > One aspect of this patch series that might turn out to be a maintenance > burden is the copying between original and shadow requests. It is easy > to overlook that rq_copy() has to be updated if a field would ever be > added to struct request. Additionally, having to allocate two requests > structures per I/O instead of one will have a runtime overhead. Do you > think the following approach would work? > - Instead of using two request structures per I/O, only use a single > request structure. > - Instead of storing one tag in the request structure, store two tags > in that structure. One tag comes from the I/O scheduler tag set > (size: nr_requests) and the other from the tag set associated with > the block driver (size: HBA queue depth). > - Only add a request to the hctx dispatch list after a block driver tag > has been assigned. This means that an I/O scheduler must keep a > request structure on a list it manages itself as long as no block > driver tag has been assigned. 
> - sysfs_list_show() is modified such that it shows both tags.

I have considered doing exactly that, but decided to go down the other
path. I may still revisit; it's not that I'm a huge fan of the shadow
requests and the necessary copying. We don't update the request that
often, so I don't think it's going to be a big maintenance burden. But
it'd be hard to claim that it's super pretty... I'll play with the idea.

--
Jens Axboe
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
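The two-tag scheme Bart proposes can be sketched in a few lines. This is a minimal userspace illustration, not the kernel code: the struct layout and the names `sched_tag`, `driver_tag`, and `NO_TAG` are assumptions for the example, chosen to mirror the proposal that a single request carries both a scheduler tag and a driver tag, and is only eligible for the hctx dispatch list once the driver tag is assigned.

```c
#include <stdbool.h>

#define NO_TAG (-1)

/* One request, two tags -- no shadow request and no field copying.
 * Field names are illustrative, not the actual kernel API. */
struct request {
	int sched_tag;   /* from the I/O scheduler tag set (nr_requests)  */
	int driver_tag;  /* from the driver's tag set (HBA queue depth)   */
};

/* Allocation only assigns a scheduler tag; the driver tag comes later. */
static void sched_alloc(struct request *rq, int tag)
{
	rq->sched_tag = tag;
	rq->driver_tag = NO_TAG;
}

/* A request may be moved to the hctx dispatch list only once a driver
 * tag has been assigned; until then the scheduler keeps it on its own
 * list. */
static bool can_dispatch(const struct request *rq)
{
	return rq->driver_tag != NO_TAG;
}
```

The upside over the shadow approach is visible here: there is nothing to copy between two request structures, at the cost of tracking two tag lifetimes in one request.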
Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
On 12/13/2016 06:39 PM, Gabriel Krisman Bertazi wrote:
> On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:
>> In blk_mq_map_swqueue, there is a memory optimization that frees the
>> tags of a queue that has gone unmapped. Later, if that hctx is
>> remapped after another topology change, the tags need to be
>> reallocated. If this allocation fails, a simple WARN_ON triggers, but
>> the block layer ends up with an active hctx without any corresponding
>> set of tags. Then, any incoming IO to that hctx can trigger an Oops.
>>
>> I can reproduce it consistently by running IO, flipping CPUs on and
>> off and eventually injecting a memory allocation failure in that path.
>>
>> In the fix below, if the system experiences a failed allocation of any
>> hctx's tags, we remap all the ctxs of that queue to hctx_0, which
>> should always keep its tags. There is a minor performance hit, since
>> our mapping just got worse after the error path, but this is the
>> simplest solution to handle this error path. The performance hit will
>> disappear after another successful remap.
>>
>> I considered dropping the memory optimization altogether, but it
>> seemed a bad trade-off to handle this very specific error case.
>>
>> This should apply cleanly on top of Jens' for-next branch.
>
> Hi,
>
> I saw this patchset missed the first PR for 4.10, though I'd really
> like to see it merged in this cycle, if possible. If you see anything
> particularly concerning about this that I missed, please let me know,
> but I fear this has been around for a while without much feedback.
>
> Thanks,

I want to repeat what Gabriel says. We are now seeing evidence of the
condition often. This really needs to get fixed soon.

Thanks,
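The fallback Gabriel describes can be sketched as follows. This is a simplified userspace model, not the actual blk-mq code: the mapping array, `alloc_tags()` stub, and queue sizes are all made up for illustration. The point it shows is the error-path policy: on any tag-allocation failure, every software queue (ctx) is remapped to hctx 0, which always keeps its tags.

```c
#include <stdbool.h>

#define NR_CTX  8	/* software queues, e.g. one per CPU  */
#define NR_HCTX 4	/* hardware queues                    */

static int ctx_to_hctx[NR_CTX];

/* Stand-in for tag (re)allocation; pretend it fails for hctx 2. */
static bool alloc_tags(int hctx)
{
	return hctx != 2;
}

/* Model of the remap step: normally spread ctxs across hctxs, but if
 * any hctx failed to get tags back, fall back to mapping everything to
 * hctx 0 so no active hctx is ever left without tags. */
static void map_swqueue(void)
{
	bool failed = false;
	int i;

	for (i = 0; i < NR_HCTX; i++)
		if (!alloc_tags(i))
			failed = true;

	for (i = 0; i < NR_CTX; i++)
		ctx_to_hctx[i] = failed ? 0 : i % NR_HCTX;
}
```

As the patch description notes, the degraded all-to-one mapping is temporary: the next successful remap restores the normal spread.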
Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
On Tue 13-12-16 16:24:33, Jerome Glisse wrote:
> On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
>> On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
>>> On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
>>>> On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
>>>>> I would like to discuss un-addressable device memory in the context
>>>>> of filesystem and block device. Specifically how to handle
>>>>> write-back, read, ... when a filesystem page is migrated to device
>>>>> memory that the CPU can not access.
>>>>
>>>> You mean pmem that is DAX-capable that suddenly, without warning,
>>>> becomes non-DAX capable?
>>>>
>>>> If you are not talking about pmem and DAX, then exactly what does
>>>> "when a filesystem page is migrated to device memory that CPU can
>>>> not access" mean? What "filesystem page" are we talking about that
>>>> can get migrated from main RAM to something the CPU can't access?
>>>
>>> I am talking about GPU, FPGA, ... any PCIe device that has fast
>>> on-board memory that can not be exposed transparently to the CPU. I
>>> am reusing ZONE_DEVICE for this; you can see the HMM patchset on
>>> linux-mm: https://lwn.net/Articles/706856/
>>
>> So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
>
> Well, not only a target, it can be a source too. The device can read
> and write any system memory and DMA to/from that memory to its
> on-board memory.
>
>>> So in my case I am only considering non-DAX/PMEM filesystems, ie any
>>> "regular" filesystem backed by a "regular" block device. I want to be
>>> able to migrate mmaped areas of such a filesystem to device memory
>>> while the device is actively using that memory.
>>
>> "migrate mmapped area of such filesystem" means what, exactly?
> fd = open("/path/to/some/file")
> ptr = mmap(fd, ...);
> gpu_compute_something(ptr);
>
>> Are you talking about file data contents that have been copied into
>> the page cache and mmapped into a user process address space?
>> IOWs, migrating ZONE_NORMAL page cache page content and state
>> to a new ZONE_DEVICE page, and then migrating back again somehow?
>
> Take any existing application that mmaps a file and allow migration of
> chunks of that mmaped file to device memory without the application
> even knowing about it. So nothing special in respect to that mmaped
> file. It is a regular file on your filesystem.

OK, so I share most of Dave's concerns about this. But let's talk about
what we can do and what you need, and we may find something usable.
First let me understand what is doable and what the costs are on your
side.

So we have a page cache page that you'd like to migrate to the device.
Fine. You are willing to sacrifice direct IO - even better. We can fall
back to buffered IO in that case (well, except for XFS which does not
do it, but that's a minor detail). One thing I'm not sure about: when a
page is migrated to the device, are its contents available and just
possibly stale, or will something bad happen if we try to access (or
even modify) the page data?

And by migration you really mean page migration? Be aware that
migration of pagecache pages may be a problem for some pages of some
filesystems on its own - e.g. page migration may fail because there is
an outstanding filesystem transaction modifying that page. For
userspace these will be really hard-to-understand sporadic errors,
because it's really a filesystem-internal thing. So far page migration
was widely used only for free space defragmentation, and for that
purpose, if a page is not migratable for a minute, who cares. So won't
it be easier to leave the pagecache page where it is and *copy* it to
the device?
Can the device notify us *before* it is going to modify a page, not
just after it has modified it? Possibly, if we just give it the page
read-only and it will have to ask the CPU to get write permission? If
yes, then I believe this could work and even fs support should be
doable.

Honza
--
Jan Kara
SUSE Labs, CR
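The write-permission handshake Jan is asking about can be modeled as a tiny state machine. Everything here is hypothetical illustration - the names `dev_page`, `grant_read`, and `request_write` are invented for the sketch - but it captures the protocol: the device receives the page read-only, and must upcall to the CPU side before its first write, giving the filesystem a chance to prepare (mark the page dirty, wait on transactions, etc.) before any modification happens.

```c
#include <stdbool.h>

enum dev_access { DEV_NONE, DEV_RO, DEV_RW };

/* Per-page state as seen from the CPU side. Illustrative only. */
struct dev_page {
	enum dev_access access;
	bool fs_prepared;	/* fs was notified before the first write */
};

/* CPU grants the device read-only access to the page's contents. */
static void grant_read(struct dev_page *p)
{
	p->access = DEV_RO;
}

/* Device-side write attempt: it must ask the CPU first, so the fs can
 * act *before* the page is modified - the property Jan wants. */
static bool request_write(struct dev_page *p)
{
	if (p->access == DEV_NONE)
		return false;		/* no mapping granted at all */
	p->fs_prepared = true;		/* fs hooks run here, pre-write */
	p->access = DEV_RW;
	return true;
}
```

With this ordering, writeback never races with an unannounced device write: the filesystem always hears about the page before its contents change.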
Re: [PATCH] block_dev: don't update file access position for sync direct IO
On Tue, Dec 13, 2016 at 09:08:18PM -0700, Jens Axboe wrote:
> That's not great... Thanks, added.

Ooops, yeah - we're still going through ->direct_IO for block devices.
I'll take a stab at removing that, as it's just a pointless indirect
call.
Re: [PATCH 6/6] media/cobalt: use pci_irq_allocate_vectors
On 14/12/16 11:29, Christoph Hellwig wrote:
> Hi Hans,
>
> just checked the current Linux tree and cobalt still uses the old
> pci_enable_msi_range call. Did you queue this patch up for 4.10?

Completely forgot this. Is it OK to queue it for 4.11? Or is it
blocking other follow-up work you want to do for 4.10?

Regards,

	Hans
Re: [PATCH 6/6] media/cobalt: use pci_irq_allocate_vectors
Hi Hans,

just checked the current Linux tree and cobalt still uses the old
pci_enable_msi_range call. Did you queue this patch up for 4.10?
Re: [PATCH 6/7] mq-deadline: add blk-mq adaptation of the deadline IO scheduler
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static inline bool dd_rq_is_shadow(struct request *rq)
> +{
> +	return rq->rq_flags & RQF_ALLOCED;
> +}

Hello Jens,

Something minor: because req_flags_t has been defined using __bitwise
(typedef __u32 __bitwise req_flags_t), sparse complains for the above
function about converting req_flags_t into bool. How about changing the
body of that function into "return (rq->rq_flags & RQF_ALLOCED) != 0"
to keep sparse happy?

Bart.
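The sparse-friendly form Bart suggests looks like this in isolation. This is a plain userspace rendering: `__bitwise` only has effect under sparse, so it is omitted here, and the value of `RQF_ALLOCED` is made up for the example. The point is that comparing the masked value against zero makes the boolean conversion explicit, instead of implicitly narrowing a `__bitwise` type to `bool`.

```c
#include <stdbool.h>

/* In the kernel this is: typedef __u32 __bitwise req_flags_t;
 * the __bitwise annotation is a sparse-only attribute. */
typedef unsigned int req_flags_t;

/* Bit position is invented for this sketch. */
#define RQF_ALLOCED ((req_flags_t)(1u << 5))

struct request {
	req_flags_t rq_flags;
};

static inline bool dd_rq_is_shadow(struct request *rq)
{
	/* "!= 0" keeps sparse quiet: the result of the comparison is a
	 * plain boolean, not a truncated __bitwise value. */
	return (rq->rq_flags & RQF_ALLOCED) != 0;
}
```

The generated code is identical either way; the change only affects what sparse's type checker sees.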