Re: [PATCH 1/4] block: add scalable completion tracking of requests
On Tue, Nov 01 2016, Johannes Thumshirn wrote:
> On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> > For legacy block, we simply track them in the request queue. For
> > blk-mq, we track them on a per-sw queue basis, which we can then
> > sum up through the hardware queues and finally to a per device
> > state.
> >
> > The stats are tracked in, roughly, 0.1s interval windows.
> >
> > Add sysfs files to display the stats.
> >
> > Signed-off-by: Jens Axboe
> > ---
> > [...]
> >
> > 	/* incremented at completion time */
> > 	unsigned long cacheline_aligned_in_smp rq_completed[2];
> > +	struct blk_rq_stat	stat[2];
>
> Can you add an enum or define for the directions? Just 0 and 1 aren't very
> intuitive.

Good point, I added that and updated both the stats patch and the
subsequent blk-mq poll code.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
> > Avoid to access .bi_vcnt directly, because it may be not what
> > the driver expected any more after supporting multipage bvec.
> >
> > Signed-off-by: Ming Lei
>
> It would be really nice to have a comment in the code why it's
> even checking for multiple segments.

Or ideally refactor the code to not care about multiple segments at all.
Re: [PATCH 28/60] block: introduce QUEUE_FLAG_SPLIT_MP
On Mon, Oct 31, 2016 at 08:39:15AM -0700, Christoph Hellwig wrote:
> On Sat, Oct 29, 2016 at 04:08:27PM +0800, Ming Lei wrote:
> > Some drivers (such as dm) should be capable of dealing with multipage
> > bvec, but the incoming bio may be too big, such as, a new singlepage bvec
> > bio can't be cloned from the bio, or can't be allocated to singlepage
> > bvec with same size.
> >
> > At least crypt dm, log writes and bcache have this kind of issue.
>
> We already have the segment_size limitation for request based drivers.
> I'd rather extend it to bio drivers if really needed.
>
> But then again we should look into not having this limitation. E.g.
> for bcache I'd be really surprised if it's that limited, given that
> Kent came up with this whole multipage bvec scheme.

AFAIK the only issue is with drivers that may have to bounce bios - pages
that were contiguous in the original bio won't necessarily be contiguous
in the bounced bio, thus bouncing might require more than BIO_MAX_SEGMENTS
bvecs. I don't know what Ming's referring to by "singlepage bvec bios".

Anyways, bouncing comes up in multiple places, so we probably need to come
up with a generic solution for that.

Other than that, there shouldn't be any issues or limitations - if you're
not bouncing, there's no need to clone the bvecs.
Re: [PATCH 45/60] block: bio: introduce bio_for_each_segment_all_rd() and its write pair
On Mon, Oct 31, 2016 at 08:11:23AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 31, 2016 at 09:59:43AM -0400, Theodore Ts'o wrote:
> > What is _rd and _wt supposed to stand for?
>
> I think it's read and write, but I think the naming is highly
> unfortunate. I started dabbling around with the patches a bit,
> and to keep my sanity I started renaming it to _pages and _bvec,
> which is the real semantics - the _rd or _pages gives you a synthetic
> bvec for each page, and the other one gives you the full bvec.

My original naming was bio_for_each_segment() and bio_for_each_page().
Re: [PATCH v5 07/14] blk-mq: Introduce blk_mq_quiesce_queue()
On Wed, Nov 2, 2016 at 12:02 AM, Sagi Grimberg wrote:
> Reviewed-by: Sagi Grimberg

Reviewed-by: Ming Lei

-- 
Ming Lei
Re: [PATCH 45/60] block: bio: introduce bio_for_each_segment_all_rd() and its write pair
On Tue, Nov 1, 2016 at 10:17 PM, Theodore Ts'o wrote:
> On Tue, Nov 01, 2016 at 07:51:27AM +0800, Ming Lei wrote:
>> Sorry for forgetting to mention one important point:
>>
>> - after multipage bvec is introduced, the iterated bvec pointer
>> still points to a single page bvec, which is generated in-flight
>> and is read-only actually. That is the motivation behind the
>> introduction of bio_for_each_segment_all_rd().
>>
>> So maybe bio_for_each_page_all_ro() is better?
>>
>> For _wt(), we still can keep it as bio_for_each_segment(), which also
>> reflects that now the iterated bvec points to one whole segment if
>> we name _rd as bio_for_each_page_all_ro().
>
> I'm agnostic as to what the right names are --- my big concern is
> there is an explosion of bio_for_each_page_* functions, and that there

There aren't many users of bio_for_each_segment_all(), see:

[ming@linux-2.6]$ git grep -n bio_for_each_segment_all ./fs/ | wc -l
23

I guess there isn't an excuse not to switch to that after this patchset.

From the view of the API, bio_for_each_segment_all() is ugly and exposes
the bvec table to users, and the main reason we keep it is that it can
avoid one bvec copy in each loop. It can be replaced easily by
bio_for_each_segment().

> isn't good documentation about (a) when to use each of these
> functions, and (b) why. I was going through the patch series, and it
> was hard for me to figure out why, and I was looking through all of
> the patches. Once all of the patches are merged in, I am concerned
> this is going to be a massive trapdoor that will snare a large number of
> unwitting developers.
I understand your concern, and let me explain the whole story a bit:

1) in the current linus tree, we have the following two bio iterator
helpers, for which we still don't provide any documentation:

	bio_for_each_segment(bvl, bio, iter)
	bio_for_each_segment_all(bvl, bio, i)

- the former is used to traverse each 'segment' in the bio range
described by the 'iter' (just like [start, size]); the latter is used
to traverse each 'segment' in the whole bio, so there isn't an 'iter'
passed in.

- in the former helper, typeof('bvl') is 'struct bvec', and the
'segment' is copied to 'bvl'; in the latter helper, typeof('bvl') is
'struct bvec *', and it just points to one bvec directly in the
table (bio->bi_io_vec), one by one.

- we can use the former helper to implement the latter easily and
provide a more friendly interface, and the main reason we keep it is
that _all can avoid the bvec copy in each loop, so it might be a bit
more efficient.

- even though 'segment' is used in the helpers' names, each 'bvl' in
these helpers just describes one single page, so actually they should
have been named as the following:

	bio_for_each_page(bvl, bio, iter)
	bio_for_each_page_all(bvl, bio, i)

2) this patchset introduces multipage bvec, which will store one real
segment in each 'bvec' of the table (bio->bi_io_vec), and one segment
may include more than one page

- bio_for_each_segment() is kept as the current interface to retrieve
one page in each 'bvl', that is just for making current users happy,
and it will be replaced with bio_for_each_page() finally, which should
be a follow-up work of this patchset

- the story of the introduction of bio_for_each_segment_all_rd(bvl, bio, i):
we can't simply make 'bvl' point to each bvec in the table directly any
more, because now each bvec in the table stores one real segment
instead of one page. So in this patchset the _rd() is implemented on
top of bio_for_each_segment(), and we can't change/write to the bvec
in the table any more via the pointer 'bvl' with this helper.
>
> As far as my preference, from an abstract perspective, if one version
> (the read-write variant, I presume) is always safe, while one (the
> read-only variant) is faster, if you can work under restricted
> circumstances, naming the safe version so it is the "default", and
> more dangerous one with the name that makes it a bit more obvious what
> you have to do in order to use it safely, and then very clearly
> document both in sources, and in the Documentation directory, what the
> issues are and what you have to do in order to use the faster version.

I will add detailed documents about these helpers in next version:

- bio_for_each_segment()
- bio_for_each_segment_all()
- bio_for_each_page_all_ro() (renamed from bio_for_each_segment_all_rd())

Thanks,
Ming

>
> Cheers,
>
> - Ted
Re: [PATCH 1/4] block: add scalable completion tracking of requests
On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.
>
> The stats are tracked in, roughly, 0.1s interval windows.
>
> Add sysfs files to display the stats.
>
> Signed-off-by: Jens Axboe
> ---
[...]
>
> 	/* incremented at completion time */
> 	unsigned long cacheline_aligned_in_smp rq_completed[2];
> +	struct blk_rq_stat	stat[2];

Can you add an enum or define for the directions? Just 0 and 1 aren't very
intuitive.

Johannes

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH 5/8] block: add code to track actual device queue depth
For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting. On
top of that, it's generally larger than the hardware setting on
purpose, to allow backup of requests for merging.

Fill a hole in struct request_queue with a 'queue_depth' member, that
drivers can call to more closely inform the block layer of the real
queue depth.

Signed-off-by: Jens Axboe
---
 block/blk-settings.c   | 12
 drivers/scsi/scsi.c    | 3 +++
 include/linux/blkdev.h | 11 +++
 3 files changed, 26 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 55369a65dea2..9cf053759363 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -837,6 +837,18 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
 EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 
 /**
+ * blk_set_queue_depth - tell the block layer about the device queue depth
+ * @q:		the request queue for the device
+ * @depth:	queue depth
+ *
+ */
+void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
+{
+	q->queue_depth = depth;
+}
+EXPORT_SYMBOL(blk_set_queue_depth);
+
+/**
  * blk_queue_write_cache - configure queue's write cache
  * @q:		the request queue for the device
  * @wc:		write back cache on or off
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 1deb6adc411f..75455d4dab68 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
 		wmb();
 	}
 
+	if (sdev->request_queue)
+		blk_set_queue_depth(sdev->request_queue, depth);
+
 	return sdev->queue_depth;
 }
 EXPORT_SYMBOL(scsi_change_queue_depth);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8396da2bb698..0c677fb35ce4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -405,6 +405,8 @@ struct request_queue {
 	struct blk_mq_ctx __percpu	*queue_ctx;
 	unsigned int		nr_queues;
 
+	unsigned int		queue_depth;
+
 	/* hw dispatch queues */
 	struct blk_mq_hw_ctx	**queue_hw_ctx;
 	unsigned int		nr_hw_queues;
@@ -777,6 +779,14 @@ static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
 	return false;
 }
 
+static inline unsigned int blk_queue_depth(struct request_queue *q)
+{
+	if (q->queue_depth)
+		return q->queue_depth;
+
+	return q->nr_requests;
+}
+
 /*
  * q->prep_rq_fn return values
  */
@@ -1093,6 +1103,7 @@ extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
 extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
 extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
+extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
 extern void blk_set_default_limits(struct queue_limits *lim);
 extern void blk_set_stacking_limits(struct queue_limits *lim);
 extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
-- 
2.7.4
[PATCH 3/8] writeback: mark background writeback as such
If we're doing background type writes, then use the appropriate
background write flags for that.

Signed-off-by: Jens Axboe
---
 include/linux/writeback.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 50c96ee8108f..c78f9f0920b5 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -107,6 +107,8 @@ static inline int wbc_to_write_flags(struct writeback_control *wbc)
 {
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		return REQ_SYNC;
+	else if (wbc->for_kupdate || wbc->for_background)
+		return REQ_BACKGROUND;
 
 	return 0;
 }
-- 
2.7.4
[PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()
Note in the bdi_writeback structure whenever a task ends up sleeping
waiting for progress. We can use that information in the lower layers
to increase the priority of writes.

Signed-off-by: Jens Axboe
---
 include/linux/backing-dev-defs.h | 2 ++
 mm/backing-dev.c                 | 1 +
 mm/page-writeback.c              | 1 +
 3 files changed, 4 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index c357f27d5483..dc5f76d7f648 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 
+	unsigned long dirty_sleep;	/* last wait */
+
 	struct list_head bdi_node;	/* anchored at bdi->wb_list */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8fde443f36d7..3bfed5ab2475 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -310,6 +310,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	spin_lock_init(&wb->work_lock);
 	INIT_LIST_HEAD(&wb->work_list);
 	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+	wb->dirty_sleep = jiffies;
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
 	if (!wb->congested)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 439cc63ad903..52e2f8e3b472 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1778,6 +1778,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 					pause, start_time);
 		__set_current_state(TASK_KILLABLE);
+		wb->dirty_sleep = now;
 		io_schedule_timeout(pause);
 
 		current->dirty_paused_when = now + pause;
-- 
2.7.4
[PATCH 1/8] block: add WRITE_BACKGROUND
This adds a new request flag, REQ_BACKGROUND, that callers can use to
tell the block layer that this is background (non-urgent) IO.

Signed-off-by: Jens Axboe
---
 include/linux/blk_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bb921028e7c5..562ac46cb790 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -177,6 +177,7 @@ enum req_flag_bits {
 	__REQ_FUA,		/* forced unit access */
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
+	__REQ_BACKGROUND,	/* background IO */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -192,6 +193,7 @@ enum req_flag_bits {
 #define REQ_FUA			(1ULL << __REQ_FUA)
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
+#define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.7.4
[PATCHSET] Throttled buffered writeback
I have addressed the (small) review comments from Christoph, and
rebased it on top of for-4.10/block, since that now has the flag
unification and fs side cleanups as well. This impacted the prep
patches, and the wbt code.

I'd really like to get this merged for 4.10. It's block specific at
this point, and defaults to just being enabled for blk-mq managed
devices. Let me know if there are any objections.
[PATCH 6/8] block: add scalable completion tracking of requests
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe
---
 block/Makefile            |   2 +-
 block/blk-core.c          |   4 +
 block/blk-mq-sysfs.c      |  47 ++
 block/blk-mq.c            |  14 +++
 block/blk-mq.h            |   3 +
 block/blk-stat.c          | 226 ++
 block/blk-stat.h          |  37
 block/blk-sysfs.c         |  26 ++
 include/linux/blk_types.h |  16
 include/linux/blkdev.h    |   4 +
 10 files changed, 378 insertions(+), 1 deletion(-)
 create mode 100644 block/blk-stat.c
 create mode 100644 block/blk-stat.h

diff --git a/block/Makefile b/block/Makefile
index 934dac73fb37..2528c596f7ec 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			blk-lib.o blk-mq.o blk-mq-tag.o \
+			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
 			blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 0bfaa54d3e9f..ca77c725b4e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
 {
 	blk_dequeue_request(req);
 
+	blk_stat_set_issue_time(&req->issue_stat);
+
 	/*
 	 * We are now handing the request to the hardware, initialize
 	 * resid_len to full count and add the timeout handler.
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 
 	trace_block_rq_complete(req->q, req, nr_bytes);
 
+	blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
+
 	if (!req->bio)
 		return false;
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 01fb455d3377..633c79a538ea 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_ctx *ctx;
+	unsigned int i;
+
+	hctx_for_each_ctx(hctx, ctx, i) {
+		blk_stat_init(&ctx->stat[0]);
+		blk_stat_init(&ctx->stat[1]);
+	}
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
+					  const char *page, size_t count)
+{
+	blk_mq_stat_clear(hctx);
+	return count;
+}
+
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+			pre, (long long) stat->nr_samples,
+			(long long) stat->mean, (long long) stat->min,
+			(long long) stat->max);
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+	struct blk_rq_stat stat[2];
+	ssize_t ret;
+
+	blk_stat_init(&stat[0]);
+	blk_stat_init(&stat[1]);
+
+	blk_hctx_stat_get(hctx, stat);
+
+	ret = print_stat(page, &stat[0], "read :");
+	ret += print_stat(page + ret, &stat[1], "write:");
+	return ret;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
 	.attr = {.name = "dispatched", .mode = S_IRUGO },
 	.show = blk_mq_sysfs_dispatched_show,
@@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
 	.show = blk_mq_hw_sysfs_poll_show,
 	.store = blk_mq_hw_sysfs_poll_store,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
+	.attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_mq_hw_sysfs_stat_show,
+	.store = blk_mq_hw_sysfs_stat_store,
+};
 
 static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_queued.attr,
@@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_cpus.attr,
 	&blk_mq_hw_sysfs_active.attr,
 	&blk_mq_hw_sysfs_poll.attr,
+	&blk_mq_hw_sysfs_stat.attr,
 	NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2da1a0ee3318..4555a76d22a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -30,6 +30,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-stat.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -376,10 +377,19 @@ static void
[PATCH 2/8] writeback: add wbc_to_write_flags()
Add wbc_to_write_flags(), which returns the write modifier flags to use, based on a struct writeback_control. No functional changes in this patch, but it prepares us for factoring other wbc fields for write type. Signed-off-by: Jens AxboeReviewed-by: Jan Kara --- fs/buffer.c | 2 +- fs/f2fs/data.c| 2 +- fs/f2fs/node.c| 2 +- fs/gfs2/meta_io.c | 3 +-- fs/mpage.c| 2 +- fs/xfs/xfs_aops.c | 8 ++-- include/linux/writeback.h | 9 + mm/page_io.c | 5 + 8 files changed, 17 insertions(+), 16 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index bc7c2bb30a9b..af5776da814a 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1697,7 +1697,7 @@ int __block_write_full_page(struct inode *inode, struct page *page, struct buffer_head *bh, *head; unsigned int blocksize, bbits; int nr_underway = 0; - int write_flags = (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0); + int write_flags = wbc_to_write_flags(wbc); head = create_page_buffers(page, inode, (1 << BH_Dirty)|(1 << BH_Uptodate)); diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index b80bf10603d7..9e5561fa4cb6 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -1249,7 +1249,7 @@ static int f2fs_write_data_page(struct page *page, .sbi = sbi, .type = DATA, .op = REQ_OP_WRITE, - .op_flags = (wbc->sync_mode == WB_SYNC_ALL) ? REQ_SYNC : 0, + .op_flags = wbc_to_write_flags(wbc), .page = page, .encrypted_page = NULL, }; diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 932f3f8bb57b..d1e29deb4598 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -1570,7 +1570,7 @@ static int f2fs_write_node_page(struct page *page, .sbi = sbi, .type = NODE, .op = REQ_OP_WRITE, - .op_flags = (wbc->sync_mode == WB_SYNC_ALL) ? 
REQ_SYNC : 0, + .op_flags = wbc_to_write_flags(wbc), .page = page, .encrypted_page = NULL, }; diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c index e562b1191c9c..49db8ef13fdf 100644 --- a/fs/gfs2/meta_io.c +++ b/fs/gfs2/meta_io.c @@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb { struct buffer_head *bh, *head; int nr_underway = 0; - int write_flags = REQ_META | REQ_PRIO | - (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0); + int write_flags = REQ_META | REQ_PRIO | wbc_to_write_flags(wbc); BUG_ON(!PageLocked(page)); BUG_ON(!page_has_buffers(page)); diff --git a/fs/mpage.c b/fs/mpage.c index f35e2819d0c6..98fc11aa7e0b 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -489,7 +489,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, struct buffer_head map_bh; loff_t i_size = i_size_read(inode); int ret = 0; - int op_flags = (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0); + int op_flags = wbc_to_write_flags(wbc); if (page_has_buffers(page)) { struct buffer_head *head = page_buffers(page); diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 594e02c485b2..6be5204a06d3 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -495,9 +495,7 @@ xfs_submit_ioend( ioend->io_bio->bi_private = ioend; ioend->io_bio->bi_end_io = xfs_end_bio; - ioend->io_bio->bi_opf = REQ_OP_WRITE; - if (wbc->sync_mode == WB_SYNC_ALL) - ioend->io_bio->bi_opf |= REQ_SYNC; + ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc); /* * If we are failing the IO now, just mark the ioend with an @@ -569,9 +567,7 @@ xfs_chain_bio( bio_chain(ioend->io_bio, new); bio_get(ioend->io_bio); /* for xfs_destroy_ioend */ - ioend->io_bio->bi_opf = REQ_OP_WRITE; - if (wbc->sync_mode == WB_SYNC_ALL) - ioend->io_bio->bi_opf |= REQ_SYNC; + ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc); submit_bio(ioend->io_bio); ioend->io_bio = new; } diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 
e4c38703bf4e..50c96ee8108f 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -9,6 +9,7 @@ #include #include #include +#include struct bio; @@ -102,6 +103,14 @@ struct writeback_control { #endif }; +static inline int wbc_to_write_flags(struct writeback_control *wbc) +{ + if (wbc->sync_mode == WB_SYNC_ALL) + return REQ_SYNC; + + return 0; +} + /* * A wb_domain represents a domain that wb's (bdi_writeback's) belong to * and are measured against each other in. There always is one global diff --git a/mm/page_io.c b/mm/page_io.c index
[PATCH 8/8] block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot smoother, with
way less impact on other system activity. Background writeback should
be, by definition, background activity. The fact that we flush huge
bundles of it at a time means that it potentially has heavy impacts on
foreground workloads, which isn't ideal. We can't easily limit the
sizes of writes that we do, since that would impact file system layout
in the presence of delayed allocation. So just throttle back buffered
writeback, unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration from the CoDel
networking scheduling algorithm. Like CoDel, blk-wb monitors the
minimum latencies of requests over a window of time. In that window of
time, if the minimum latency of any request exceeds a given target,
then a scale count is incremented and the queue depth is shrunk. The
next monitoring window is shrunk accordingly. Unlike CoDel, if we hit
a window that exhibits good behavior, then we simply decrement the
scale count and re-calculate the limits for that scale value. This
prevents us from oscillating between a close-to-ideal value and max
all the time, instead remaining in the windows where we get good
behavior.

Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive scale
counts, this doesn't change the size of the monitoring window. When
the heavy writers finish, blk-wb quickly snaps back to its stable
state of a zero scale count.

The patch registers two sysfs entries. The first one, 'wb_window_usec',
defines the window of monitoring. The second one, 'wb_lat_usec', sets
the latency target for the window. It defaults to 2 msec for
non-rotational storage, and 75 msec for rotational storage. Setting
this value to '0' disables blk-wb. Generally, a user would not have to
touch these settings.

We don't enable WBT on devices that are managed with CFQ, and have a
non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.

Signed-off-by: Jens Axboe
---
 Documentation/block/queue-sysfs.txt |  13
 block/Kconfig                       |  24 +++
 block/blk-core.c                    |  18 -
 block/blk-mq.c                      |  27 +++-
 block/blk-settings.c                |   4 ++
 block/blk-sysfs.c                   | 134
 block/cfq-iosched.c                 |  14
 include/linux/blkdev.h              |   3 +
 8 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2a3904030dea..2847219ebd8c 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -169,5 +169,18 @@ This is the number of bytes the device can write in a single write-same
 command. A value of '0' means write-same is not supported by this
 device.
 
+wb_lat_usec (RW)
+----------------
+If the device is registered for writeback throttling, then this file shows
+the target minimum read latency. If this latency is exceeded in a given
+window of time (see wb_window_usec), then the writeback throttling will start
+scaling back writes.
+
+wb_window_usec (RW)
+-------------------
+If the device is registered for writeback throttling, then this file shows
+the value of the monitoring window in which we'll look at the target
+latency. See wb_lat_usec.
+
 Jens Axboe, February 2009
diff --git a/block/Kconfig b/block/Kconfig
index 6b0ad08f0677..9f5d4dd7d751 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -120,6 +120,30 @@ config BLK_CMDLINE_PARSER
 	See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_WBT
+	bool "Enable support for block device writeback throttling"
+	default n
+	---help---
+	Enabling this option enables the block layer to throttle buffered
+	writeback from the VM, making it more smooth and having less
+	impact on foreground operations.
+
+config BLK_WBT_SQ
+	bool "Single queue writeback throttling"
+	default n
+	depends on BLK_WBT
+	---help---
+	Enable writeback throttling by default on legacy single queue devices
+
+config BLK_WBT_MQ
+	bool "Multiqueue writeback throttling"
+	default y
+	depends on BLK_WBT
+	---help---
+	Enable writeback throttling by default on multiqueue devices.
+	Multiqueue currently doesn't have support for IO scheduling,
+	enabling this option is recommended.
+
 menu "Partition Types"
 
 source "block/partitions/Kconfig"
diff --git a/block/blk-core.c b/block/blk-core.c
index ca77c725b4e5..c68e92acf21a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 #include "blk.h"
 #include
[PATCH 7/8] blk-wbt: add general throttling mechanism
We can hook this up to the block layer, to help throttle buffered writes. wbt registers a few trace points that can be used to track what is happening in the system: wbt_lat: 259:0: latency 2446318 wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1, wmean=518866, wmin=15522, wmax=5330353, wsamples=57 wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32 This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat dumps the current read/write stats for that window, and wbt_step shows a step down event where we now scale back writes. Each trace includes the device, 259:0 in this case. Signed-off-by: Jens Axboe--- block/Makefile | 1 + block/blk-wbt.c| 704 + block/blk-wbt.h| 166 +++ include/trace/events/wbt.h | 153 ++ 4 files changed, 1024 insertions(+) create mode 100644 block/blk-wbt.c create mode 100644 block/blk-wbt.h create mode 100644 include/trace/events/wbt.h diff --git a/block/Makefile b/block/Makefile index 2528c596f7ec..a827f988c4e6 100644 --- a/block/Makefile +++ b/block/Makefile @@ -24,3 +24,4 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o obj-$(CONFIG_BLK_MQ_PCI) += blk-mq-pci.o obj-$(CONFIG_BLK_DEV_ZONED)+= blk-zoned.o +obj-$(CONFIG_BLK_WBT) += blk-wbt.o diff --git a/block/blk-wbt.c b/block/blk-wbt.c new file mode 100644 index ..1b1d67aae1d3 --- /dev/null +++ b/block/blk-wbt.c @@ -0,0 +1,704 @@ +/* + * buffered writeback throttling. loosely based on CoDel. We can't drop + * packets for IO scheduling, so the logic is something like this: + * + * - Monitor latencies in a defined window of time. + * - If the minimum latency in the above window exceeds some target, increment + * scaling step and scale down queue depth by a factor of 2x. The monitoring + * window is then shrunk to 100 / sqrt(scaling step + 1). 
+ * - For any window where we don't have solid data on what the latencies + * look like, retain status quo. + * - If latencies look good, decrement scaling step. + * - If we're only doing writes, allow the scaling step to go negative. This + * will temporarily boost write performance, snapping back to a stable + * scaling step of 0 if reads show up or the heavy writers finish. Unlike + * positive scaling steps where we shrink the monitoring window, a negative + * scaling step retains the default step==0 window size. + * + * Copyright (C) 2016 Jens Axboe + * + */ +#include +#include +#include +#include +#include + +#include "blk-wbt.h" + +#define CREATE_TRACE_POINTS +#include + +enum { + /* +* Default setting, we'll scale up (to 75% of QD max) or down (min 1) +* from here depending on device stats +*/ + RWB_DEF_DEPTH = 16, + + /* +* 100msec window +*/ + RWB_WINDOW_NSEC = 100 * 1000 * 1000ULL, + + /* +* Disregard stats, if we don't meet this minimum +*/ + RWB_MIN_WRITE_SAMPLES = 3, + + /* +* If we have this number of consecutive windows with not enough +* information to scale up or down, scale up. +*/ + RWB_UNKNOWN_BUMP= 5, +}; + +static inline bool rwb_enabled(struct rq_wb *rwb) +{ + return rwb && rwb->wb_normal != 0; +} + +/* + * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded, + * false if 'v' + 1 would be bigger than 'below'. + */ +static bool atomic_inc_below(atomic_t *v, int below) +{ + int cur = atomic_read(v); + + for (;;) { + int old; + + if (cur >= below) + return false; + old = atomic_cmpxchg(v, cur, cur + 1); + if (old == cur) + break; + cur = old; + } + + return true; +} + +static void wb_timestamp(struct rq_wb *rwb, unsigned long *var) +{ + if (rwb_enabled(rwb)) { + const unsigned long cur = jiffies; + + if (cur != *var) + *var = cur; + } +} + +/* + * If a task was rate throttled in balance_dirty_pages() within the last + * second or so, use that to indicate a higher cleaning rate. 
+ */ +static bool wb_recent_wait(struct rq_wb *rwb) +{ + struct bdi_writeback *wb = &rwb->bdi->wb; + + return time_before(jiffies, wb->dirty_sleep + HZ); +} + +static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd) +{ + return &rwb->rq_wait[is_kswapd]; +} + +static void rwb_wake_all(struct rq_wb *rwb) +{ + int i; + + for (i = 0; i < WBT_NUM_RWQ; i++) { + struct rq_wait *rqw = &rwb->rq_wait[i]; + + if (waitqueue_active(&rqw->wait)) + wake_up_all(&rqw->wait); + } +} + +void __wbt_done(struct rq_wb *rwb, enum
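The step/window rules quoted at the top of blk-wbt.c can be modeled in a small userspace sketch. This is a hypothetical illustration of the described behavior (halve the depth and shrink the window on each positive scaling step, keep the default window for negative steps), not the kernel implementation; all names are invented, and an integer square root stands in for the kernel's int_sqrt():

```c
#include <assert.h>

#define RWB_DEF_DEPTH   16   /* default queue depth, per the patch */
#define RWB_WINDOW_MSEC 100  /* default 100msec monitoring window */

struct wbt_state {
	int scale_step;        /* > 0: throttling writes; < 0: write boost */
	unsigned int depth;    /* currently permitted write queue depth */
	unsigned int win_msec; /* current monitoring window */
};

/* integer sqrt, standing in for the kernel's int_sqrt() */
static unsigned int wbt_isqrt(unsigned int x)
{
	unsigned int r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

static void wbt_update_limits(struct wbt_state *s)
{
	if (s->scale_step > 0) {
		/*
		 * Each positive step halves the depth (min 1) and shrinks
		 * the window to 100 / sqrt(step + 1) msec.
		 */
		unsigned int d = RWB_DEF_DEPTH >> s->scale_step;

		s->depth = d ? d : 1;
		s->win_msec = RWB_WINDOW_MSEC * 100 /
			      wbt_isqrt((s->scale_step + 1) * 10000);
	} else {
		/* negative steps boost depth but retain the default window */
		s->depth = RWB_DEF_DEPTH << -s->scale_step;
		s->win_msec = RWB_WINDOW_MSEC;
	}
}

/* min latency in the window exceeded the target: throttle harder */
static void wbt_scale_down(struct wbt_state *s)
{
	s->scale_step++;
	wbt_update_limits(s);
}

/* latencies look good: back off; only go negative if we're write-only */
static void wbt_scale_up(struct wbt_state *s, int writes_only)
{
	if (s->scale_step > 0 || writes_only)
		s->scale_step--;
	wbt_update_limits(s);
}
```

For example, one scale-down from the defaults gives depth 8 and a roughly 70msec window, matching the 100 / sqrt(2) shrink described above.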
[PATCH 1/4] block: add scalable completion tracking of requests
For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. Signed-off-by: Jens Axboe--- block/Makefile| 2 +- block/blk-core.c | 4 + block/blk-mq-sysfs.c | 47 ++ block/blk-mq.c| 14 +++ block/blk-mq.h| 3 + block/blk-stat.c | 226 ++ block/blk-stat.h | 37 block/blk-sysfs.c | 26 ++ include/linux/blk_types.h | 16 include/linux/blkdev.h| 4 + 10 files changed, 378 insertions(+), 1 deletion(-) create mode 100644 block/blk-stat.c create mode 100644 block/blk-stat.h diff --git a/block/Makefile b/block/Makefile index 934dac73fb37..2528c596f7ec 100644 --- a/block/Makefile +++ b/block/Makefile @@ -5,7 +5,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-flush.o blk-settings.o blk-ioc.o blk-map.o \ blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \ - blk-lib.o blk-mq.o blk-mq-tag.o \ + blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \ blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \ genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ badblocks.o partitions/ diff --git a/block/blk-core.c b/block/blk-core.c index 0bfaa54d3e9f..ca77c725b4e5 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req) { blk_dequeue_request(req); + blk_stat_set_issue_time(&req->issue_stat); + /* * We are now handing the request to the hardware, initialize * resid_len to full count and add the timeout handler. 
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes) trace_block_rq_complete(req->q, req, nr_bytes); + blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req); + if (!req->bio) return false; diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 01fb455d3377..633c79a538ea 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page) return ret; } +static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx) +{ + struct blk_mq_ctx *ctx; + unsigned int i; + + hctx_for_each_ctx(hctx, ctx, i) { + blk_stat_init(&ctx->stat[0]); + blk_stat_init(&ctx->stat[1]); + } +} + +static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx, + const char *page, size_t count) +{ + blk_mq_stat_clear(hctx); + return count; +} + +static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre) +{ + return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n", + pre, (long long) stat->nr_samples, + (long long) stat->mean, (long long) stat->min, + (long long) stat->max); +} + +static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page) +{ + struct blk_rq_stat stat[2]; + ssize_t ret; + + blk_stat_init(&stat[0]); + blk_stat_init(&stat[1]); + + blk_hctx_stat_get(hctx, stat); + + ret = print_stat(page, &stat[0], "read :"); + ret += print_stat(page + ret, &stat[1], "write:"); + return ret; +} + static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = { .attr = {.name = "dispatched", .mode = S_IRUGO }, .show = blk_mq_sysfs_dispatched_show, @@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = { .show = blk_mq_hw_sysfs_poll_show, .store = blk_mq_hw_sysfs_poll_store, }; +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = { + .attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR }, + .show = blk_mq_hw_sysfs_stat_show, + .store = blk_mq_hw_sysfs_stat_store, +}; static struct 
attribute *default_hw_ctx_attrs[] = { &blk_mq_hw_sysfs_queued.attr, @@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = { &blk_mq_hw_sysfs_cpus.attr, &blk_mq_hw_sysfs_active.attr, &blk_mq_hw_sysfs_poll.attr, + &blk_mq_hw_sysfs_stat.attr, NULL, }; diff --git a/block/blk-mq.c b/block/blk-mq.c index 2da1a0ee3318..4555a76d22a7 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -30,6 +30,7 @@ #include "blk.h" #include "blk-mq.h" #include "blk-mq-tag.h" +#include "blk-stat.h" static DEFINE_MUTEX(all_q_mutex); static LIST_HEAD(all_q_list); @@ -376,10 +377,19 @@ static void
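The per-window accounting the patch adds boils down to tracking min/max/count plus a running sum per direction, then folding the per-software-queue buckets into one result when the hardware-queue sysfs file is read. A hypothetical standalone sketch (field and function names are illustrative, not the patch's exact ones):

```c
#include <assert.h>
#include <limits.h>

/* one bucket of completion-latency samples (e.g. one ~0.1s window) */
struct rq_stat {
	long long sum;  /* total of all samples in the window */
	long long min, max;
	long long nr_samples;
};

static void rq_stat_init(struct rq_stat *st)
{
	st->sum = 0;
	st->min = LLONG_MAX;
	st->max = LLONG_MIN;
	st->nr_samples = 0;
}

static void rq_stat_add(struct rq_stat *st, long long nsec)
{
	if (nsec < st->min)
		st->min = nsec;
	if (nsec > st->max)
		st->max = nsec;
	st->sum += nsec;
	st->nr_samples++;
}

static long long rq_stat_mean(const struct rq_stat *st)
{
	return st->nr_samples ? st->sum / st->nr_samples : 0;
}

/*
 * Fold one per-sw-queue bucket into an aggregate, as the hardware
 * queue's "stats" sysfs file does when summing over its contexts.
 */
static void rq_stat_sum(struct rq_stat *dst, const struct rq_stat *src)
{
	if (!src->nr_samples)
		return;
	if (src->min < dst->min)
		dst->min = src->min;
	if (src->max > dst->max)
		dst->max = src->max;
	dst->sum += src->sum;
	dst->nr_samples += src->nr_samples;
}
```

Keeping a sum and deriving the mean on read keeps the completion-path update to a couple of compares and adds, which matters when it runs once per request.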
[PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT
This patch enables a hybrid polling mode. Instead of polling after IO submission, we can induce an artificial delay, and then poll after that. For example, if the IO is presumed to complete in 8 usecs from now, we can sleep for 4 usecs, wake up, and then do our polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it'll happen before. With this hybrid scheme, we can achieve big latency reductions while still using the same (or less) amount of CPU. Signed-off-by: Jens Axboe--- block/blk-mq.c | 38 ++ block/blk-sysfs.c | 29 + block/blk.h| 1 + include/linux/blkdev.h | 1 + 4 files changed, 69 insertions(+) diff --git a/block/blk-mq.c b/block/blk-mq.c index 4ef35588c299..caa55bec9411 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -302,6 +302,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, rq->rq_flags = 0; clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags); + clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags); blk_mq_put_tag(hctx, ctx, tag); blk_queue_exit(q); } @@ -2352,11 +2353,48 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) } EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues); +static void blk_mq_poll_hybrid_sleep(struct request_queue *q, +struct request *rq) +{ + struct hrtimer_sleeper hs; + ktime_t kt; + + if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags)) + return; + + set_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags); + + /* +* This will be replaced with the stats tracking code, using +* 'avg_completion_time / 2' as the pre-sleep target. 
+*/ + kt = ktime_set(0, q->poll_nsec); + + hrtimer_init_on_stack(&hs.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + hrtimer_set_expires(&hs.timer, kt); + + hrtimer_init_sleeper(&hs, current); + do { + if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) + break; + set_current_state(TASK_INTERRUPTIBLE); + hrtimer_start_expires(&hs.timer, HRTIMER_MODE_REL); + if (hs.task) + io_schedule(); + hrtimer_cancel(&hs.timer); + } while (hs.task && !signal_pending(current)); + + __set_current_state(TASK_RUNNING); + destroy_hrtimer_on_stack(&hs.timer); +} + bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq) { struct request_queue *q = hctx->queue; long state; + blk_mq_poll_hybrid_sleep(q, rq); + hctx->poll_considered++; state = current->state; diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 5bb4648f434a..467b81c6713c 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -336,6 +336,28 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count) return ret; } +static ssize_t queue_poll_delay_show(struct request_queue *q, char *page) +{ + return queue_var_show(q->poll_nsec / 1000, page); +} + +static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page, + size_t count) +{ + unsigned long poll_usec; + ssize_t ret; + + if (!q->mq_ops || !q->mq_ops->poll) + return -EINVAL; + + ret = queue_var_store(&poll_usec, page, count); + if (ret < 0) + return ret; + + q->poll_nsec = poll_usec * 1000; + return ret; +} + static ssize_t queue_poll_show(struct request_queue *q, char *page) { return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page); @@ -562,6 +584,12 @@ static struct queue_sysfs_entry queue_poll_entry = { .store = queue_poll_store, }; +static struct queue_sysfs_entry queue_poll_delay_entry = { + .attr = {.name = "io_poll_delay", .mode = S_IRUGO | S_IWUSR }, + .show = queue_poll_delay_show, + .store = queue_poll_delay_store, +}; + static struct queue_sysfs_entry queue_wc_entry = { .attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR }, .show = 
queue_wc_show, @@ -608,6 +636,7 @@ static struct attribute *default_attrs[] = { _wc_entry.attr, _dax_entry.attr, _stats_entry.attr, + _poll_delay_entry.attr, NULL, }; diff --git a/block/blk.h b/block/blk.h index aa132dea598c..041185e5f129 100644 --- a/block/blk.h +++ b/block/blk.h @@ -111,6 +111,7 @@ void blk_account_io_done(struct request *req); enum rq_atomic_flags { REQ_ATOM_COMPLETE = 0, REQ_ATOM_STARTED, + REQ_ATOM_POLL_SLEPT, }; /* diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index dcd8d6e8801f..6acd220dc3f3 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -502,6 +502,7 @@ struct request_queue { unsigned intrequest_fn_active;
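The sleep-then-poll structure above can be modeled in plain userspace C: sleep for the induced delay, then spin checking for completion. This is a loose analogy under stated assumptions (nanosleep standing in for the on-stack hrtimer, a callback standing in for the REQ_ATOM_COMPLETE check, and a spin bound so the sketch always terminates); all names are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

typedef bool (*poll_fn)(void *ctx);

/* sleep for delay_nsec, then busy-poll for completion */
static bool hybrid_poll(long delay_nsec, poll_fn done, void *ctx,
			int max_spins)
{
	if (delay_nsec > 0) {
		struct timespec ts = {
			.tv_sec  = delay_nsec / 1000000000L,
			.tv_nsec = delay_nsec % 1000000000L,
		};
		nanosleep(&ts, NULL);
	}

	while (max_spins-- > 0) {
		if (done(ctx))
			return true;
	}
	return false;
}

/* toy completion source: "completes" after a fixed number of polls */
struct toy_io {
	int polls_left;
};

static bool toy_done(void *ctx)
{
	struct toy_io *io = ctx;

	return --io->polls_left <= 0;
}
```

The point of the kernel patch is that a well-chosen delay moves most of the wait out of the spin loop, so the poll succeeds after only a few iterations instead of burning CPU for the whole completion time.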
[PATCH 2/4] block: move poll code to blk-mq
The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe--- block/blk-core.c | 36 +--- block/blk-mq.c | 33 + block/blk-mq.h | 2 ++ 3 files changed, 40 insertions(+), 31 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index ca77c725b4e5..7728562d77d9 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -3293,47 +3293,21 @@ EXPORT_SYMBOL(blk_finish_plug); bool blk_poll(struct request_queue *q, blk_qc_t cookie) { - struct blk_plug *plug; - long state; - unsigned int queue_num; struct blk_mq_hw_ctx *hctx; + struct blk_plug *plug; + struct request *rq; if (!q->mq_ops || !q->mq_ops->poll || !blk_qc_t_valid(cookie) || !test_bit(QUEUE_FLAG_POLL, >queue_flags)) return false; - queue_num = blk_qc_t_to_queue_num(cookie); - hctx = q->queue_hw_ctx[queue_num]; - hctx->poll_considered++; - plug = current->plug; if (plug) blk_flush_plug_list(plug, false); - state = current->state; - while (!need_resched()) { - int ret; - - hctx->poll_invoked++; - - ret = q->mq_ops->poll(hctx, blk_qc_t_to_tag(cookie)); - if (ret > 0) { - hctx->poll_success++; - set_current_state(TASK_RUNNING); - return true; - } - - if (signal_pending_state(state, current)) - set_current_state(TASK_RUNNING); - - if (current->state == TASK_RUNNING) - return true; - if (ret < 0) - break; - cpu_relax(); - } - - return false; + hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)]; + rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie)); + return blk_mq_poll(hctx, rq); } EXPORT_SYMBOL_GPL(blk_poll); diff --git a/block/blk-mq.c b/block/blk-mq.c index 4555a76d22a7..4ef35588c299 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2352,6 +2352,39 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) } EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues); +bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq) +{ + struct request_queue *q = hctx->queue; + long state; + + 
hctx->poll_considered++; + + state = current->state; + while (!need_resched()) { + int ret; + + hctx->poll_invoked++; + + ret = q->mq_ops->poll(hctx, rq->tag); + if (ret > 0) { + hctx->poll_success++; + set_current_state(TASK_RUNNING); + return true; + } + + if (signal_pending_state(state, current)) + set_current_state(TASK_RUNNING); + + if (current->state == TASK_RUNNING) + return true; + if (ret < 0) + break; + cpu_relax(); + } + + return false; +} + void blk_mq_disable_hotplug(void) { mutex_lock(_q_mutex); diff --git a/block/blk-mq.h b/block/blk-mq.h index 8cf16cb69f64..79ea86e0ed49 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -61,6 +61,8 @@ extern void blk_mq_rq_timed_out(struct request *req, bool reserved); void blk_mq_release(struct request_queue *q); +extern bool blk_mq_poll(struct blk_mq_hw_ctx *, struct request *); + static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q, unsigned int cpu) { -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] blk-mq: make the polling code adaptive
The previous commit introduced the hybrid sleep/poll mode. Take that one step further, and use the completion latencies to automatically sleep for half the mean completion time. This is a good approximation. This changes the 'io_poll_delay' sysfs file a bit to expose the various options. Depending on the value, the polling code will behave differently: -1 Never enter hybrid sleep mode 0 Use half of the completion mean for the sleep delay >0 Use this specific value as the sleep delay Signed-off-by: Jens Axboe--- block/blk-mq.c | 50 +++--- block/blk-sysfs.c | 28 include/linux/blkdev.h | 2 +- 3 files changed, 68 insertions(+), 12 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index caa55bec9411..2af75b087ebd 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2353,13 +2353,57 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) } EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues); +static unsigned long blk_mq_poll_nsecs(struct blk_mq_hw_ctx *hctx, + struct request *rq) +{ + struct blk_rq_stat stat[2]; + unsigned long ret = 0; + + /* +* We don't have to do this once per IO, should optimize this +* to just use the current window of stats until it changes +*/ + memset(, 0, sizeof(stat)); + blk_hctx_stat_get(hctx, stat); + + /* +* As an optimistic guess, use half of the mean service time +* for this type of request +*/ + if (req_op(rq) == REQ_OP_READ && stat[0].nr_samples) + ret = (stat[0].mean + 1) / 2; + else if (req_op(rq) == REQ_OP_WRITE && stat[1].nr_samples) + ret = (stat[1].mean + 1) / 2; + + return ret; +} + static void blk_mq_poll_hybrid_sleep(struct request_queue *q, +struct blk_mq_hw_ctx *hctx, struct request *rq) { struct hrtimer_sleeper hs; + unsigned int nsecs; ktime_t kt; - if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT, >atomic_flags)) + if (test_bit(REQ_ATOM_POLL_SLEPT, >atomic_flags)) + return; + + /* +* poll_nsec can be: +* +* -1: don't ever hybrid sleep +* 0: use half of prev avg +* >0: use this specific value +*/ 
+ if (q->poll_nsec == -1) + return; + else if (q->poll_nsec > 0) + nsecs = q->poll_nsec; + else + nsecs = blk_mq_poll_nsecs(hctx, rq); + + if (!nsecs) return; set_bit(REQ_ATOM_POLL_SLEPT, >atomic_flags); @@ -2368,7 +2412,7 @@ static void blk_mq_poll_hybrid_sleep(struct request_queue *q, * This will be replaced with the stats tracking code, using * 'avg_completion_time / 2' as the pre-sleep target. */ - kt = ktime_set(0, q->poll_nsec); + kt = ktime_set(0, nsecs); hrtimer_init_on_stack(, CLOCK_MONOTONIC, HRTIMER_MODE_REL); hrtimer_set_expires(, kt); @@ -2393,7 +2437,7 @@ bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq) struct request_queue *q = hctx->queue; long state; - blk_mq_poll_hybrid_sleep(q, rq); + blk_mq_poll_hybrid_sleep(q, hctx, rq); hctx->poll_considered++; diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 467b81c6713c..c668af57197b 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -338,24 +338,36 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count) static ssize_t queue_poll_delay_show(struct request_queue *q, char *page) { - return queue_var_show(q->poll_nsec / 1000, page); + int val; + + if (q->poll_nsec == -1) + val = -1; + else + val = q->poll_nsec / 1000; + + return sprintf(page, "%d\n", val); } static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page, size_t count) { - unsigned long poll_usec; - ssize_t ret; + int err, val; if (!q->mq_ops || !q->mq_ops->poll) return -EINVAL; - ret = queue_var_store(_usec, page, count); - if (ret < 0) - return ret; + err = kstrtoint(page, 10, ); + if (err < 0) + return err; - q->poll_nsec = poll_usec * 1000; - return ret; + printk(KERN_ERR "val=%d\n", val); + + if (val == -1) + q->poll_nsec = -1; + else + q->poll_nsec = val * 1000; + + return count; } static ssize_t queue_poll_show(struct request_queue *q, char *page) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 6acd220dc3f3..857f866d2751 100644 --- 
a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -502,7
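The three-way io_poll_delay semantics described above reduce to a small pure function. A sketch (hypothetical name) of the decision the patch implements:

```c
#include <assert.h>

/*
 * poll_nsec as exposed via the io_poll_delay sysfs file:
 *   -1  never enter hybrid sleep
 *    0  sleep half of the mean completion time for this op type
 *   >0  sleep exactly this long
 * Returns the pre-poll sleep in nanoseconds (0 == don't sleep).
 */
static long long hybrid_sleep_nsec(long long poll_nsec, long long mean_nsec)
{
	if (poll_nsec == -1)
		return 0;
	if (poll_nsec > 0)
		return poll_nsec;
	/* half the mean, rounded up, as in the patch */
	return mean_nsec ? (mean_nsec + 1) / 2 : 0;
}
```

Returning 0 when there are no stats yet matches the patch's behavior of skipping the sleep entirely until a window has samples.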
Re: untange block operations and fs READ/WRITE
On 11/01/2016 07:40 AM, Christoph Hellwig wrote: Hi Jens, this series removes the READ_* and WRITE_* definitions from fs.h and makes all bio submitters use the REQ_* flags directly. To make that easier we also change the meaning of some of the flags slightly, so that the callers don't have to set half a dozen flags for normal operations. After that the READ and WRITE definitions are decoupled from REQ_OP_READ and REQ_OP_WRITE and moved to kernel.h, and last but not least we can now stop including blk_types.h from fs.h and avoid the CONFIG_BLOCK ifdefs in it. Looks sane to me, and passes basic testing as well. Applied for 4.10. -- Jens Axboe
Re: block device direct I/O fast path
On Tue, Nov 01 2016, Christoph Hellwig wrote: > On Tue, Nov 01, 2016 at 11:00:19AM -0600, Jens Axboe wrote: > > #2 is a bit more problematic, I'm pondering how we can implement that on > > top of the bio approach. The nice thing about the request based approach > > is that we have a 1:1 mapping with the unit on the driver side. And we > > have a place to store the timer. I don't particularly love the embedded > > timer, however, it'd be great to implement that differently. Trying to > > think of options there, haven't found any yet. > > I have a couple ideas for that. Give me a few weeks and I'll send > patches. I'm not that patient :-) For the SYNC part, it should be easy enough to do by just using an on-stack hrtimer. I guess that will do for now, since we don't have poll support for async O_DIRECT right now anyway. http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.10/dio=e96d9afd56791a61d463cb88f8f3b48393b71020 Untested, but should get the point across. I'll fire up a test box. -- Jens Axboe
Re: block device direct I/O fast path
On Tue, Nov 01, 2016 at 11:00:19AM -0600, Jens Axboe wrote: > #2 is a bit more problematic, I'm pondering how we can implement that on > top of the bio approach. The nice thing about the request based approach > is that we have a 1:1 mapping with the unit on the driver side. And we > have a place to store the timer. I don't particularly love the embedded > timer, however, it'd be great to implement that differently. Trying to > think of options there, haven't found any yet. I have a couple ideas for that. Give me a few weeks and I'll send patches.
Re: block device direct I/O fast path
On Mon, Oct 31 2016, Christoph Hellwig wrote: > Hi Jens, > > this small series adds a fast path to the block device direct I/O > code. It uses new magic created by Kent to avoid allocating an array > for the pages, and as part of that allows small, non-aio direct I/O > requests to be done without memory allocations or atomic ops and with > a minimal cache footprint. It's basically a cut down version of the > new iomap direct I/O code, and in the future it might also make sense > to move the main direct I/O code to a similar model. But independent > of that it's always worth optimizing the case of small, non-I/O > requests as allocating the bio and biovec on stack and a trivial > completion handler will always win over a full blown implementation. I'm not particularly tied to the request based implementation that I did here: http://git.kernel.dk/cgit/linux-block/log/?h=blk-dio I basically wanted to solve two problems with that: 1) The slow old direct-io.c code 2) Implement the hybrid polling #1 is accomplished with your code as well, though it does lack support for async IO, but that would not be hard to add. #2 is a bit more problematic, I'm pondering how we can implement that on top of the bio approach. The nice thing about the request based approach is that we have a 1:1 mapping with the unit on the driver side. And we have a place to store the timer. I don't particularly love the embedded timer, however, it'd be great to implement that differently. Trying to think of options there, haven't found any yet. -- Jens Axboe
Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()
On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote: > [ My mail system got broken and the original reply didn't get through. Resent. > ] OK, this answers some of my questions from the previous email so disregard that one. > On Thu, Oct 13, 2016 at 11:33:13AM +0200, Jan Kara wrote: > > On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote: > > > Most of the work happens on the head page. Only when we need to copy data to > > > userspace do we find the relevant subpage. > > > > > > We are still limited by PAGE_SIZE per iteration. Lifting this limitation > > > would require some more work. > > > > Hum, I'm kind of lost. > > The limitation here comes from how copy_page_to_iter() and > copy_page_from_iter() work wrt. highmem: they can only handle one small > page at a time. > > On the write side, we also have a problem with assuming a small page: write length > and offset within the page are calculated before we know if a small or huge page is > allocated. It's not easy to fix. Looks like it would require a change in > the ->write_begin() interface to accept len > PAGE_SIZE. > > > Can you point me to some design document / email that would explain some > > high level ideas how are huge pages in page cache supposed to work? > > I'll elaborate more in the cover letter to the next revision. > > > When are we supposed to operate on the head page and when on subpage? > > It's case-by-case. See above explanation why we're limited to PAGE_SIZE > here. > > > What is protected by the page lock of the head page? > > Whole huge page. As with anon pages. > > > Do page locks of subpages play any role? > > lock_page() on any subpage would lock the whole huge page. > > > If I understand right, e.g. pagecache_get_page() will return subpages but > > is it generally safe to operate on subpages individually or do we have > > to be aware that they are part of a huge page? > > I tried to make it as transparent as possible: page flag operations will > be redirected to the head page, if necessary. 
Things like page_mapping() and > page_to_pgoff() know about huge pages. > > Direct access to struct page fields must be avoided for tail pages as most > of them don't have the meaning you would expect for small pages. OK, good to know. > > If I understand the motivation right, it is mostly about being able to mmap > > PMD-sized chunks to userspace. So my naive idea would be that we could just > > implement it by allocating PMD sized chunks of pages when adding pages to > > page cache, we don't even have to read them all unless we come from PMD > > fault path. > > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc} > per-hugepage, one common list of buffer heads... > > PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling > it otherwise doesn't make sense) and handling it differently for file-THP > is a nightmare from a maintenance POV. But the complexity of two different page sizes for page cache and *each* filesystem that wants to support it does not make the maintenance easy either. So I'm not convinced that using the same rules for anon-THP and file-THP is a clear win. And if we have these two options neither of which has negligible maintenance cost, I'd also like to see more justification for why it is a good idea to have file-THP for normal filesystems. Do you have any performance numbers that show it is a win under some realistic workload? I'd also note that having PMD-sized pages has some obvious disadvantages as well: 1) I'm not sure the buffer head handling code will scale to 512 or even 2048 buffer_heads on a linked list referenced from a page. It may work but I suspect the performance will suck. 2) PMD-sized pages result in increased space & memory usage. 3) In ext4 we have to estimate how much metadata we may need to modify when allocating blocks underlying a page in the worst case (you don't seem to update this estimate in your patch set). 
With 2048 blocks underlying a page, each possibly in a different block group, it is a lot of metadata forcing us to reserve a large transaction (not sure if you'll be able to even reserve such large transaction with the default journal size), which again makes things slower. 4) As you have noted some places like write_begin() still depend on 4k pages which creates a strange mix of places that use subpages and that use head pages. All this would be a non-issue (well, except 2 I guess) if we just didn't expose filesystems to the fact that something like file-THP exists. > > Reclaim may need to be aware not to split pages unnecessarily > > but that's about it. So I'd like to understand what's wrong with this > > naive idea and why do filesystems need to be aware that someone wants to > > map in PMD sized chunks... > > In addition to flags, THP uses some space in struct page of tail pages to > encode additional information. See compound_{mapcount,head,dtor,order}, > page_deferred_list(). Thanks, I'll check that. Honza -- Jan Kara
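The head/tail redirection Kirill describes (flag operations and lock_page() on a tail page acting on the whole compound page) can be illustrated with a toy model. The real kernel encodes the head pointer in page->compound_head with a tag bit; this sketch just uses a plain pointer and invented names, so it is an analogy rather than the kernel's data layout:

```c
#include <assert.h>
#include <stddef.h>

#define PG_DIRTY (1UL << 0)

struct page {
	unsigned long flags;
	struct page *head; /* NULL for head (or non-compound) pages */
};

static struct page *compound_head(struct page *p)
{
	return p->head ? p->head : p;
}

/* flag operations on any subpage are redirected to the head page */
static void set_page_dirty_bit(struct page *p)
{
	compound_head(p)->flags |= PG_DIRTY;
}

static int page_dirty(struct page *p)
{
	return (compound_head(p)->flags & PG_DIRTY) != 0;
}
```

Because every lookup goes through compound_head(), state like PG_dirty exists once per huge page, which is exactly the property the quoted discussion says is inherited from anon-THP.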
Re: [PATCH v5 12/14] SRP transport, scsi-mq: Wait for .queue_rq() if necessary
and again, Reviewed-by: Sagi Grimberg
Re: [PATCH v5 13/14] nvme: Fix a race condition related to stopping queues
Reviewed-by: Sagi Grimberg
Re: [PATCH v5 11/14] SRP transport: Move queuecommand() wait code to SCSI core
Again, Reviewed-by: Sagi Grimberg
Re: [PATCH 1/3] blk-mq: export blk_mq_map_queues
On Tue, Nov 01 2016, Martin K. Petersen wrote: > > "Christoph" == Christoph Hellwig writes: > > Christoph> This will allow SCSI to have a single blk_mq_ops structure > Christoph> that either lets the LLDD map the queues to PCIe MSIx vectors > Christoph> or use the default. > > Jens, any objection to me funneling this change through the SCSI tree? No, that's fine, you can add my reviewed-by. -- Jens Axboe
Re: [PATCH v5 08/14] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
Looks useful, Reviewed-by: Sagi Grimberg
Re: [PATCH v5 05/14] blk-mq: Avoid that requeueing starts stopped queues
Reviewed-by: Sagi Grimberg
Re: [PATCH v5 06/14] blk-mq: Remove blk_mq_cancel_requeue_work()
Reviewed-by: Sagi Grimberg
Re: [PATCH 2/3] scsi: allow LLDDs to expose the queue mapping to blk-mq
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/3] blk-mq: export blk_mq_map_queues
> "Christoph" == Christoph Hellwigwrites: Christoph> This will allow SCSI to have a single blk_mq_ops structure Christoph> that either lets the LLDD map the queues to PCIe MSIx vectors Christoph> or use the default. Jens, any objection to me funneling this change through the SCSI tree? -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 5/6] nvme: Add unlock_from_suspend
On Tue, Nov 01, 2016 at 06:57:05AM -0700, Christoph Hellwig wrote: > On Tue, Nov 01, 2016 at 10:18:13AM +0200, Sagi Grimberg wrote: > > > + > > > + return nvme_insert_rq(q, req, 1, sec_submit_endio); > > > > No need to introduce nvme_insert_rq at all, just call > > blk_mq_insert_request (other examples call blk_execute_rq_nowait > > but its pretty much the same...) > > blk_execute_rq_nowait is the API to use - blk_mq_insert_request isn't > even exported. Thanks for the reviews. This patch needs to be separated into two patches. There is the addition of the nvme-suspend stuff and the addition of sec_ops. Most of the clutter and weird stuff is coming from the latter. I'll separate the patches, use the correct api and clean the clutter. Thanks
Re: [PATCH 10/16] fs: decouple READ and WRITE from the block layer ops
On Tue, Nov 01, 2016 at 08:23:54AM -0600, Bart Van Assche wrote:
> On 11/01/2016 07:40 AM, Christoph Hellwig wrote:
>> +/* generic data direction defintions */
>> +#define READ			0
>> +#define WRITE			1
>
> Hello Christoph,
>
> If you have to resend this patch series, please fix the spelling of
> "definitions".

Thanks Bart, that's one of my favourite misspellings that I keep getting wrong again and again..
Re: [PATCH 10/16] fs: decouple READ and WRITE from the block layer ops
On 11/01/2016 07:40 AM, Christoph Hellwig wrote:
> +/* generic data direction defintions */
> +#define READ			0
> +#define WRITE			1

Hello Christoph,

If you have to resend this patch series, please fix the spelling of "definitions".

Thanks,

Bart.
Re: [PATCH 45/60] block: bio: introduce bio_for_each_segment_all_rd() and its write pair
On Tue, Nov 01, 2016 at 07:51:27AM +0800, Ming Lei wrote:
> Sorry for forgetting to mention one important point:
>
> - after multipage bvec is introduced, the iterated bvec pointer
> still points to a single-page bvec, which is generated in-flight
> and is readonly actually. That is the motivation for the introduction
> of bio_for_each_segment_all_rd().
>
> So maybe bio_for_each_page_all_ro() is better?
>
> For _wt(), we still can keep it as bio_for_each_segment(), which also
> reflects that now the iterated bvec points to one whole segment if
> we name _rd as bio_for_each_page_all_ro().

I'm agnostic as to what the right names are --- my big concern is that there is an explosion of bio_for_each_page_* functions, and that there isn't good documentation about (a) when to use each of these functions, and (b) why. I was going through the patch series, and even looking through all of the patches it was hard for me to figure out why. Once all of the patches are merged in, I am concerned this is going to be a massive trapdoor that will snare a large number of unwitting developers.

As far as my preference, from an abstract perspective: if one version (the read-write variant, I presume) is always safe, while the other (the read-only variant) is faster but only works under restricted circumstances, then name the safe version so it is the "default", give the more dangerous one a name that makes it a bit more obvious what you have to do in order to use it safely, and then very clearly document both in the sources and in the Documentation directory: what the issues are and what you have to do in order to use the faster version.

Cheers,

- Ted
expose the queue mapping for SCSI drivers V2
In 4.9 I've added support in the interrupt layer to automatically assign the interrupt affinity at interrupt allocation time, and expose that information to blk-mq. This series extends that so that SCSI drivers can pass on the information as well.

The SCSI part is fairly trivial, although we need to also export the default queue mapping function in blk-mq to keep things simple. I've also converted over the smartpqi driver as an example as it's the easiest of the multiqueue SCSI drivers to convert.

Changes since V1:
 - move the EXPORT_SYMBOL of blk_mq_map_queues to the right patch
 - added Reviewed-by, Acked-by and Tested-by tags
[PATCH 1/3] blk-mq: export blk_mq_map_queues
This will allow SCSI to have a single blk_mq_ops structure that either lets the LLDD map the queues to PCIe MSIx vectors or use the default.

Signed-off-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Reviewed-by: Johannes Thumshirn
---
 block/blk-mq-cpumap.c  | 1 +
 block/blk-mq.h         | 1 -
 include/linux/blk-mq.h | 1 +
 3 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 19b1d9c..8e61e86 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -87,6 +87,7 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
 	free_cpumask_var(cpus);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(blk_mq_map_queues);
 
 /*
  * We have no quick way of doing reverse lookups. This is only used at
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e5d2524..5347f01 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -38,7 +38,6 @@ void blk_mq_disable_hotplug(void);
 /*
  * CPU -> queue mappings
  */
-int blk_mq_map_queues(struct blk_mq_tag_set *set);
 extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
 
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 535ab2e..6c0fb25 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -237,6 +237,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_mq_freeze_queue_start(struct request_queue *q);
 
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set);
+int blk_mq_map_queues(struct blk_mq_tag_set *set);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 /*
-- 
2.1.4
[PATCH 2/3] scsi: allow LLDDs to expose the queue mapping to blk-mq
Just hand through the blk-mq map_queues method in the host template.

Signed-off-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Reviewed-by: Johannes Thumshirn
---
 drivers/scsi/scsi_lib.c  | 10 ++++++++++
 include/scsi/scsi_host.h |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2cca9cf..f23ec24 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1990,6 +1990,15 @@ static void scsi_exit_request(void *data, struct request *rq,
 	kfree(cmd->sense_buffer);
 }
 
+static int scsi_map_queues(struct blk_mq_tag_set *set)
+{
+	struct Scsi_Host *shost = container_of(set, struct Scsi_Host, tag_set);
+
+	if (shost->hostt->map_queues)
+		return shost->hostt->map_queues(shost);
+	return blk_mq_map_queues(set);
+}
+
 static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
 {
 	struct device *host_dev;
@@ -2082,6 +2091,7 @@ static struct blk_mq_ops scsi_mq_ops = {
 	.timeout	= scsi_timeout,
 	.init_request	= scsi_init_request,
 	.exit_request	= scsi_exit_request,
+	.map_queues	= scsi_map_queues,
 };
 
 struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7e4cd53..36680f1 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -278,6 +278,14 @@ struct scsi_host_template {
 	int (* change_queue_depth)(struct scsi_device *, int);
 
 	/*
+	 * This function lets the driver expose the queue mapping
+	 * to the block layer.
+	 *
+	 * Status: OPTIONAL
+	 */
+	int (* map_queues)(struct Scsi_Host *shost);
+
+	/*
 	 * This function determines the BIOS parameters for a given
 	 * harddisk. These tend to be numbers that are made up by
 	 * the host adapter. Parameters:
-- 
2.1.4
Re: [RFC PATCH 5/6] nvme: Add unlock_from_suspend
On Tue, Nov 01, 2016 at 10:18:13AM +0200, Sagi Grimberg wrote:
> > +
> > +	return nvme_insert_rq(q, req, 1, sec_submit_endio);
>
> No need to introduce nvme_insert_rq at all, just call
> blk_mq_insert_request (other examples call blk_execute_rq_nowait
> but its pretty much the same...)

blk_execute_rq_nowait is the API to use - blk_mq_insert_request isn't even exported.
[PATCH 08/16] block: replace REQ_NOIDLE with REQ_IDLE
Noidle should be the default for writes, as seen by all the compound definitions in fs.h using it. In fact only direct I/O really should be using NOIDLE, so turn the whole flag around to get the defaults right, which will make our life much easier especially once the WRITE_* defines go away.

This assumes all the existing "raw" users of REQ_SYNC for writes want noidle behavior, which seems to be spot on from a quick audit.

Signed-off-by: Christoph Hellwig
---
 Documentation/block/cfq-iosched.txt | 32
 block/cfq-iosched.c                 | 11 ---
 drivers/block/drbd/drbd_actlog.c    |  2 +-
 include/linux/blk_types.h           |  4 ++--
 include/linux/fs.h                  | 10 +-
 include/trace/events/f2fs.h         |  2 +-
 6 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
index 1e4f835..895bd38 100644
--- a/Documentation/block/cfq-iosched.txt
+++ b/Documentation/block/cfq-iosched.txt
@@ -240,11 +240,11 @@ All cfq queues doing synchronous sequential IO go on to sync-idle tree.
 On this tree we idle on each queue individually.
 
 All synchronous non-sequential queues go on sync-noidle tree. Also any
-request which are marked with REQ_NOIDLE go on this service tree. On this
-tree we do not idle on individual queues instead idle on the whole group
-of queues or the tree. So if there are 4 queues waiting for IO to dispatch
-we will idle only once last queue has dispatched the IO and there is
-no more IO on this service tree.
+synchronous write request which is not marked with REQ_IDLE goes on this
+service tree. On this tree we do not idle on individual queues instead idle
+on the whole group of queues or the tree. So if there are 4 queues waiting
+for IO to dispatch we will idle only once last queue has dispatched the IO
+and there is no more IO on this service tree.
 
 All async writes go on async service tree. There is no idling on async
 queues.
@@ -257,17 +257,17 @@ tree idling provides isolation with buffered write queues on async tree.
 FAQ
 ===
-Q1. Why to idle at all on queues marked with REQ_NOIDLE.
+Q1. Why to idle at all on queues not marked with REQ_IDLE.
 
-A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
-with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
+A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
+with REQ_IDLE. This helps in providing isolation with all the sync-idle
 queues. Otherwise in presence of many sequential readers, other synchronous
 IO might not get fair share of disk.
 
 For example, if there are 10 sequential readers doing IO and they get
-100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
-roughly after 1 second. If after completion of REQ_NOIDLE request we
-do not idle, and after a couple of milli seconds a another REQ_NOIDLE
+100ms each. If a !REQ_IDLE request comes in, it will be scheduled
+roughly after 1 second. If after completion of !REQ_IDLE request we
+do not idle, and after a couple of milli seconds a another !REQ_IDLE
 request comes in, again it will be scheduled after 1second. Repeat it
 and notice how a workload can lose its disk share and suffer due to
 multiple sequential readers.
@@ -276,16 +276,16 @@ A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
 context of fsync, and later some journaling data is written. Journaling
 data comes in only after fsync has finished its IO (atleast for ext4
 that seemed to be the case). Now if one decides not to idle on fsync
-thread due to REQ_NOIDLE, then next journaling write will not get
+thread due to !REQ_IDLE, then next journaling write will not get
 scheduled for another second. A process doing small fsync, will suffer
 badly in presence of multiple sequential readers.
 
-Hence doing tree idling on threads using REQ_NOIDLE flag on requests
+Hence doing tree idling on threads using !REQ_IDLE flag on requests
 provides isolation from multiple sequential readers and at the same
 time we do not idle on individual threads.
 
-Q2. When to specify REQ_NOIDLE
-A2. I would think whenever one is doing synchronous write and not expecting
+Q2. When to specify REQ_IDLE
+A2. I would think whenever one is doing synchronous write and expecting
 more writes to be dispatched from same context soon, should be able
-to specify REQ_NOIDLE on writes and that probably should work well for
+to specify REQ_IDLE on writes and that probably should work well for
 most of the cases.
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f28db97..dcbed8c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3914,6 +3914,12 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfqq->seek_history |= (sdist
[PATCH 12/16] block,fs: untangle fs.h and blk_types.h
Nothing in fs.h should require blk_types.h to be included.

Signed-off-by: Christoph Hellwig
---
 fs/9p/vfs_addr.c          | 1 +
 fs/cifs/connect.c         | 1 +
 fs/cifs/transport.c       | 1 +
 fs/gfs2/dir.c             | 1 +
 fs/isofs/compress.c       | 1 +
 fs/ntfs/logfile.c         | 1 +
 fs/ocfs2/buffer_head_io.c | 1 +
 fs/orangefs/inode.c       | 1 +
 fs/reiserfs/stree.c       | 1 +
 fs/squashfs/block.c       | 1 +
 fs/udf/dir.c              | 1 +
 fs/udf/directory.c        | 1 +
 fs/udf/inode.c            | 1 +
 fs/ufs/balloc.c           | 1 +
 include/linux/fs.h        | 2 +-
 include/linux/swap.h      | 1 +
 include/linux/writeback.h | 2 ++
 lib/iov_iter.c            | 1 +
 18 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 6181ad7..5ca1fb0 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index aab5227..db726e8 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -41,6 +41,7 @@
 #include
 #include
 #include
+#include
 #include "cifspdu.h"
 #include "cifsglob.h"
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 206a597..5f02edc 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 3cdde5f..7911321 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -62,6 +62,7 @@
 #include
 #include
 #include
+#include
 #include "gfs2.h"
 #include "incore.h"
diff --git a/fs/isofs/compress.c b/fs/isofs/compress.c
index 44af14b..9bb2fe3 100644
--- a/fs/isofs/compress.c
+++ b/fs/isofs/compress.c
@@ -18,6 +18,7 @@
 #include
 #include
+#include
 #include
 #include
diff --git a/fs/ntfs/logfile.c b/fs/ntfs/logfile.c
index 761f12f..353379f 100644
--- a/fs/ntfs/logfile.c
+++ b/fs/ntfs/logfile.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 #include "attrib.h"
 #include "aops.h"
diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
index 8f040f8..d9ebe11 100644
--- a/fs/ocfs2/buffer_head_io.c
+++ b/fs/ocfs2/buffer_head_io.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 #include
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index ef3b4eb..551bc74 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -8,6 +8,7 @@
  * Linux VFS inode operations.
  */
 
+#include
 #include "protocol.h"
 #include "orangefs-kernel.h"
 #include "orangefs-bufmap.h"
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index a97e352..0037aea 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 #include "reiserfs.h"
 #include
 #include
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index ce62a38..2751476 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/udf/dir.c b/fs/udf/dir.c
index aaec13c..2d0e028 100644
--- a/fs/udf/dir.c
+++ b/fs/udf/dir.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include "udf_i.h"
 #include "udf_sb.h"
diff --git a/fs/udf/directory.c b/fs/udf/directory.c
index 988d535..7aa48bd 100644
--- a/fs/udf/directory.c
+++ b/fs/udf/directory.c
@@ -16,6 +16,7 @@
 #include
 #include
+#include
 
 struct fileIdentDesc *udf_fileident_read(struct inode *dir, loff_t *nf_pos,
 					 struct udf_fileident_bh *fibh,
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index aad4640..0f3db71 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include
 #include "udf_i.h"
 #include "udf_sb.h"
diff --git a/fs/ufs/balloc.c b/fs/ufs/balloc.c
index 67e085d..b035af5 100644
--- a/fs/ufs/balloc.c
+++ b/fs/ufs/balloc.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "ufs_fs.h"
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5b0a9b7..8533e9d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -28,7 +28,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -38,6 +37,7 @@
 struct backing_dev_info;
 struct bdi_writeback;
+struct bio;
 struct export_operations;
 struct hd_geometry;
 struct iovec;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a56523c..3a6aebc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 #include
 
 struct notifier_block;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 797100e..e4c3870 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,6 +10,8 @@
 #include
 #include
 
+struct bio;
+
 DECLARE_PER_CPU(int, dirty_throttle_leaks);
[PATCH 16/16] block: remove the CONFIG_BLOCK ifdef in blk_types.h
Now that we have a separate header for struct bio_vec there is absolutely no excuse for including this header from non-block I/O code.

Signed-off-by: Christoph Hellwig
---
 include/linux/blk_types.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 63b750a..bb92102 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -17,7 +17,6 @@ struct io_context;
 struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *);
 
-#ifdef CONFIG_BLOCK
 /*
  * main unit of I/O for the block layer and lower layers (ie drivers and
  * stacking drivers)
@@ -126,8 +125,6 @@
 #define BVEC_POOL_OFFSET	(16 - BVEC_POOL_BITS)
 #define BVEC_POOL_IDX(bio)	((bio)->bi_flags >> BVEC_POOL_OFFSET)
 
-#endif /* CONFIG_BLOCK */
-
 /*
  * Operations and flags common to the bio and request structures.
  * We use 8 bits for encoding the operation, and the remaining 24 for flags.
-- 
2.1.4
[PATCH 13/16] arm, arm64: don't include blk_types.h in
No need for it - we only use struct bio_vec in prototypes and already have forward declarations for it.

Signed-off-by: Christoph Hellwig
---
 arch/arm/include/asm/io.h   | 1 -
 arch/arm64/include/asm/io.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
index 021692c..42871fb 100644
--- a/arch/arm/include/asm/io.h
+++ b/arch/arm/include/asm/io.h
@@ -25,7 +25,6 @@
 #include
 #include
-#include
 #include
 #include
 #include
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 0bba427..0c00c87 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,7 +22,6 @@
 #ifdef __KERNEL__
 
 #include
-#include
 #include
 #include
-- 
2.1.4
[PATCH 15/16] mm: only include blk_types in swap.h if CONFIG_SWAP is enabled
It's only needed for the CONFIG_SWAP-only use of bio_end_io_t. Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some ifdefs in blk_types.h. Instead we'll need to add a few explicit includes that were implicit before, though.

Signed-off-by: Christoph Hellwig
---
 drivers/staging/lustre/include/linux/lnet/types.h | 1 +
 drivers/staging/lustre/lustre/llite/rw.c          | 1 +
 fs/ntfs/aops.c                                    | 1 +
 fs/ntfs/mft.c                                     | 1 +
 fs/reiserfs/inode.c                               | 1 +
 fs/splice.c                                       | 1 +
 include/linux/swap.h                              | 4 +++-
 7 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/types.h b/drivers/staging/lustre/include/linux/lnet/types.h
index f8be0e2..8ca1e9d 100644
--- a/drivers/staging/lustre/include/linux/lnet/types.h
+++ b/drivers/staging/lustre/include/linux/lnet/types.h
@@ -34,6 +34,7 @@
 #define __LNET_TYPES_H__
 
 #include
+#include
 
 /** \addtogroup lnet
  * @{
diff --git a/drivers/staging/lustre/lustre/llite/rw.c b/drivers/staging/lustre/lustre/llite/rw.c
index 50c0152..76a6836 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -47,6 +47,7 @@
 #include
 /* current_is_kswapd() */
 #include
+#include
 
 #define DEBUG_SUBSYSTEM S_LLITE
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index fe251f1..d0cf6fe 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 
 #include "aops.h"
 #include "attrib.h"
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index d3c0096..b6f4021 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include
 
 #include "attrib.h"
 #include "aops.h"
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 58b2ded..cfeae9b 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 
 int reiserfs_commit_write(struct file *f, struct page *page,
 			  unsigned from, unsigned to);
diff --git a/fs/splice.c b/fs/splice.c
index 153d4f3..51492f2 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -17,6 +17,7 @@
  * Copyright (C) 2006 Ingo Molnar
  *
  */
+#include
 #include
 #include
 #include
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3a6aebc..bfee1af 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -11,7 +11,6 @@
 #include
 #include
 #include
-#include
 #include
 
 struct notifier_block;
@@ -352,6 +351,9 @@
 extern int kswapd_run(int nid);
 extern void kswapd_stop(int nid);
 
 #ifdef CONFIG_SWAP
+
+#include	/* for bio_end_io_t */
+
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-- 
2.1.4
[PATCH 14/16] ceph: don't include blk_types.h in messenger.h
The file only needs the struct bvec_iter declaration, which is available from bvec.h.

Signed-off-by: Christoph Hellwig
---
 include/linux/ceph/messenger.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 8dbd787..67bcef2 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -1,7 +1,7 @@
 #ifndef __FS_CEPH_MESSENGER_H
 #define __FS_CEPH_MESSENGER_H
 
-#include
+#include
 #include
 #include
 #include
-- 
2.1.4
[PATCH 05/16] btrfs: use op_is_sync to check for synchronous requests
Signed-off-by: Christoph Hellwig
---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/volumes.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3a57f99..c8454a8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -930,7 +930,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct inode *inode,
 
 	atomic_inc(&fs_info->nr_async_submits);
 
-	if (bio->bi_opf & REQ_SYNC)
+	if (op_is_sync(bio->bi_opf))
 		btrfs_set_work_high_priority(&async->work);
 
 	btrfs_queue_work(fs_info->workers, &async->work);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 71a60cc..deda46cf 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6100,7 +6100,7 @@ static noinline void btrfs_schedule_bio(struct btrfs_root *root,
 	bio->bi_next = NULL;
 
 	spin_lock(&device->io_lock);
-	if (bio->bi_opf & REQ_SYNC)
+	if (op_is_sync(bio->bi_opf))
 		pending_bios = &device->pending_sync_bios;
 	else
 		pending_bios = &device->pending_bios;
-- 
2.1.4
[PATCH 11/16] block, fs: move submit_bio to bio.h
This is where all the other bio operations live, so users must include bio.h anyway.

Signed-off-by: Christoph Hellwig
---
 include/linux/bio.h | 2 ++
 include/linux/fs.h  | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index fe9a170..5c604b49 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -404,6 +404,8 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
 
 }
 
+extern blk_qc_t submit_bio(struct bio *);
+
 extern void bio_endio(struct bio *);
 
 static inline void bio_io_error(struct bio *bio)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0ad36e0..5b0a9b7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2717,7 +2717,6 @@ static inline void remove_inode_hash(struct inode *inode)
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
-extern blk_qc_t submit_bio(struct bio *);
 extern int bdev_read_only(struct block_device *);
 #endif
 extern int set_blocksize(struct block_device *, int);
-- 
2.1.4
[PATCH 07/16] block: treat REQ_FUA and REQ_PREFLUSH as synchronous
Instead of requiring everyone to specify the REQ_SYNC flag as well.

Signed-off-by: Christoph Hellwig
---
 include/linux/blk_types.h | 8 +++-
 include/linux/fs.h        | 6 +++---
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3fa62ca..107d23d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -216,9 +216,15 @@ static inline bool op_is_write(unsigned int op)
 	return (op & 1);
 }
 
+/*
+ * Reads are always treated as synchronous, as are requests with the FUA or
+ * PREFLUSH flag.  Other operations may be marked as synchronous using the
+ * REQ_SYNC flag.
+ */
 static inline bool op_is_sync(unsigned int op)
 {
-	return (op & REQ_OP_MASK) == REQ_OP_READ || (op & REQ_SYNC);
+	return (op & REQ_OP_MASK) == REQ_OP_READ ||
+		(op & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH));
 }
 
 typedef unsigned int blk_qc_t;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5e0078f..ccedccb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -199,9 +199,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define READ_SYNC		0
 #define WRITE_SYNC		(REQ_SYNC | REQ_NOIDLE)
 #define WRITE_ODIRECT		REQ_SYNC
-#define WRITE_FLUSH		(REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH)
-#define WRITE_FUA		(REQ_SYNC | REQ_NOIDLE | REQ_FUA)
-#define WRITE_FLUSH_FUA	(REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH | REQ_FUA)
+#define WRITE_FLUSH		(REQ_NOIDLE | REQ_PREFLUSH)
+#define WRITE_FUA		(REQ_NOIDLE | REQ_FUA)
+#define WRITE_FLUSH_FUA	(REQ_NOIDLE | REQ_PREFLUSH | REQ_FUA)
 
 /*
  * Attribute flags.  These should be or-ed together to figure out what
-- 
2.1.4
[PATCH 04/16] bcache: use op_is_sync to check for synchronous requests
(and remove one layer of masking for the op_is_write call next to it).

Signed-off-by: Christoph Hellwig
---
 drivers/md/bcache/request.c   | 4 ++--
 drivers/md/bcache/writeback.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 40ffe5e..e8a2b69 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -404,8 +404,8 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
 
 	if (!congested &&
 	    mode == CACHE_MODE_WRITEBACK &&
-	    op_is_write(bio_op(bio)) &&
-	    (bio->bi_opf & REQ_SYNC))
+	    op_is_write(bio->bi_opf) &&
+	    op_is_sync(bio->bi_opf))
 		goto rescale;
 
 	spin_lock(&dc->io_lock);
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 301eaf5..629bd1a 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -57,8 +57,7 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
 	if (would_skip)
 		return false;
 
-	return bio->bi_opf & REQ_SYNC ||
-		in_use <= CUTOFF_WRITEBACK;
+	return op_is_sync(bio->bi_opf) || in_use <= CUTOFF_WRITEBACK;
 }
 
 static inline void bch_writeback_queue(struct cached_dev *dc)
-- 
2.1.4
[PATCH 03/16] umem: use op_is_sync to check for synchronous requests
Signed-off-by: Christoph Hellwig
---
 drivers/block/umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index be90e15..46f4c71 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -535,7 +535,7 @@ static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
 	*card->biotail = bio;
 	bio->bi_next = NULL;
 	card->biotail = &bio->bi_next;
-	if (bio->bi_opf & REQ_SYNC || !mm_check_plugged(card))
+	if (op_is_sync(bio->bi_opf) || !mm_check_plugged(card))
 		activate(card);
 	spin_unlock_irq(&card->lock);
-- 
2.1.4
[PATCH 06/16] block: don't use REQ_SYNC in the READ_SYNC definition
Reads are synchronous per definition, don't add another flag for it.

Signed-off-by: Christoph Hellwig
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3e878f..5e0078f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -196,7 +196,7 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define READ			REQ_OP_READ
 #define WRITE			REQ_OP_WRITE
 
-#define READ_SYNC		REQ_SYNC
+#define READ_SYNC		0
 #define WRITE_SYNC		(REQ_SYNC | REQ_NOIDLE)
 #define WRITE_ODIRECT		REQ_SYNC
 #define WRITE_FLUSH		(REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH)
-- 
2.1.4
[PATCH 02/16] blk-cgroup: use op_is_sync to check for synchronous requests
Signed-off-by: Christoph Hellwig
---
 include/linux/blk-cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index ddaf28d..01b62e7 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -599,7 +599,7 @@ static inline void blkg_rwstat_add(struct blkg_rwstat *rwstat,
 
 	__percpu_counter_add(cnt, val, BLKG_STAT_CPU_BATCH);
 
-	if (op & REQ_SYNC)
+	if (op_is_sync(op))
 		cnt = &rwstat->cpu_cnt[BLKG_RWSTAT_SYNC];
 	else
 		cnt = &rwstat->cpu_cnt[BLKG_RWSTAT_ASYNC];
-- 
2.1.4
[PATCH 01/16] cfq-iosched: use op_is_sync instead of opencoding it
Signed-off-by: Christoph Hellwig
---
 block/cfq-iosched.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c96186a..f28db97 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -912,15 +912,6 @@ static inline struct cfq_data *cic_to_cfqd(struct cfq_io_cq *cic)
 }
 
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
- */
-static inline bool cfq_bio_sync(struct bio *bio)
-{
-	return bio_data_dir(bio) == READ || (bio->bi_opf & REQ_SYNC);
-}
-
-/*
  * scheduler run of queue, if there are requests pending and no one in the
  * driver that will restart queueing
  */
@@ -2490,7 +2481,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_to_cfqq(cic, op_is_sync(bio->bi_opf));
 	if (cfqq)
 		return elv_rb_find(&cfqq->sort_list, bio_end_sector(bio));
 
@@ -2604,13 +2595,14 @@ static int cfq_allow_bio_merge(struct request_queue *q, struct request *rq,
 			       struct bio *bio)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
+	bool is_sync = op_is_sync(bio->bi_opf);
 	struct cfq_io_cq *cic;
 	struct cfq_queue *cfqq;
 
 	/*
 	 * Disallow merge of a sync bio into an async request.
 	 */
-	if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+	if (is_sync && !rq_is_sync(rq))
 		return false;
 
 	/*
@@ -2621,7 +2613,7 @@ static int cfq_allow_bio_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return false;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_to_cfqq(cic, is_sync);
 	return cfqq == RQ_CFQQ(rq);
 }
-- 
2.1.4
Re: [RFC PATCH 5/6] nvme: Add unlock_from_suspend
+struct sed_cb_data {
+	sec_cb		*cb;
+	void		*cb_data;
+	struct nvme_command cmd;
+};
+
+static void sec_submit_endio(struct request *req, int error)
+{
+	struct sed_cb_data *sed_data = req->end_io_data;
+
+	if (sed_data->cb)
+		sed_data->cb(error, sed_data->cb_data);
+
+	kfree(sed_data);
+	blk_mq_free_request(req);
+}
+
+static int nvme_insert_rq(struct request_queue *q, struct request *rq,
+			  int at_head, rq_end_io_fn *done)
+{
+	WARN_ON(rq->cmd_type == REQ_TYPE_FS);
+
+	rq->end_io = done;
+
+	if (!q->mq_ops)
+		return -EINVAL;
+
+	blk_mq_insert_request(rq, at_head, true, true);
+
+	return 0;
+}

No need for this function... you control the call site...

+
+static int nvme_sec_submit(void *data, u8 opcode, u16 SPSP,
+			   u8 SECP, void *buffer, size_t len,
+			   sec_cb *cb, void *cb_data)
+{
+	struct request_queue *q;
+	struct request *req;
+	struct sed_cb_data *sed_data;
+	struct nvme_ns *ns;
+	struct nvme_command *cmd;
+	int ret;
+
+	ns = data;//bdev->bd_disk->private_data;

?? you don't even have data anywhere in here...

+
+	sed_data = kzalloc(sizeof(*sed_data), GFP_NOWAIT);
+	if (!sed_data)
+		return -ENOMEM;
+	sed_data->cb = cb;
+	sed_data->cb_data = cb_data;
+	cmd = &sed_data->cmd;
+
+	cmd->common.opcode = opcode;
+	cmd->common.nsid = ns->ns_id;
+	cmd->common.cdw10[0] = SECP << 24 | SPSP << 8;
+	cmd->common.cdw10[1] = len;
+
+	q = ns->ctrl->admin_q;
+
+	req = nvme_alloc_request(q, cmd, 0, NVME_QID_ANY);
+	if (IS_ERR(req)) {
+		ret = PTR_ERR(req);
+		goto err_free;
+	}
+
+	req->timeout = ADMIN_TIMEOUT;
+	req->special = NULL;
+
+	if (buffer && len) {
+		ret = blk_rq_map_kern(q, req, buffer, len, GFP_NOWAIT);
+		if (ret) {
+			blk_mq_free_request(req);
+			goto err_free;
+		}
+	}
+
+	req->end_io_data = sed_data;
+	//req->rq_disk = bdev->bd_disk; ??
+
+	return nvme_insert_rq(q, req, 1, sec_submit_endio);

No need to introduce nvme_insert_rq at all, just call blk_mq_insert_request
(other examples call blk_execute_rq_nowait but it's pretty much the same...)
@@ -582,6 +583,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct nvme_command cmnd;
 	unsigned map_len;
 	int ret = BLK_MQ_RQ_QUEUE_OK;
+	unsigned long flags;
 
 	/*
 	 * If formated with metadata, require the block layer provide a buffer
@@ -614,18 +616,18 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	cmnd.common.command_id = req->tag;
 	blk_mq_start_request(req);
 
-	spin_lock_irq(&nvmeq->q_lock);
+	spin_lock_irqsave(&nvmeq->q_lock, flags);
 	if (unlikely(nvmeq->cq_vector < 0)) {
 		if (ns && !test_bit(NVME_NS_DEAD, &ns->flags))
 			ret = BLK_MQ_RQ_QUEUE_BUSY;
 		else
 			ret = BLK_MQ_RQ_QUEUE_ERROR;
-		spin_unlock_irq(&nvmeq->q_lock);
+		spin_unlock_irqrestore(&nvmeq->q_lock, flags);
 		goto out;
 	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
 	nvme_process_cq(nvmeq);
-	spin_unlock_irq(&nvmeq->q_lock);
+	spin_unlock_irqrestore(&nvmeq->q_lock, flags);

No documentation why this is needed...

 	return BLK_MQ_RQ_QUEUE_OK;
 out:
 	nvme_free_iod(dev, req);
@@ -635,11 +637,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 static void nvme_complete_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct nvme_dev *dev = iod->nvmeq->dev;
+	struct nvme_queue *nvmeq = iod->nvmeq;
+	struct nvme_dev *dev = nvmeq->dev;

This is a cleanup that should go in a different patch...

 	int error = 0;
 
 	nvme_unmap_data(dev, req);
-

Same here...

 	if (unlikely(req->errors)) {
 		if (nvme_req_needs_retry(req, req->errors)) {
 			req->retries++;
@@ -658,7 +660,6 @@ static void nvme_complete_rq(struct request *req)
 			"completing aborted command with status: %04x\n",
 			req->errors);
 	}
-

Here...

 	blk_mq_end_request(req, error);
 }
 
@@ -1758,10 +1759,11 @@ static void nvme_reset_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);
 	int result = -ENODEV;
-
+	bool was_suspend = false;
 
 	if (WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING))
 		goto out;
 
+	was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	/*
 	 * If we're called to reset a live controller first shut it down before
 	 * moving on.
@@ -1789,6 +1791,9 @@ static void
Re: [PATCH] Unittest framework based on nvme-cli.
Hi,

Introducing an nvme-cli based unit test framework. All the test cases use
commands implemented in nvme-cli. The goal is to have a simple, lightweight,
and easily expandable framework which we can use to develop various
categories of unit tests based on nvme-cli and improve overall development.

In the time since the release of nvme-cli, various test cases have been
developed which are now an integral part of our device driver testing.
These test cases have evolved around nvme-cli and can be used for nvme-cli
testing. I would like to take this opportunity and share a first set of
test cases which cover the most frequently used generic NVMe features from
the cli :-

1. nvme_attach_detach_ns_test.py
2. nvme_compare_test.py
3. nvme_create_max_ns_test.py
4. nvme_error_log_test.py
5. nvme_flush_test.py
6. nvme_format_test.py
7. nvme_get_features_test.py
8. nvme_read_write_test.py
9. nvme_smart_log_test.py
10. nvme_writeuncor_test.py
11. nvme_writezeros_test.py

Please have a look at the README for an overview and the process of adding a
new test case. The framework also has a sample skeleton which can be used
readily to write a new test case.

Assumptions for the current implementation :-

1. nvme-cli is already installed on the system.
2. Only one test case can be executed at any given time.
3. Each test case has logical PRE, RUN and POST sections.
4. It assumes that the driver is loaded and the default namespace
   "/dev/nvme0n1" is present. (It is easy to add driver load/unload in the
   pre and post sections of test cases as per requirement.)

I'd like to know what features and test cases most people would want as a
part of this framework. Any suggestions are welcome; I'd like to implement
them. On approval I would like to submit more test cases to enhance the
framework.
Regards,
Chaitanya

On Mon, Oct 31, 2016 at 11:01 PM, Chaitanya Kulkarni wrote:
>
> From: Chaitanya Kulkarni
>
> Signed-off-by: Chaitanya Kulkarni
> ---
>  Makefile                            |   5 +-
>  tests/Makefile                      |  48 +
>  tests/README                        |  84
>  tests/TODO                          |  14 ++
>  tests/config.json                   |   5 +
>  tests/nvme_attach_detach_ns_test.py |  90
>  tests/nvme_compare_test.py          |  79
>  tests/nvme_create_max_ns_test.py    |  97 +
>  tests/nvme_error_log_test.py        |  86
>  tests/nvme_flush_test.py            |  61 ++
>  tests/nvme_format_test.py           | 145 +
>  tests/nvme_get_features_test.py     | 103 ++
>  tests/nvme_read_write_test.py       |  72 +++
>  tests/nvme_simple_template_test.py  |  55 +
>  tests/nvme_smart_log_test.py        |  86
>  tests/nvme_test.py                  | 395
>  tests/nvme_test_io.py               |  99 +
>  tests/nvme_test_logger.py           |  52 +
>  tests/nvme_writeuncor_test.py       |  76 +++
>  tests/nvme_writezeros_test.py       | 102 ++
>  20 files changed, 1753 insertions(+), 1 deletion(-)
>  create mode 100644 tests/Makefile
>  create mode 100644 tests/README
>  create mode 100644 tests/TODO
>  create mode 100644 tests/config.json
>  create mode 100644 tests/nvme_attach_detach_ns_test.py
>  create mode 100644 tests/nvme_compare_test.py
>  create mode 100644 tests/nvme_create_max_ns_test.py
>  create mode 100644 tests/nvme_error_log_test.py
>  create mode 100644 tests/nvme_flush_test.py
>  create mode 100644 tests/nvme_format_test.py
>  create mode 100644 tests/nvme_get_features_test.py
>  create mode 100644 tests/nvme_read_write_test.py
>  create mode 100644 tests/nvme_simple_template_test.py
>  create mode 100644 tests/nvme_smart_log_test.py
>  create mode 100644 tests/nvme_test.py
>  create mode 100644 tests/nvme_test_io.py
>  create mode 100644 tests/nvme_test_logger.py
>  create mode 100644 tests/nvme_writeuncor_test.py
>  create mode 100644 tests/nvme_writezeros_test.py
>
> diff --git a/Makefile b/Makefile
> index 117cbbe..33c7190 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -46,6 +46,9 @@ nvme.o: nvme.c nvme.h nvme-print.h nvme-ioctl.h argconfig.h suffix.h nvme-lightn
>  doc: $(NVME)
>  	$(MAKE) -C Documentation
>
> +test:
> +	$(MAKE) -C tests/ run
> +
>  all: doc
>
>  clean:
> @@ -136,4 +139,4 @@ rpm: dist
>  	$(RPMBUILD) -ta nvme-$(NVME_VERSION).tar.gz
>
>  .PHONY: default doc all clean clobber install-man install-bin install
> -.PHONY: dist pkg dist-orig deb deb-light rpm FORCE
> +.PHONY: dist pkg dist-orig deb deb-light rpm FORCE test
> diff --git a/tests/Makefile b/tests/Makefile
> new file mode 100644
> index 000..c0f9f31
> --- /dev/null
> +++ b/tests/Makefile
> @@ -0,0 +1,48 @@
> +###
> +#
> +#    Makefile : Allows user to run testcases, generate documentation, and
> +#               perform static
[PATCH] Unittest framework based on nvme-cli.
From: Chaitanya Kulkarni

Signed-off-by: Chaitanya Kulkarni
---
 Makefile                            |   5 +-
 tests/Makefile                      |  48 +
 tests/README                        |  84
 tests/TODO                          |  14 ++
 tests/config.json                   |   5 +
 tests/nvme_attach_detach_ns_test.py |  90
 tests/nvme_compare_test.py          |  79
 tests/nvme_create_max_ns_test.py    |  97 +
 tests/nvme_error_log_test.py        |  86
 tests/nvme_flush_test.py            |  61 ++
 tests/nvme_format_test.py           | 145 +
 tests/nvme_get_features_test.py     | 103 ++
 tests/nvme_read_write_test.py       |  72 +++
 tests/nvme_simple_template_test.py  |  55 +
 tests/nvme_smart_log_test.py        |  86
 tests/nvme_test.py                  | 395
 tests/nvme_test_io.py               |  99 +
 tests/nvme_test_logger.py           |  52 +
 tests/nvme_writeuncor_test.py       |  76 +++
 tests/nvme_writezeros_test.py       | 102 ++
 20 files changed, 1753 insertions(+), 1 deletion(-)
 create mode 100644 tests/Makefile
 create mode 100644 tests/README
 create mode 100644 tests/TODO
 create mode 100644 tests/config.json
 create mode 100644 tests/nvme_attach_detach_ns_test.py
 create mode 100644 tests/nvme_compare_test.py
 create mode 100644 tests/nvme_create_max_ns_test.py
 create mode 100644 tests/nvme_error_log_test.py
 create mode 100644 tests/nvme_flush_test.py
 create mode 100644 tests/nvme_format_test.py
 create mode 100644 tests/nvme_get_features_test.py
 create mode 100644 tests/nvme_read_write_test.py
 create mode 100644 tests/nvme_simple_template_test.py
 create mode 100644 tests/nvme_smart_log_test.py
 create mode 100644 tests/nvme_test.py
 create mode 100644 tests/nvme_test_io.py
 create mode 100644 tests/nvme_test_logger.py
 create mode 100644 tests/nvme_writeuncor_test.py
 create mode 100644 tests/nvme_writezeros_test.py

diff --git a/Makefile b/Makefile
index 117cbbe..33c7190 100644
--- a/Makefile
+++ b/Makefile
@@ -46,6 +46,9 @@ nvme.o: nvme.c nvme.h nvme-print.h nvme-ioctl.h argconfig.h suffix.h nvme-lightn
 doc: $(NVME)
 	$(MAKE) -C Documentation
 
+test:
+	$(MAKE) -C tests/ run
+
 all: doc
 
 clean:
@@ -136,4 +139,4 @@ rpm: dist
 	$(RPMBUILD) -ta nvme-$(NVME_VERSION).tar.gz
 
 .PHONY: default doc all clean clobber install-man install-bin install
-.PHONY: dist pkg dist-orig deb deb-light rpm FORCE
+.PHONY: dist pkg dist-orig deb deb-light rpm FORCE test
diff --git a/tests/Makefile b/tests/Makefile
new file mode 100644
index 000..c0f9f31
--- /dev/null
+++ b/tests/Makefile
@@ -0,0 +1,48 @@
+###
+#
+#    Makefile : Allows user to run testcases, generate documentation, and
+#               perform static code analysis.
+#
+###
+
+NOSE2_OPTIONS="--verbose"
+
+help: all
+
+all:
+	@echo "Usage:"
+	@echo
+	@echo " make run - Run all testcases."
+	@echo " make doc - Generate Documentation."
+	@echo " make cleanall - removes *pyc, documentation."
+	@echo " make static_check - runs pep8, flake8, pylint on code."
+
+doc:
+	@epydoc -v --output=Documentation *.py
+
+run:
+	@for i in `ls *.py`; \
+	do \
+		echo "Running $${i}"; \
+		TESTCASE_NAME=`echo $${i} | cut -f 1 -d '.'`; \
+		nose2 ${NOSE2_OPTIONS} $${TESTCASE_NAME}; \
+	done
+
+static_check:
+	@for i in `ls *.py`; \
+	do \
+		echo "Pylint :- " ; \
+		printf "%10s" $${i}; \
+		pylint $${i} 2>&1 | grep "^Your code" | awk '{print $$7}';\
+		echo "";\
+		pep8 $${i}; \
+		echo "pep8 :- "; \
+		echo "flake8 :- "; \
+		flake8 $${i}; \
+	done
+
+cleanall: clean
+	@rm -fr *.pyc Documentation
+
+clean:
+	@rm -fr *.pyc
diff --git a/tests/README b/tests/README
new file mode 100644
index 000..70686d8
--- /dev/null
+++ b/tests/README
@@ -0,0 +1,84 @@
+nvmetests
+=========
+
+This contains NVMe unit tests framework. The purpose of this framework
+to use nvme cli and test various supported commands and scenarios for
+NVMe device.
+
+In current implementation this framework uses nvme cli to
+interact with underlying controller/namespace.
+
+1. Common Package Dependencies
+------------------------------
+
+1. Python(>= 2.7.5 or >= 3.3)
+2. nose2(Installation guide http://nose2.readthedocs.io/)
+3. nvme-cli(https://github.com/linux-nvme/nvme-cli.git)
+
+2. Overview
+-----------
+
+This framework follows simple class hierarchy. Each test