Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-01 Thread Jens Axboe
On Tue, Nov 01 2016, Johannes Thumshirn wrote:
> On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> > For legacy block, we simply track them in the request queue. For
> > blk-mq, we track them on a per-sw queue basis, which we can then
> > sum up through the hardware queues and finally to a per device
> > state.
> > 
> > The stats are tracked in, roughly, 0.1s interval windows.
> > 
> > Add sysfs files to display the stats.
> > 
> > Signed-off-by: Jens Axboe 
> > ---
> 
> [...]
> 
> >  
> > /* incremented at completion time */
> > unsigned long   ____cacheline_aligned_in_smp rq_completed[2];
> > +   struct blk_rq_stat  stat[2];
> 
> Can you add an enum or define for the directions? Just 0 and 1 aren't very
> intuitive.

Good point, I added that and updated both the stats patch and the
subsequent blk-mq poll code.
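
Something along these lines, as a sketch (the final names may well
differ in the committed version):

enum {
	BLK_STAT_READ	= 0,
	BLK_STAT_WRITE,
};

	blk_stat_init(&ctx->stat[BLK_STAT_READ]);
	blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);

rather than indexing with a bare 0 and 1.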

-- 
Jens Axboe



Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments

2016-11-01 Thread Kent Overstreet
On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
> > Avoid accessing .bi_vcnt directly, because it may no longer be what
> > the driver expects after multipage bvec support.
> > 
> > Signed-off-by: Ming Lei 
> 
> It would be really nice to have a comment in the code why it's
> even checking for multiple segments.

Or ideally refactor the code to not care about multiple segments at all.


Re: [PATCH 28/60] block: introduce QUEUE_FLAG_SPLIT_MP

2016-11-01 Thread Kent Overstreet
On Mon, Oct 31, 2016 at 08:39:15AM -0700, Christoph Hellwig wrote:
> On Sat, Oct 29, 2016 at 04:08:27PM +0800, Ming Lei wrote:
> > Some drivers (such as dm) should be capable of dealing with multipage
> > bvecs, but the incoming bio may be too big: for example, a new singlepage-bvec
> > bio can't be cloned from the bio, or can't be allocated with singlepage
> > bvecs of the same size.
> > 
> > At least crypt dm, log writes and bcache have this kind of issue.
> 
> We already have the segment_size limitation for request based drivers.
> I'd rather extend it to bio drivers if really needed.
> 
> But then again we should look into not having this limitation.  E.g.
> for bcache I'd be really surprised if it's that limited, given that
> Kent came up with this whole multipage bvec scheme.

AFAIK the only issue is with drivers that may have to bounce bios - pages that
were contiguous in the original bio won't necessarily be contiguous in the
bounced bio, thus bouncing might require more than BIO_MAX_SEGMENTS bvecs.

I don't know what Ming's referring to by "singlepage bvec bios".

Anyways, bouncing comes up in multiple places so we probably need to come up
with a generic solution for that. Other than that, there shouldn't be any issues
or limitations - if you're not bouncing, there's no need to clone the bvecs.
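
As a back-of-the-envelope illustration (stand-alone sketch, numbers
made up; BIO_MAX_SEGS here is just a stand-in for the real limit):

	#include <stdio.h>

	#define PAGE_SIZE	4096
	#define BIO_MAX_SEGS	256

	int main(void)
	{
		/* one 2MB multipage bvec: a single segment before bouncing */
		unsigned int bytes = 2 * 1024 * 1024;
		/* bounce pages aren't contiguous: worst case one bvec per page */
		unsigned int bounced = bytes / PAGE_SIZE;

		printf("1 bvec -> up to %u single-page bvecs (limit %u)\n",
		       bounced, BIO_MAX_SEGS);
		return 0;
	}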


Re: [PATCH 45/60] block: bio: introduce bio_for_each_segment_all_rd() and its write pair

2016-11-01 Thread Kent Overstreet
On Mon, Oct 31, 2016 at 08:11:23AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 31, 2016 at 09:59:43AM -0400, Theodore Ts'o wrote:
> > What is _rd and _wt supposed to stand for?
> 
> I think it's read and write, but I think the naming is highly
> unfortunate.  I started dabbling around with the patches a bit,
> and to keep my sanity I started renaming them to _pages and _bvec
> which is the real semantics - the _rd or _pages gives you a synthetic
> bvec for each page, and the other one gives you the full bvec.

My original naming was bio_for_each_segment() and bio_for_each_page().


Re: [PATCH v5 07/14] blk-mq: Introduce blk_mq_quiesce_queue()

2016-11-01 Thread Ming Lei
On Wed, Nov 2, 2016 at 12:02 AM, Sagi Grimberg  wrote:
> Reviewed-by: Sagi Grimberg 

Reviewed-by: Ming Lei 


--
Ming Lei


Re: [PATCH 45/60] block: bio: introduce bio_for_each_segment_all_rd() and its write pair

2016-11-01 Thread Ming Lei
On Tue, Nov 1, 2016 at 10:17 PM, Theodore Ts'o  wrote:
> On Tue, Nov 01, 2016 at 07:51:27AM +0800, Ming Lei wrote:
>> Sorry for forgetting to mention one important point:
>>
>> - after multipage bvec is introduced, the iterated bvec pointer
>> still points to a single-page bvec, which is generated in-flight
>> and is actually read-only. That is the motivation for introducing
>> bio_for_each_segment_all_rd().
>>
>> So maybe bio_for_each_page_all_ro() is better?
>>
>> For _wt(), we can still keep it as bio_for_each_segment(), which also
>> reflects that now the iterated bvec points to one whole segment if
>> we name _rd as bio_for_each_page_all_ro().
>
> I'm agnostic as to what the right names are --- my big concern is
> there is an explosion of bio_for_each_page_* functions, and that there

There aren't many users of bio_for_each_segment_all(); see:

[ming@linux-2.6]$ git grep -n bio_for_each_segment_all ./fs/ | wc -l
23

I guess there is no excuse not to switch them over after this patchset.

From an API point of view, bio_for_each_segment_all() is ugly and
exposes the bvec table to users; the main reason we keep it
is that it avoids one bvec copy per loop. It could easily be replaced
by bio_for_each_segment().

> isn't good documentation about (a) when to use each of these
> functions, and (b) why.  I was going through the patch series, and it
> was hard for me to figure out why, and I was looking through all of
> the patches.  Once all of the patches are merged in, I am concerned
> this is going to be massive trapdoor that will snare a large number of
> unwitting developers.

I understand your concern, and let me explain the whole story a bit:

1) in current linus tree, we have the following two bio iterator helpers,
for which we still don't provide any document:

   bio_for_each_segment(bvl, bio, iter)
   bio_for_each_segment_all(bvl, bio, i)

- the former is used to traverse each 'segment' in the bio range
described by the 'iter' (just like [start, size]); the latter is used
to traverse each 'segment' in the whole bio, so no 'iter' is
passed in.

- in the former helper, typeof('bvl') is 'struct bio_vec', and the 'segment'
is copied to 'bvl'; in the latter helper, typeof('bvl') is 'struct bio_vec *',
and it just points directly at one bvec in the table (bio->bi_io_vec), one
at a time.

- we could use the former helper to implement the latter easily and provide
a friendlier interface; the main reason we keep _all is that it can
avoid the bvec copy in each loop, so it might be a bit more efficient.

- even though 'segment' is used in the helpers' names, each 'bvl' in them
just describes one single page, so they really should have been
named as follows:

 bio_for_each_page(bvl, bio, iter)
 bio_for_each_page_all(bvl, bio, i)

2) this patchset introduces multipage bvecs, where each 'bvec' of the
table (bio->bi_io_vec) stores one real segment, and one segment may
include more than one page

- bio_for_each_segment() is kept as the current interface to retrieve
one page in each 'bvl'; that is just to keep current users happy,
and it will eventually be replaced with bio_for_each_page(), which
should be follow-up work to this patchset

- the story behind introducing bio_for_each_segment_all_rd(bvl, bio, i):
we can't simply make 'bvl' point to each bvec in the table directly
any more, because each bvec in the table now stores one real segment
instead of one page. So in this patchset _rd() is implemented on top of
bio_for_each_segment(), and via this helper we can no longer change or
write to the bvec in the table through the 'bvl' pointer.
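
To make the difference concrete, usage of the two existing helpers
looks roughly like this (a sketch against the current interfaces;
do_something() is a placeholder):

	struct bio_vec bv;		/* a copy, read-only view */
	struct bvec_iter iter;

	bio_for_each_segment(bv, bio, iter)
		do_something(bv.bv_page, bv.bv_offset, bv.bv_len);

	struct bio_vec *bvec;		/* points into bio->bi_io_vec */
	int i;

	bio_for_each_segment_all(bvec, bio, i)
		set_page_dirty(bvec->bv_page);	/* may also modify *bvec */

With multipage bvecs the second form is what breaks: 'bvec' can no
longer point into the table, because each table entry now covers a
whole segment rather than one page.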

>
> As far as my preference, from an abstract perspective, if one version
> (the read-write variant, I presume) is always safe, while one (the
> read-only variant) is faster, if you can work under restricted
> circumstances, naming the safe version so it is the "default", and
> more dangerous one with the name that makes it a bit more obvious what
> you have to do in order to use it safely, and then very clearly
> document both in sources, and in the Documentation directory, what the
> issues are and what you have to do in order to use the faster version.

I will add detailed documentation about these helpers in the next version:

- bio_for_each_segment()
- bio_for_each_segment_all()
- bio_for_each_page_all_ro() (renamed from bio_for_each_segment_all_rd())

Thanks,
Ming

>
> Cheers,
>
> - Ted
>


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-01 Thread Johannes Thumshirn
On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.
> 
> The stats are tracked in, roughly, 0.1s interval windows.
> 
> Add sysfs files to display the stats.
> 
> Signed-off-by: Jens Axboe 
> ---

[...]

>  
>   /* incremented at completion time */
>   unsigned long   ____cacheline_aligned_in_smp rq_completed[2];
> + struct blk_rq_stat  stat[2];

Can you add an enum or define for the directions? Just 0 and 1 aren't very
intuitive.

Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH 5/8] block: add code to track actual device queue depth

2016-11-01 Thread Jens Axboe
For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.

Fill a hole in struct request_queue with a 'queue_depth' member, and
add a blk_set_queue_depth() helper that drivers can call to more
closely inform the block layer of the real queue depth.

Signed-off-by: Jens Axboe 
---
 block/blk-settings.c   | 12 
 drivers/scsi/scsi.c|  3 +++
 include/linux/blkdev.h | 11 +++
 3 files changed, 26 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 55369a65dea2..9cf053759363 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -837,6 +837,18 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
 EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 
 /**
+ * blk_set_queue_depth - tell the block layer about the device queue depth
+ * @q: the request queue for the device
+ * @depth: queue depth
+ *
+ */
+void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
+{
+   q->queue_depth = depth;
+}
+EXPORT_SYMBOL(blk_set_queue_depth);
+
+/**
  * blk_queue_write_cache - configure queue's write cache
  * @q: the request queue for the device
  * @wc:write back cache on or off
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 1deb6adc411f..75455d4dab68 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
wmb();
}
 
+   if (sdev->request_queue)
+   blk_set_queue_depth(sdev->request_queue, depth);
+
return sdev->queue_depth;
 }
 EXPORT_SYMBOL(scsi_change_queue_depth);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8396da2bb698..0c677fb35ce4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -405,6 +405,8 @@ struct request_queue {
struct blk_mq_ctx __percpu  *queue_ctx;
unsigned int    nr_queues;

+   unsigned int    queue_depth;
+
/* hw dispatch queues */
struct blk_mq_hw_ctx    **queue_hw_ctx;
unsigned int    nr_hw_queues;
@@ -777,6 +779,14 @@ static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
return false;
 }
 
+static inline unsigned int blk_queue_depth(struct request_queue *q)
+{
+   if (q->queue_depth)
+   return q->queue_depth;
+
+   return q->nr_requests;
+}
+
 /*
  * q->prep_rq_fn return values
  */
@@ -1093,6 +1103,7 @@ extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
 extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
 extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
+extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
 extern void blk_set_default_limits(struct queue_limits *lim);
 extern void blk_set_stacking_limits(struct queue_limits *lim);
 extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
-- 
2.7.4



[PATCH 3/8] writeback: mark background writeback as such

2016-11-01 Thread Jens Axboe
If we're doing background type writes, then use the appropriate
background write flags for that.

Signed-off-by: Jens Axboe 
---
 include/linux/writeback.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 50c96ee8108f..c78f9f0920b5 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -107,6 +107,8 @@ static inline int wbc_to_write_flags(struct writeback_control *wbc)
 {
if (wbc->sync_mode == WB_SYNC_ALL)
return REQ_SYNC;
+   else if (wbc->for_kupdate || wbc->for_background)
+   return REQ_BACKGROUND;
 
return 0;
 }
-- 
2.7.4



[PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()

2016-11-01 Thread Jens Axboe
Note in the bdi_writeback structure whenever a task ends up sleeping
waiting for progress. We can use that information in the lower layers
to increase the priority of writes.

Signed-off-by: Jens Axboe 
---
 include/linux/backing-dev-defs.h | 2 ++
 mm/backing-dev.c | 1 +
 mm/page-writeback.c  | 1 +
 3 files changed, 4 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index c357f27d5483..dc5f76d7f648 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
struct list_head work_list;
struct delayed_work dwork;  /* work item used for writeback */
 
+   unsigned long dirty_sleep;  /* last wait */
+
struct list_head bdi_node;  /* anchored at bdi->wb_list */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8fde443f36d7..3bfed5ab2475 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -310,6 +310,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
	spin_lock_init(&wb->work_lock);
	INIT_LIST_HEAD(&wb->work_list);
	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+	wb->dirty_sleep = jiffies;
 
wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
if (!wb->congested)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 439cc63ad903..52e2f8e3b472 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1778,6 +1778,7 @@ static void balance_dirty_pages(struct address_space *mapping,
  pause,
  start_time);
__set_current_state(TASK_KILLABLE);
+   wb->dirty_sleep = now;
io_schedule_timeout(pause);
 
current->dirty_paused_when = now + pause;
-- 
2.7.4



[PATCH 1/8] block: add WRITE_BACKGROUND

2016-11-01 Thread Jens Axboe
This adds a new request flag, REQ_BACKGROUND, that callers can use to
tell the block layer that this is background (non-urgent) IO.

Signed-off-by: Jens Axboe 
---
 include/linux/blk_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bb921028e7c5..562ac46cb790 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -177,6 +177,7 @@ enum req_flag_bits {
__REQ_FUA,  /* forced unit access */
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD,   /* read ahead, can fail anytime */
+   __REQ_BACKGROUND,   /* background IO */
__REQ_NR_BITS,  /* stops here */
 };
 
@@ -192,6 +193,7 @@ enum req_flag_bits {
 #define REQ_FUA        (1ULL << __REQ_FUA)
 #define REQ_PREFLUSH   (1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
+#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
 
 #define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.7.4



[PATCHSET] Throttled buffered writeback

2016-11-01 Thread Jens Axboe
I have addressed the (small) review comments from Christoph, and
rebased it on top of for-4.10/block, since that now has the
flag unification and fs side cleanups as well. This impacted the
prep patches, and the wbt code.

I'd really like to get this merged for 4.10. It's block specific
at this point, and defaults to just being enabled for blk-mq
managed devices.

Let me know if there are any objections.



[PATCH 6/8] block: add scalable completion tracking of requests

2016-11-01 Thread Jens Axboe
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe 
---
 block/Makefile|   2 +-
 block/blk-core.c  |   4 +
 block/blk-mq-sysfs.c  |  47 ++
 block/blk-mq.c|  14 +++
 block/blk-mq.h|   3 +
 block/blk-stat.c  | 226 ++
 block/blk-stat.h  |  37 
 block/blk-sysfs.c |  26 ++
 include/linux/blk_types.h |  16 
 include/linux/blkdev.h|   4 +
 10 files changed, 378 insertions(+), 1 deletion(-)
 create mode 100644 block/blk-stat.c
 create mode 100644 block/blk-stat.h

diff --git a/block/Makefile b/block/Makefile
index 934dac73fb37..2528c596f7ec 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-   blk-lib.o blk-mq.o blk-mq-tag.o \
+   blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 0bfaa54d3e9f..ca77c725b4e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
 {
blk_dequeue_request(req);
 
+   blk_stat_set_issue_time(&req->issue_stat);
+
/*
 * We are now handing the request to the hardware, initialize
 * resid_len to full count and add the timeout handler.
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 
trace_block_rq_complete(req->q, req, nr_bytes);
 
+   blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
+
if (!req->bio)
return false;
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 01fb455d3377..633c79a538ea 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
return ret;
 }
 
+static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
+{
+   struct blk_mq_ctx *ctx;
+   unsigned int i;
+
+   hctx_for_each_ctx(hctx, ctx, i) {
+       blk_stat_init(&ctx->stat[0]);
+       blk_stat_init(&ctx->stat[1]);
+   }
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
+ const char *page, size_t count)
+{
+   blk_mq_stat_clear(hctx);
+   return count;
+}
+
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+   return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+   pre, (long long) stat->nr_samples,
+   (long long) stat->mean, (long long) stat->min,
+   (long long) stat->max);
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+   struct blk_rq_stat stat[2];
+   ssize_t ret;
+
+   blk_stat_init(&stat[0]);
+   blk_stat_init(&stat[1]);
+
+   blk_hctx_stat_get(hctx, stat);
+
+   ret = print_stat(page, &stat[0], "read :");
+   ret += print_stat(page + ret, &stat[1], "write:");
+   return ret;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
.attr = {.name = "dispatched", .mode = S_IRUGO },
.show = blk_mq_sysfs_dispatched_show,
@@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
.show = blk_mq_hw_sysfs_poll_show,
.store = blk_mq_hw_sysfs_poll_store,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
+   .attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
+   .show = blk_mq_hw_sysfs_stat_show,
+   .store = blk_mq_hw_sysfs_stat_store,
+};
 
 static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_queued.attr,
@@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_cpus.attr,
&blk_mq_hw_sysfs_active.attr,
&blk_mq_hw_sysfs_poll.attr,
+   &blk_mq_hw_sysfs_stat.attr,
NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2da1a0ee3318..4555a76d22a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -30,6 +30,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-stat.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -376,10 +377,19 @@ static void 

[PATCH 2/8] writeback: add wbc_to_write_flags()

2016-11-01 Thread Jens Axboe
Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.

Signed-off-by: Jens Axboe 
Reviewed-by: Jan Kara 
---
 fs/buffer.c   | 2 +-
 fs/f2fs/data.c| 2 +-
 fs/f2fs/node.c| 2 +-
 fs/gfs2/meta_io.c | 3 +--
 fs/mpage.c| 2 +-
 fs/xfs/xfs_aops.c | 8 ++--
 include/linux/writeback.h | 9 +
 mm/page_io.c  | 5 +
 8 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index bc7c2bb30a9b..af5776da814a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1697,7 +1697,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr_underway = 0;
-   int write_flags = (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0);
+   int write_flags = wbc_to_write_flags(wbc);
 
head = create_page_buffers(page, inode,
(1 << BH_Dirty)|(1 << BH_Uptodate));
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b80bf10603d7..9e5561fa4cb6 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1249,7 +1249,7 @@ static int f2fs_write_data_page(struct page *page,
.sbi = sbi,
.type = DATA,
.op = REQ_OP_WRITE,
-   .op_flags = (wbc->sync_mode == WB_SYNC_ALL) ? REQ_SYNC : 0,
+   .op_flags = wbc_to_write_flags(wbc),
.page = page,
.encrypted_page = NULL,
};
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 932f3f8bb57b..d1e29deb4598 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1570,7 +1570,7 @@ static int f2fs_write_node_page(struct page *page,
.sbi = sbi,
.type = NODE,
.op = REQ_OP_WRITE,
-   .op_flags = (wbc->sync_mode == WB_SYNC_ALL) ? REQ_SYNC : 0,
+   .op_flags = wbc_to_write_flags(wbc),
.page = page,
.encrypted_page = NULL,
};
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index e562b1191c9c..49db8ef13fdf 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc)
 {
struct buffer_head *bh, *head;
int nr_underway = 0;
-   int write_flags = REQ_META | REQ_PRIO |
-   (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0);
+   int write_flags = REQ_META | REQ_PRIO | wbc_to_write_flags(wbc);
 
BUG_ON(!PageLocked(page));
BUG_ON(!page_has_buffers(page));
diff --git a/fs/mpage.c b/fs/mpage.c
index f35e2819d0c6..98fc11aa7e0b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -489,7 +489,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
struct buffer_head map_bh;
loff_t i_size = i_size_read(inode);
int ret = 0;
-   int op_flags = (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0);
+   int op_flags = wbc_to_write_flags(wbc);
 
if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 594e02c485b2..6be5204a06d3 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -495,9 +495,7 @@ xfs_submit_ioend(
 
ioend->io_bio->bi_private = ioend;
ioend->io_bio->bi_end_io = xfs_end_bio;
-   ioend->io_bio->bi_opf = REQ_OP_WRITE;
-   if (wbc->sync_mode == WB_SYNC_ALL)
-   ioend->io_bio->bi_opf |= REQ_SYNC;
+   ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
 
/*
 * If we are failing the IO now, just mark the ioend with an
@@ -569,9 +567,7 @@ xfs_chain_bio(
 
bio_chain(ioend->io_bio, new);
bio_get(ioend->io_bio); /* for xfs_destroy_ioend */
-   ioend->io_bio->bi_opf = REQ_OP_WRITE;
-   if (wbc->sync_mode == WB_SYNC_ALL)
-   ioend->io_bio->bi_opf |= REQ_SYNC;
+   ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
submit_bio(ioend->io_bio);
ioend->io_bio = new;
 }
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e4c38703bf4e..50c96ee8108f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct bio;
 
@@ -102,6 +103,14 @@ struct writeback_control {
 #endif
 };
 
+static inline int wbc_to_write_flags(struct writeback_control *wbc)
+{
+   if (wbc->sync_mode == WB_SYNC_ALL)
+   return REQ_SYNC;
+
+   return 0;
+}
+
 /*
  * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
  * and are measured against each other in.  There always is one global
diff --git a/mm/page_io.c b/mm/page_io.c
index 

[PATCH 8/8] block: hook up writeback throttling

2016-11-01 Thread Jens Axboe
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has a heavy impact on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply decrement the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
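
As a rough illustration of the positive-step window shrink (the
100msec / sqrt(step + 1) rule described in the blk-wbt comments;
user-space sketch, the kernel uses integer math):

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		const double base_usec = 100000.0;	/* default 100 msec window */

		/* prints 100000, 70711, 57735, 50000 */
		for (int step = 0; step <= 3; step++)
			printf("step %d: window %.0f usec\n",
			       step, base_usec / sqrt(step + 1));
		return 0;
	}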

The patch registers two sysfs entries. The first one, 'wb_window_usec',
defines the window of monitoring. The second one, 'wb_lat_usec',
sets the latency target for the window. It defaults to 2 msec for
non-rotational storage, and 75 msec for rotational storage. Setting
this value to '0' disables blk-wb. Generally, a user would not have
to touch these settings.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.

Signed-off-by: Jens Axboe 
---
 Documentation/block/queue-sysfs.txt |  13 
 block/Kconfig   |  24 +++
 block/blk-core.c|  18 -
 block/blk-mq.c  |  27 +++-
 block/blk-settings.c|   4 ++
 block/blk-sysfs.c   | 134 
 block/cfq-iosched.c |  14 
 include/linux/blkdev.h  |   3 +
 8 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2a3904030dea..2847219ebd8c 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -169,5 +169,18 @@ This is the number of bytes the device can write in a single write-same
 command.  A value of '0' means write-same is not supported by this
 device.
 
+wb_lat_usec (RW)
+----------------
+If the device is registered for writeback throttling, then this file shows
+the target minimum read latency. If this latency is exceeded in a given
+window of time (see wb_window_usec), then the writeback throttling will start
+scaling back writes.
+
+wb_window_usec (RW)
+-------------------
+If the device is registered for writeback throttling, then this file shows
+the value of the monitoring window in which we'll look at the target
+latency. See wb_lat_usec.
+
 
 Jens Axboe , February 2009
diff --git a/block/Kconfig b/block/Kconfig
index 6b0ad08f0677..9f5d4dd7d751 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -120,6 +120,30 @@ config BLK_CMDLINE_PARSER
 
See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_WBT
+   bool "Enable support for block device writeback throttling"
+   default n
+   ---help---
+   Enabling this option enables the block layer to throttle buffered
+   writeback from the VM, making it more smooth and having less
+   impact on foreground operations.
+
+config BLK_WBT_SQ
+   bool "Single queue writeback throttling"
+   default n
+   depends on BLK_WBT
+   ---help---
+   Enable writeback throttling by default on legacy single queue devices
+
+config BLK_WBT_MQ
+   bool "Multiqueue writeback throttling"
+   default y
+   depends on BLK_WBT
+   ---help---
+   Enable writeback throttling by default on multiqueue devices.
+   Multiqueue currently doesn't have support for IO scheduling,
+   enabling this option is recommended.
+
 menu "Partition Types"
 
 source "block/partitions/Kconfig"
diff --git a/block/blk-core.c b/block/blk-core.c
index ca77c725b4e5..c68e92acf21a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include 

[PATCH 7/8] blk-wbt: add general throttling mechanism

2016-11-01 Thread Jens Axboe
We can hook this up to the block layer, to help throttle buffered
writes.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
   wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.

Signed-off-by: Jens Axboe 
---
 block/Makefile |   1 +
 block/blk-wbt.c| 704 +
 block/blk-wbt.h| 166 +++
 include/trace/events/wbt.h | 153 ++
 4 files changed, 1024 insertions(+)
 create mode 100644 block/blk-wbt.c
 create mode 100644 block/blk-wbt.h
 create mode 100644 include/trace/events/wbt.h

diff --git a/block/Makefile b/block/Makefile
index 2528c596f7ec..a827f988c4e6 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -24,3 +24,4 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER)  += cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI)   += blk-mq-pci.o
 obj-$(CONFIG_BLK_DEV_ZONED)+= blk-zoned.o
+obj-$(CONFIG_BLK_WBT)  += blk-wbt.o
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
new file mode 100644
index ..1b1d67aae1d3
--- /dev/null
+++ b/block/blk-wbt.c
@@ -0,0 +1,704 @@
+/*
+ * buffered writeback throttling. loosely based on CoDel. We can't drop
+ * packets for IO scheduling, so the logic is something like this:
+ *
+ * - Monitor latencies in a defined window of time.
+ * - If the minimum latency in the above window exceeds some target, increment
+ *   scaling step and scale down queue depth by a factor of 2x. The monitoring
+ *   window is then shrunk to 100 / sqrt(scaling step + 1).
+ * - For any window where we don't have solid data on what the latencies
+ *   look like, retain status quo.
+ * - If latencies look good, decrement scaling step.
+ * - If we're only doing writes, allow the scaling step to go negative. This
+ *   will temporarily boost write performance, snapping back to a stable
+ *   scaling step of 0 if reads show up or the heavy writers finish. Unlike
+ *   positive scaling steps where we shrink the monitoring window, a negative
+ *   scaling step retains the default step==0 window size.
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "blk-wbt.h"
+
+#define CREATE_TRACE_POINTS
+#include 
+
+enum {
+   /*
+* Default setting, we'll scale up (to 75% of QD max) or down (min 1)
+* from here depending on device stats
+*/
+   RWB_DEF_DEPTH   = 16,
+
+   /*
+* 100msec window
+*/
+   RWB_WINDOW_NSEC = 100 * 1000 * 1000ULL,
+
+   /*
+* Disregard stats, if we don't meet this minimum
+*/
+   RWB_MIN_WRITE_SAMPLES   = 3,
+
+   /*
+* If we have this number of consecutive windows with not enough
+* information to scale up or down, scale up.
+*/
+   RWB_UNKNOWN_BUMP= 5,
+};
+
+static inline bool rwb_enabled(struct rq_wb *rwb)
+{
+   return rwb && rwb->wb_normal != 0;
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+   int cur = atomic_read(v);
+
+   for (;;) {
+   int old;
+
+   if (cur >= below)
+   return false;
+   old = atomic_cmpxchg(v, cur, cur + 1);
+   if (old == cur)
+   break;
+   cur = old;
+   }
+
+   return true;
+}
+
+static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
+{
+   if (rwb_enabled(rwb)) {
+   const unsigned long cur = jiffies;
+
+   if (cur != *var)
+   *var = cur;
+   }
+}
+
+/*
+ * If a task was rate throttled in balance_dirty_pages() within the last
+ * second or so, use that to indicate a higher cleaning rate.
+ */
+static bool wb_recent_wait(struct rq_wb *rwb)
+{
+   struct bdi_writeback *wb = &rwb->bdi->wb;
+
+   return time_before(jiffies, wb->dirty_sleep + HZ);
+}
+
+static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
+{
+   return &rwb->rq_wait[is_kswapd];
+}
+
+static void rwb_wake_all(struct rq_wb *rwb)
+{
+   int i;
+
+   for (i = 0; i < WBT_NUM_RWQ; i++) {
+       struct rq_wait *rqw = &rwb->rq_wait[i];
+
+       if (waitqueue_active(&rqw->wait))
+           wake_up_all(&rqw->wait);
+   }
+}
+
+void __wbt_done(struct rq_wb *rwb, enum 

[PATCH 1/4] block: add scalable completion tracking of requests

2016-11-01 Thread Jens Axboe
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe 
---
 block/Makefile|   2 +-
 block/blk-core.c  |   4 +
 block/blk-mq-sysfs.c  |  47 ++
 block/blk-mq.c|  14 +++
 block/blk-mq.h|   3 +
 block/blk-stat.c  | 226 ++
 block/blk-stat.h  |  37 
 block/blk-sysfs.c |  26 ++
 include/linux/blk_types.h |  16 
 include/linux/blkdev.h|   4 +
 10 files changed, 378 insertions(+), 1 deletion(-)
 create mode 100644 block/blk-stat.c
 create mode 100644 block/blk-stat.h

diff --git a/block/Makefile b/block/Makefile
index 934dac73fb37..2528c596f7ec 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-   blk-lib.o blk-mq.o blk-mq-tag.o \
+   blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 0bfaa54d3e9f..ca77c725b4e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
 {
blk_dequeue_request(req);
 
+   blk_stat_set_issue_time(&req->issue_stat);
+
/*
 * We are now handing the request to the hardware, initialize
 * resid_len to full count and add the timeout handler.
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 
trace_block_rq_complete(req->q, req, nr_bytes);
 
+   blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
+
if (!req->bio)
return false;
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 01fb455d3377..633c79a538ea 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
return ret;
 }
 
+static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
+{
+   struct blk_mq_ctx *ctx;
+   unsigned int i;
+
+   hctx_for_each_ctx(hctx, ctx, i) {
+       blk_stat_init(&ctx->stat[0]);
+       blk_stat_init(&ctx->stat[1]);
+   }
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
+ const char *page, size_t count)
+{
+   blk_mq_stat_clear(hctx);
+   return count;
+}
+
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+   return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+   pre, (long long) stat->nr_samples,
+   (long long) stat->mean, (long long) stat->min,
+   (long long) stat->max);
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+   struct blk_rq_stat stat[2];
+   ssize_t ret;
+
+   blk_stat_init(&stat[0]);
+   blk_stat_init(&stat[1]);
+
+   blk_hctx_stat_get(hctx, stat);
+
+   ret = print_stat(page, &stat[0], "read :");
+   ret += print_stat(page + ret, &stat[1], "write:");
+   return ret;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
.attr = {.name = "dispatched", .mode = S_IRUGO },
.show = blk_mq_sysfs_dispatched_show,
@@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
.show = blk_mq_hw_sysfs_poll_show,
.store = blk_mq_hw_sysfs_poll_store,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
+   .attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
+   .show = blk_mq_hw_sysfs_stat_show,
+   .store = blk_mq_hw_sysfs_stat_store,
+};
 
 static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_queued.attr,
@@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_cpus.attr,
&blk_mq_hw_sysfs_active.attr,
&blk_mq_hw_sysfs_poll.attr,
+   &blk_mq_hw_sysfs_stat.attr,
NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2da1a0ee3318..4555a76d22a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -30,6 +30,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-stat.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -376,10 +377,19 @@ static void 

[PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT

2016-11-01 Thread Jens Axboe
This patch enables a hybrid polling mode. Instead of polling after IO
submission, we can induce an artificial delay, and then poll after that.
For example, if the IO is presumed to complete in 8 usecs from now, we
can sleep for 4 usecs, wake up, and then do our polling. This still puts
a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
after the IO has completed, it'll happen before. With this hybrid
scheme, we can achieve big latency reductions while still using the same
(or less) amount of CPU.

Signed-off-by: Jens Axboe 
---
 block/blk-mq.c | 38 ++
 block/blk-sysfs.c  | 29 +
 block/blk.h|  1 +
 include/linux/blkdev.h |  1 +
 4 files changed, 69 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4ef35588c299..caa55bec9411 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -302,6 +302,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
rq->rq_flags = 0;
 
clear_bit(REQ_ATOM_STARTED, >atomic_flags);
+   clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
blk_mq_put_tag(hctx, ctx, tag);
blk_queue_exit(q);
 }
@@ -2352,11 +2353,48 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
+static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
+struct request *rq)
+{
+   struct hrtimer_sleeper hs;
+   ktime_t kt;
+
+   if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags))
+   return;
+
+   set_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
+
+   /*
+* This will be replaced with the stats tracking code, using
+* 'avg_completion_time / 2' as the pre-sleep target.
+*/
+   kt = ktime_set(0, q->poll_nsec);
+
+   hrtimer_init_on_stack(&hs.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+   hrtimer_set_expires(&hs.timer, kt);
+
+   hrtimer_init_sleeper(&hs, current);
+   do {
+       if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+           break;
+       set_current_state(TASK_INTERRUPTIBLE);
+       hrtimer_start_expires(&hs.timer, HRTIMER_MODE_REL);
+       if (hs.task)
+           io_schedule();
+       hrtimer_cancel(&hs.timer);
+   } while (hs.task && !signal_pending(current));
+
+   __set_current_state(TASK_RUNNING);
+   destroy_hrtimer_on_stack(&hs.timer);
+}
+
 bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 {
struct request_queue *q = hctx->queue;
long state;
 
+   blk_mq_poll_hybrid_sleep(q, rq);
+
hctx->poll_considered++;
 
state = current->state;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5bb4648f434a..467b81c6713c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -336,6 +336,28 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
return ret;
 }
 
+static ssize_t queue_poll_delay_show(struct request_queue *q, char *page)
+{
+   return queue_var_show(q->poll_nsec / 1000, page);
+}
+
+static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
+   size_t count)
+{
+   unsigned long poll_usec;
+   ssize_t ret;
+
+   if (!q->mq_ops || !q->mq_ops->poll)
+   return -EINVAL;
+
+   ret = queue_var_store(&poll_usec, page, count);
+   if (ret < 0)
+   return ret;
+
+   q->poll_nsec = poll_usec * 1000;
+   return ret;
+}
+
 static ssize_t queue_poll_show(struct request_queue *q, char *page)
 {
return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
@@ -562,6 +584,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
.store = queue_poll_store,
 };
 
+static struct queue_sysfs_entry queue_poll_delay_entry = {
+   .attr = {.name = "io_poll_delay", .mode = S_IRUGO | S_IWUSR },
+   .show = queue_poll_delay_show,
+   .store = queue_poll_delay_store,
+};
+
 static struct queue_sysfs_entry queue_wc_entry = {
.attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
.show = queue_wc_show,
@@ -608,6 +636,7 @@ static struct attribute *default_attrs[] = {
&queue_wc_entry.attr,
&queue_dax_entry.attr,
&queue_stats_entry.attr,
+   &queue_poll_delay_entry.attr,
NULL,
 };
 
diff --git a/block/blk.h b/block/blk.h
index aa132dea598c..041185e5f129 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -111,6 +111,7 @@ void blk_account_io_done(struct request *req);
 enum rq_atomic_flags {
REQ_ATOM_COMPLETE = 0,
REQ_ATOM_STARTED,
+   REQ_ATOM_POLL_SLEPT,
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dcd8d6e8801f..6acd220dc3f3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -502,6 +502,7 @@ struct request_queue {
unsigned int    request_fn_active;
 

[PATCH 2/4] block: move poll code to blk-mq

2016-11-01 Thread Jens Axboe
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe 
---
 block/blk-core.c | 36 +---
 block/blk-mq.c   | 33 +
 block/blk-mq.h   |  2 ++
 3 files changed, 40 insertions(+), 31 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ca77c725b4e5..7728562d77d9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3293,47 +3293,21 @@ EXPORT_SYMBOL(blk_finish_plug);
 
 bool blk_poll(struct request_queue *q, blk_qc_t cookie)
 {
-   struct blk_plug *plug;
-   long state;
-   unsigned int queue_num;
struct blk_mq_hw_ctx *hctx;
+   struct blk_plug *plug;
+   struct request *rq;
 
if (!q->mq_ops || !q->mq_ops->poll || !blk_qc_t_valid(cookie) ||
    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
return false;
 
-   queue_num = blk_qc_t_to_queue_num(cookie);
-   hctx = q->queue_hw_ctx[queue_num];
-   hctx->poll_considered++;
-
plug = current->plug;
if (plug)
blk_flush_plug_list(plug, false);
 
-   state = current->state;
-   while (!need_resched()) {
-   int ret;
-
-   hctx->poll_invoked++;
-
-   ret = q->mq_ops->poll(hctx, blk_qc_t_to_tag(cookie));
-   if (ret > 0) {
-   hctx->poll_success++;
-   set_current_state(TASK_RUNNING);
-   return true;
-   }
-
-   if (signal_pending_state(state, current))
-   set_current_state(TASK_RUNNING);
-
-   if (current->state == TASK_RUNNING)
-   return true;
-   if (ret < 0)
-   break;
-   cpu_relax();
-   }
-
-   return false;
+   hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+   rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie));
+   return blk_mq_poll(hctx, rq);
 }
 EXPORT_SYMBOL_GPL(blk_poll);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4555a76d22a7..4ef35588c299 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2352,6 +2352,39 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
+bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+   struct request_queue *q = hctx->queue;
+   long state;
+
+   hctx->poll_considered++;
+
+   state = current->state;
+   while (!need_resched()) {
+   int ret;
+
+   hctx->poll_invoked++;
+
+   ret = q->mq_ops->poll(hctx, rq->tag);
+   if (ret > 0) {
+   hctx->poll_success++;
+   set_current_state(TASK_RUNNING);
+   return true;
+   }
+
+   if (signal_pending_state(state, current))
+   set_current_state(TASK_RUNNING);
+
+   if (current->state == TASK_RUNNING)
+   return true;
+   if (ret < 0)
+   break;
+   cpu_relax();
+   }
+
+   return false;
+}
+
 void blk_mq_disable_hotplug(void)
 {
mutex_lock(&all_q_mutex);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8cf16cb69f64..79ea86e0ed49 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -61,6 +61,8 @@ extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
 
 void blk_mq_release(struct request_queue *q);
 
+extern bool blk_mq_poll(struct blk_mq_hw_ctx *, struct request *);
+
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
   unsigned int cpu)
 {
-- 
2.7.4



[PATCH 4/4] blk-mq: make the polling code adaptive

2016-11-01 Thread Jens Axboe
The previous commit introduced the hybrid sleep/poll mode. Take
that one step further, and use the completion latencies to
automatically sleep for half the mean completion time. This is
a good approximation.

This changes the 'io_poll_delay' sysfs file a bit to expose the
various options. Depending on the value, the polling code will
behave differently:

-1  Never enter hybrid sleep mode
 0  Use half of the completion mean for the sleep delay
>0  Use this specific value as the sleep delay

Signed-off-by: Jens Axboe 
---
 block/blk-mq.c | 50 +++---
 block/blk-sysfs.c  | 28 
 include/linux/blkdev.h |  2 +-
 3 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index caa55bec9411..2af75b087ebd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2353,13 +2353,57 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
+static unsigned long blk_mq_poll_nsecs(struct blk_mq_hw_ctx *hctx,
+  struct request *rq)
+{
+   struct blk_rq_stat stat[2];
+   unsigned long ret = 0;
+
+   /*
+* We don't have to do this once per IO, should optimize this
+* to just use the current window of stats until it changes
+*/
+   memset(&stat, 0, sizeof(stat));
+   blk_hctx_stat_get(hctx, stat);
+
+   /*
+* As an optimistic guess, use half of the mean service time
+* for this type of request
+*/
+   if (req_op(rq) == REQ_OP_READ && stat[0].nr_samples)
+   ret = (stat[0].mean + 1) / 2;
+   else if (req_op(rq) == REQ_OP_WRITE && stat[1].nr_samples)
+   ret = (stat[1].mean + 1) / 2;
+
+   return ret;
+}
+
 static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
+struct blk_mq_hw_ctx *hctx,
 struct request *rq)
 {
struct hrtimer_sleeper hs;
+   unsigned int nsecs;
ktime_t kt;
 
-   if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags))
+   if (test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags))
+   return;
+
+   /*
+* poll_nsec can be:
+*
+* -1:  don't ever hybrid sleep
+*  0:  use half of prev avg
+* >0:  use this specific value
+*/
+   if (q->poll_nsec == -1)
+   return;
+   else if (q->poll_nsec > 0)
+   nsecs = q->poll_nsec;
+   else
+   nsecs = blk_mq_poll_nsecs(hctx, rq);
+
+   if (!nsecs)
return;
 
set_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
@@ -2368,7 +2412,7 @@ static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
 * This will be replaced with the stats tracking code, using
 * 'avg_completion_time / 2' as the pre-sleep target.
 */
-   kt = ktime_set(0, q->poll_nsec);
+   kt = ktime_set(0, nsecs);
 
hrtimer_init_on_stack(&hs.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hrtimer_set_expires(&hs.timer, kt);
@@ -2393,7 +2437,7 @@ bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
struct request_queue *q = hctx->queue;
long state;
 
-   blk_mq_poll_hybrid_sleep(q, rq);
+   blk_mq_poll_hybrid_sleep(q, hctx, rq);
 
hctx->poll_considered++;
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 467b81c6713c..c668af57197b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -338,24 +338,36 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 
 static ssize_t queue_poll_delay_show(struct request_queue *q, char *page)
 {
-   return queue_var_show(q->poll_nsec / 1000, page);
+   int val;
+
+   if (q->poll_nsec == -1)
+   val = -1;
+   else
+   val = q->poll_nsec / 1000;
+
+   return sprintf(page, "%d\n", val);
 }
 
 static ssize_t queue_poll_delay_store(struct request_queue *q, const char 
*page,
size_t count)
 {
-   unsigned long poll_usec;
-   ssize_t ret;
+   int err, val;
 
if (!q->mq_ops || !q->mq_ops->poll)
return -EINVAL;
 
-   ret = queue_var_store(_usec, page, count);
-   if (ret < 0)
-   return ret;
+   err = kstrtoint(page, 10, );
+   if (err < 0)
+   return err;
 
-   q->poll_nsec = poll_usec * 1000;
-   return ret;
+   printk(KERN_ERR "val=%d\n", val);
+
+   if (val == -1)
+   q->poll_nsec = -1;
+   else
+   q->poll_nsec = val * 1000;
+
+   return count;
 }
 
 static ssize_t queue_poll_show(struct request_queue *q, char *page)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6acd220dc3f3..857f866d2751 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -502,7 

Re: untangle block operations and fs READ/WRITE

2016-11-01 Thread Jens Axboe

On 11/01/2016 07:40 AM, Christoph Hellwig wrote:

Hi Jens,

this series removes the READ_* and WRITE_* definitions from fs.h and makes
all bio submitters use the REQ_* flags directly.  To make that easier
we also change the meaning of some of the flags slightly, so that the
callers don't have to set half a dozen flags for normal operations.

After that the READ and WRITE definitions are decoupled from REQ_OP_READ
and REQ_OP_WRITE and moved to kernel.h, and last but not least we can
now stop including blk_types.h from fs.h and avoid the CONFIG_BLOCK
ifdefs in it.


Looks sane to me, and passes basic testing as well. Applied for 4.10.

--
Jens Axboe



Re: block device direct I/O fast path

2016-11-01 Thread Jens Axboe
On Tue, Nov 01 2016, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 11:00:19AM -0600, Jens Axboe wrote:
> > #2 is a bit more problematic, I'm pondering how we can implement that on
> > top of the bio approach. The nice thing about the request based approach
> > is that we have a 1:1 mapping with the unit on the driver side. And we
> > have a place to store the timer. I don't particularly love the embedded
> > timer, however, it'd be great to implement that differently. Trying to
> > think of options there, haven't found any yet.
> 
> I have a couple ideas for that.  Give me a few weeks and I'll send
> patches.

I'm not that patient :-)

For the SYNC part, it should be easy enough to do by just using an
on-stack hrtimer. I guess that will do for now, since we don't have
poll support for async O_DIRECT right now anyway.

http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.10/dio=e96d9afd56791a61d463cb88f8f3b48393b71020

Untested, but should get the point across. I'll fire up a test box.

-- 
Jens Axboe



Re: block device direct I/O fast path

2016-11-01 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 11:00:19AM -0600, Jens Axboe wrote:
> #2 is a bit more problematic, I'm pondering how we can implement that on
> top of the bio approach. The nice thing about the request based approach
> is that we have a 1:1 mapping with the unit on the driver side. And we
> have a place to store the timer. I don't particularly love the embedded
> timer, however, it'd be great to implement that differently. Trying to
> think of options there, haven't found any yet.

I have a couple ideas for that.  Give me a few weeks and I'll send
patches.


Re: block device direct I/O fast path

2016-11-01 Thread Jens Axboe
On Mon, Oct 31 2016, Christoph Hellwig wrote:
> Hi Jens,
> 
this small series adds a fast path to the block device direct I/O
> code.  It uses new magic created by Kent to avoid allocating an array
> for the pages, and as part of that allows small, non-aio direct I/O
> requests to be done without memory allocations or atomic ops and with
> a minimal cache footprint.  It's basically a cut down version of the
> new iomap direct I/O code, and in the future it might also make sense
> to move the main direct I/O code to a similar model.  But independent
> of that it's always worth optimizing the case of small, non-aio
> requests, as allocating the bio and biovec on stack and a trivial
> completion handler will always win over a full blown implementation.

I'm not particularly tied to the request based implementation that I did
here:

http://git.kernel.dk/cgit/linux-block/log/?h=blk-dio

I basically wanted to solve two problems with that:

1) The slow old direct-io.c code
2) Implement the hybrid polling

#1 is accomplished with your code as well, though it does lack support
for async IO, but that would not be hard to add.

#2 is a bit more problematic, I'm pondering how we can implement that on
top of the bio approach. The nice thing about the request based approach
is that we have a 1:1 mapping with the unit on the driver side. And we
have a place to store the timer. I don't particularly love the embedded
timer, however, it'd be great to implement that differently. Trying to
think of options there, haven't found any yet.

-- 
Jens Axboe



Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-01 Thread Jan Kara
On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> [ My mail system got broken and the original reply didn't get through. Resent. ]

OK, this answers some of my questions from previous email so disregard that
one.

> On Thu, Oct 13, 2016 at 11:33:13AM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> > > Most of the work happens on the head page. Only when we need to copy data to
> > > userspace do we find the relevant subpage.
> > > 
> > > We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> > > would require some more work.
> >
> > Hum, I'm kind of lost.
> 
> The limitation here comes from how copy_page_to_iter() and
> copy_page_from_iter() work wrt. highmem: they can only handle one small
> page at a time.
> 
> On the write side, we also have a problem with assuming small pages: the write
> length and offset within the page are calculated before we know if a small or
> huge page is allocated. It's not easy to fix. Looks like it would require a
> change in the ->write_begin() interface to accept len > PAGE_SIZE.
>
> > Can you point me to some design document / email that would explain some
> > high level ideas how are huge pages in page cache supposed to work?
> 
> I'll elaborate more in cover letter to next revision.
> 
> > When are we supposed to operate on the head page and when on subpage?
> 
> It's case-by-case. See above explanation why we're limited to PAGE_SIZE
> here.
> 
> > What is protected by the page lock of the head page?
> 
> Whole huge page. As with anon pages.
> 
> > Do page locks of subpages play any role?
> 
> lock_page() on any subpage would lock whole huge page.
> 
> > If I understand right, e.g. pagecache_get_page() will return subpages but
> > is it generally safe to operate on subpages individually or do we have
> > to be aware that they are part of a huge page?
> 
> I tried to make it as transparent as possible: page flag operations will
> be redirected to head page, if necessary. Things like page_mapping() and
> page_to_pgoff() know about huge pages.
> 
> Direct access to struct page fields must be avoided for tail pages as most
> of them don't have the meaning you would expect for small pages.

OK, good to know.
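
For my own notes: that redirection is then essentially the PF_HEAD
policy in page-flags.h, i.e. conceptually something like the sketch
below (illustrative only, not the exact macro expansion):

static inline int page_test_dirty(struct page *page)
{
	/* a flag op on a tail page acts on its head page */
	return test_bit(PG_dirty, &compound_head(page)->flags);
}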

> > If I understand the motivation right, it is mostly about being able to mmap
> > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > implement it by allocating PMD sized chunks of pages when adding pages to
> > page cache, we don't even have to read them all unless we come from PMD
> > fault path.
> 
> Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> per-hugepage, one common list of buffer heads...
> 
> PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
> it otherwise doesn't make sense) and handling it differently for file-THP
> is a nightmare from a maintenance POV.

But the complexity of two different page sizes for page cache and *each*
filesystem that wants to support it does not make the maintenance easy
either. So I'm not convinced that using the same rules for anon-THP and
file-THP is a clear win. And if we have these two options neither of which
has negligible maintenance cost, I'd also like to see more justification
for why it is a good idea to have file-THP for normal filesystems. Do you
have any performance numbers that show it is a win under some realistic
workload?

I'd also note that having PMD-sized pages has some obvious disadvantages as
well:

1) I'm not sure buffer head handling code will quite scale to 512 or even
2048 buffer_heads on a linked list referenced from a page. It may work but
I suspect the performance will suck. 

2) PMD-sized pages result in increased space & memory usage.

3) In ext4 we have to estimate how much metadata we may need to modify when
allocating blocks underlying a page in the worst case (you don't seem to
update this estimate in your patch set). With 2048 blocks underlying a page,
each possibly in a different block group, it is a lot of metadata forcing
us to reserve a large transaction (not sure if you'll even be able to
reserve such a large transaction with the default journal size), which again
makes things slower.

4) As you have noted, some places like write_begin() still depend on 4k
pages which creates a strange mix of places that use subpages and that use
head pages.

All this would be a non-issue (well, except 2 I guess) if we just didn't
expose filesystems to the fact that something like file-THP exists.

> > Reclaim may need to be aware not to split pages unnecessarily
> > but that's about it. So I'd like to understand what's wrong with this
> > naive idea and why do filesystems need to be aware that someone wants to
> > map in PMD sized chunks...
> 
> In addition to flags, THP uses some space in struct page of tail pages to
> encode additional information. See compound_{mapcount,head,dtor,order},
> page_deferred_list().

Thanks, I'll check that.

Honza
-- 
Jan Kara 

Re: [PATCH v5 12/14] SRP transport, scsi-mq: Wait for .queue_rq() if necessary

2016-11-01 Thread Sagi Grimberg

and again,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5 13/14] nvme: Fix a race condition related to stopping queues

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5 11/14] SRP transport: Move queuecommand() wait code to SCSI core

2016-11-01 Thread Sagi Grimberg

Again,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/3] blk-mq: export blk_mq_map_queues

2016-11-01 Thread Jens Axboe
On Tue, Nov 01 2016, Martin K. Petersen wrote:
> > "Christoph" == Christoph Hellwig  writes:
> 
> Christoph> This will allow SCSI to have a single blk_mq_ops structure
> Christoph> that either lets the LLDD map the queues to PCIe MSIx vectors
> Christoph> or use the default.
> 
> Jens, any objection to me funneling this change through the SCSI tree?

No, that's fine, you can add my reviewed-by.

-- 
Jens Axboe



Re: [PATCH v5 08/14] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()

2016-11-01 Thread Sagi Grimberg

Looks useful,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5 05/14] blk-mq: Avoid that requeueing starts stopped queues

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5 06/14] blk-mq: Remove blk_mq_cancel_requeue_work()

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 2/3] scsi: allow LLDDs to expose the queue mapping to blk-mq

2016-11-01 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/3] blk-mq: export blk_mq_map_queues

2016-11-01 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> This will allow SCSI to have a single blk_mq_ops structure
Christoph> that either lets the LLDD map the queues to PCIe MSIx vectors
Christoph> or use the default.

Jens, any objection to me funneling this change through the SCSI tree?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC PATCH 5/6] nvme: Add unlock_from_suspend

2016-11-01 Thread Scott Bauer
On Tue, Nov 01, 2016 at 06:57:05AM -0700, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 10:18:13AM +0200, Sagi Grimberg wrote:
> > > +
> > > + return nvme_insert_rq(q, req, 1, sec_submit_endio);
> > 
> > No need to introduce nvme_insert_rq at all, just call
> > blk_mq_insert_request (other examples call blk_execute_rq_nowait
> > but it's pretty much the same...)
> 
> blk_execute_rq_nowait is the API to use - blk_mq_insert_request isn't
> even exported.

Thanks for the reviews. This patch needs to be separated into two 
patches. There is the addition of the nvme-suspend stuff and the
addition of sec_ops. Most of the clutter and weird stuff is coming
from the latter. I'll separate the patches, use the correct API and
clean up the clutter.

Thanks


Re: [PATCH 10/16] fs: decouple READ and WRITE from the block layer ops

2016-11-01 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 08:23:54AM -0600, Bart Van Assche wrote:
> On 11/01/2016 07:40 AM, Christoph Hellwig wrote:
>> +/* generic data direction defintions */
>> +#define READ	0
>> +#define WRITE   1
>
> Hello Christoph,
>
> If you have to resend this patch series, please fix the spelling of 
> "definitions".

Thanks Bart, that's one of my favourite misspellings that I keep getting
wrong again and again..


Re: [PATCH 10/16] fs: decouple READ and WRITE from the block layer ops

2016-11-01 Thread Bart Van Assche

On 11/01/2016 07:40 AM, Christoph Hellwig wrote:

+/* generic data direction defintions */
+#define READ   0
+#define WRITE  1


Hello Christoph,

If you have to resend this patch series, please fix the spelling of 
"definitions".


Thanks,

Bart.


Re: [PATCH 45/60] block: bio: introduce bio_for_each_segment_all_rd() and its write pair

2016-11-01 Thread Theodore Ts'o
On Tue, Nov 01, 2016 at 07:51:27AM +0800, Ming Lei wrote:
> Sorry for forgetting to mention one important point:
> 
> - after multipage bvec is introduced, the iterated bvec pointer
> still points to a single page bvec, which is generated in-flight
> and is read-only actually. That is the motivation behind the introduction
> of bio_for_each_segment_all_rd().
> 
> So maybe bio_for_each_page_all_ro() is better?
> 
> For _wt(), we still can keep it as bio_for_each_segment(), which also
> reflects that now the iterated bvec points to one whole segment if
> we name _rd as bio_for_each_page_all_ro().

I'm agnostic as to what the right names are --- my big concern is
there is an explosion of bio_for_each_page_* functions, and that there
isn't good documentation about (a) when to use each of these
functions, and (b) why.  I was going through the patch series, and it
was hard for me to figure out why, and I was looking through all of
the patches.  Once all of the patches are merged in, I am concerned
this is going to be a massive trapdoor that will snare a large number of
unwitting developers.

As far as my preference goes, from an abstract perspective: if one version
(the read-write variant, I presume) is always safe, while the other (the
read-only variant) is faster but only works under restricted circumstances,
then name the safe version so it is the "default", give the more dangerous
one a name that makes it a bit more obvious what you have to do in order to
use it safely, and then very clearly document both in the sources and in the
Documentation directory: what the issues are and what you have to do in
order to use the faster version.

Cheers,

- Ted



expose the queue mapping for SCSI drivers V2

2016-11-01 Thread Christoph Hellwig
In 4.9 I've added support in the interrupt layer to automatically
assign the interrupt affinity at interrupt allocation time, and
expose that information to blk-mq.

This series extends that so that SCSI drivers can pass on the information
as well.  The SCSI part is fairly trivial, although we need to also
export the default queue mapping function in blk-mq to keep things simple.

I've also converted over the smartpqi driver as an example as it's the
easiest of the multiqueue SCSI drivers to convert.
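
For reference, the driver side of the 4.9 interface is essentially just
allocating the vectors with affinity spreading enabled, along these
lines (a sketch; mydrv_setup_irqs is a made-up name):

#include <linux/pci.h>

static int mydrv_setup_irqs(struct pci_dev *pdev, unsigned int nr_queues)
{
	/* the irq layer assigns the CPU affinity at allocation time,
	 * and blk-mq can then reuse that mapping */
	return pci_alloc_irq_vectors(pdev, 1, nr_queues,
				     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
}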

Changes since V1:
 - move the EXPORT_SYMBOL of blk_mq_map_queues to the right patch
 - added Reviewed-by, Acked-by and Tested-by tags


[PATCH 1/3] blk-mq: export blk_mq_map_queues

2016-11-01 Thread Christoph Hellwig
This will allow SCSI to have a single blk_mq_ops structure that either
lets the LLDD map the queues to PCIe MSIx vectors or use the default.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Johannes Thumshirn 
---
 block/blk-mq-cpumap.c  | 1 +
 block/blk-mq.h | 1 -
 include/linux/blk-mq.h | 1 +
 3 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 19b1d9c..8e61e86 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -87,6 +87,7 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
free_cpumask_var(cpus);
return 0;
 }
+EXPORT_SYMBOL_GPL(blk_mq_map_queues);
 
 /*
  * We have no quick way of doing reverse lookups. This is only used at
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e5d2524..5347f01 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -38,7 +38,6 @@ void blk_mq_disable_hotplug(void);
 /*
  * CPU -> queue mappings
  */
-int blk_mq_map_queues(struct blk_mq_tag_set *set);
 extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
 
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 535ab2e..6c0fb25 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -237,6 +237,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_mq_freeze_queue_start(struct request_queue *q);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set);
 
+int blk_mq_map_queues(struct blk_mq_tag_set *set);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 /*
-- 
2.1.4



[PATCH 2/3] scsi: allow LLDDs to expose the queue mapping to blk-mq

2016-11-01 Thread Christoph Hellwig
Just hand through the blk-mq map_queues method in the host template.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Johannes Thumshirn 
---
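As a usage illustration, a hypothetical LLDD would wire this up along
the lines below (a sketch - the mydrv_* names are invented; the
smartpqi conversion in the next patch is the real example):

#include <scsi/scsi_host.h>
#include <linux/blk-mq-pci.h>

struct mydrv_ctrl {
	struct pci_dev *pdev;
};

static int mydrv_map_queues(struct Scsi_Host *shost)
{
	struct mydrv_ctrl *ctrl = shost_priv(shost);

	/* map each hw queue to the CPUs of its MSI-X vector */
	return blk_mq_pci_map_queues(&shost->tag_set, ctrl->pdev);
}

static struct scsi_host_template mydrv_template = {
	.name		= "mydrv",
	.map_queues	= mydrv_map_queues,
};
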
 drivers/scsi/scsi_lib.c  | 10 ++
 include/scsi/scsi_host.h |  8 
 2 files changed, 18 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2cca9cf..f23ec24 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1990,6 +1990,15 @@ static void scsi_exit_request(void *data, struct request *rq,
kfree(cmd->sense_buffer);
 }
 
+static int scsi_map_queues(struct blk_mq_tag_set *set)
+{
+   struct Scsi_Host *shost = container_of(set, struct Scsi_Host, tag_set);
+
+   if (shost->hostt->map_queues)
+   return shost->hostt->map_queues(shost);
+   return blk_mq_map_queues(set);
+}
+
 static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
 {
struct device *host_dev;
@@ -2082,6 +2091,7 @@ static struct blk_mq_ops scsi_mq_ops = {
.timeout= scsi_timeout,
.init_request   = scsi_init_request,
.exit_request   = scsi_exit_request,
+   .map_queues = scsi_map_queues,
 };
 
 struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7e4cd53..36680f1 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -278,6 +278,14 @@ struct scsi_host_template {
int (* change_queue_depth)(struct scsi_device *, int);
 
/*
+* This function lets the driver expose the queue mapping
+* to the block layer.
+*
+* Status: OPTIONAL
+*/
+   int (* map_queues)(struct Scsi_Host *shost);
+
+   /*
 * This function determines the BIOS parameters for a given
 * harddisk.  These tend to be numbers that are made up by
 * the host adapter.  Parameters:
-- 
2.1.4



Re: [RFC PATCH 5/6] nvme: Add unlock_from_suspend

2016-11-01 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 10:18:13AM +0200, Sagi Grimberg wrote:
> > +
> > +   return nvme_insert_rq(q, req, 1, sec_submit_endio);
> 
> No need to introduce nvme_insert_rq at all, just call
> blk_mq_insert_request (other examples call blk_execute_rq_nowait
> but it's pretty much the same...)

blk_execute_rq_nowait is the API to use - blk_mq_insert_request isn't
even exported.


[PATCH 08/16] block: replace REQ_NOIDLE with REQ_IDLE

2016-11-01 Thread Christoph Hellwig
Noidle should be the default for writes as seen by all the compound
definitions in fs.h using it.  In fact only direct I/O really should
be using NOIDLE, so turn the whole flag around to get the defaults
right, which will make our life much easier, especially once the
WRITE_* defines go away.

This assumes all the existing "raw" users of REQ_SYNC for writes
want noidle behavior, which seems to be spot on from a quick audit.

Signed-off-by: Christoph Hellwig 
---
 Documentation/block/cfq-iosched.txt | 32 
 block/cfq-iosched.c | 11 ---
 drivers/block/drbd/drbd_actlog.c|  2 +-
 include/linux/blk_types.h   |  4 ++--
 include/linux/fs.h  | 10 +-
 include/trace/events/f2fs.h |  2 +-
 6 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
index 1e4f835..895bd38 100644
--- a/Documentation/block/cfq-iosched.txt
+++ b/Documentation/block/cfq-iosched.txt
@@ -240,11 +240,11 @@ All cfq queues doing synchronous sequential IO go on to sync-idle tree.
 On this tree we idle on each queue individually.
 
 All synchronous non-sequential queues go on sync-noidle tree. Also any
-request which are marked with REQ_NOIDLE go on this service tree. On this
-tree we do not idle on individual queues instead idle on the whole group
-of queues or the tree. So if there are 4 queues waiting for IO to dispatch
-we will idle only once last queue has dispatched the IO and there is
-no more IO on this service tree.
+synchronous write request which is not marked with REQ_IDLE goes on this
+service tree. On this tree we do not idle on individual queues instead idle
+on the whole group of queues or the tree. So if there are 4 queues waiting
+for IO to dispatch we will idle only once last queue has dispatched the IO
+and there is no more IO on this service tree.
 
 All async writes go on async service tree. There is no idling on async
 queues.
@@ -257,17 +257,17 @@ tree idling provides isolation with buffered write queues on async tree.
 
 FAQ
 ===
-Q1. Why to idle at all on queues marked with REQ_NOIDLE.
+Q1. Why to idle at all on queues not marked with REQ_IDLE.
 
-A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
-with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
+A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
+with REQ_IDLE. This helps in providing isolation with all the sync-idle
 queues. Otherwise in presence of many sequential readers, other
 synchronous IO might not get fair share of disk.
 
 For example, if there are 10 sequential readers doing IO and they get
-100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
-roughly after 1 second. If after completion of REQ_NOIDLE request we
-do not idle, and after a couple of milli seconds a another REQ_NOIDLE
+100ms each. If a !REQ_IDLE request comes in, it will be scheduled
+roughly after 1 second. If after completion of !REQ_IDLE request we
+do not idle, and after a couple of milli seconds a another !REQ_IDLE
 request comes in, again it will be scheduled after 1second. Repeat it
 and notice how a workload can lose its disk share and suffer due to
 multiple sequential readers.
@@ -276,16 +276,16 @@ A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
 context of fsync, and later some journaling data is written. Journaling
 data comes in only after fsync has finished its IO (atleast for ext4
 that seemed to be the case). Now if one decides not to idle on fsync
-thread due to REQ_NOIDLE, then next journaling write will not get
+thread due to !REQ_IDLE, then next journaling write will not get
 scheduled for another second. A process doing small fsync, will suffer
 badly in presence of multiple sequential readers.
 
-Hence doing tree idling on threads using REQ_NOIDLE flag on requests
+Hence doing tree idling on threads using !REQ_IDLE flag on requests
 provides isolation from multiple sequential readers and at the same
 time we do not idle on individual threads.
 
-Q2. When to specify REQ_NOIDLE
-A2. I would think whenever one is doing synchronous write and not expecting
+Q2. When to specify REQ_IDLE
+A2. I would think whenever one is doing synchronous write and expecting
 more writes to be dispatched from same context soon, should be able
-to specify REQ_NOIDLE on writes and that probably should work well for
+to specify REQ_IDLE on writes and that probably should work well for
 most of the cases.
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f28db97..dcbed8c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3914,6 +3914,12 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->seek_history |= (sdist 

[PATCH 12/16] block,fs: untangle fs.h and blk_types.h

2016-11-01 Thread Christoph Hellwig
Nothing in fs.h should require blk_types.h to be included.

Signed-off-by: Christoph Hellwig 
---
 fs/9p/vfs_addr.c  | 1 +
 fs/cifs/connect.c | 1 +
 fs/cifs/transport.c   | 1 +
 fs/gfs2/dir.c | 1 +
 fs/isofs/compress.c   | 1 +
 fs/ntfs/logfile.c | 1 +
 fs/ocfs2/buffer_head_io.c | 1 +
 fs/orangefs/inode.c   | 1 +
 fs/reiserfs/stree.c   | 1 +
 fs/squashfs/block.c   | 1 +
 fs/udf/dir.c  | 1 +
 fs/udf/directory.c| 1 +
 fs/udf/inode.c| 1 +
 fs/ufs/balloc.c   | 1 +
 include/linux/fs.h| 2 +-
 include/linux/swap.h  | 1 +
 include/linux/writeback.h | 2 ++
 lib/iov_iter.c| 1 +
 18 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 6181ad7..5ca1fb0 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 #include 
 #include 
 
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index aab5227..db726e8 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include "cifspdu.h"
 #include "cifsglob.h"
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 206a597..5f02edc 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 #include 
 #include 
 #include 
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 3cdde5f..7911321 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -62,6 +62,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include "gfs2.h"
 #include "incore.h"
diff --git a/fs/isofs/compress.c b/fs/isofs/compress.c
index 44af14b..9bb2fe3 100644
--- a/fs/isofs/compress.c
+++ b/fs/isofs/compress.c
@@ -18,6 +18,7 @@
 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include 
 #include 
diff --git a/fs/ntfs/logfile.c b/fs/ntfs/logfile.c
index 761f12f..353379f 100644
--- a/fs/ntfs/logfile.c
+++ b/fs/ntfs/logfile.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include "attrib.h"
 #include "aops.h"
diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
index 8f040f8..d9ebe11 100644
--- a/fs/ocfs2/buffer_head_io.c
+++ b/fs/ocfs2/buffer_head_io.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include 
 
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index ef3b4eb..551bc74 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -8,6 +8,7 @@
  *  Linux VFS inode operations.
  */
 
+#include <linux/blk_types.h>
 #include "protocol.h"
 #include "orangefs-kernel.h"
 #include "orangefs-bufmap.h"
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index a97e352..0037aea 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 #include "reiserfs.h"
 #include 
 #include 
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index ce62a38..2751476 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/udf/dir.c b/fs/udf/dir.c
index aaec13c..2d0e028 100644
--- a/fs/udf/dir.c
+++ b/fs/udf/dir.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include "udf_i.h"
 #include "udf_sb.h"
diff --git a/fs/udf/directory.c b/fs/udf/directory.c
index 988d535..7aa48bd 100644
--- a/fs/udf/directory.c
+++ b/fs/udf/directory.c
@@ -16,6 +16,7 @@
 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 struct fileIdentDesc *udf_fileident_read(struct inode *dir, loff_t *nf_pos,
 struct udf_fileident_bh *fibh,
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index aad4640..0f3db71 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include "udf_i.h"
 #include "udf_sb.h"
diff --git a/fs/ufs/balloc.c b/fs/ufs/balloc.c
index 67e085d..b035af5 100644
--- a/fs/ufs/balloc.c
+++ b/fs/ufs/balloc.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 #include 
 
 #include "ufs_fs.h"
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5b0a9b7..8533e9d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -28,7 +28,6 @@
 #include 
 #include 
 #include 
-#include <linux/blk_types.h>
 #include 
 #include 
 #include 
@@ -38,6 +37,7 @@
 
 struct backing_dev_info;
 struct bdi_writeback;
+struct bio;
 struct export_operations;
 struct hd_geometry;
 struct iovec;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a56523c..3a6aebc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 #include 
 
 struct notifier_block;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 797100e..e4c3870 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,6 +10,8 @@
 #include 
 #include 
 
+struct bio;
+
 DECLARE_PER_CPU(int, dirty_throttle_leaks);
 
 

[PATCH 16/16] block: remove the CONFIG_BLOCK ifdef in blk_types.h

2016-11-01 Thread Christoph Hellwig
Now that we have a separate header for struct bio_vec there is absolutely
no excuse for including this header from non-block I/O code.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blk_types.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 63b750a..bb92102 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -17,7 +17,6 @@ struct io_context;
 struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *);
 
-#ifdef CONFIG_BLOCK
 /*
  * main unit of I/O for the block layer and lower layers (ie drivers and
  * stacking drivers)
@@ -126,8 +125,6 @@ struct bio {
 #define BVEC_POOL_OFFSET   (16 - BVEC_POOL_BITS)
 #define BVEC_POOL_IDX(bio) ((bio)->bi_flags >> BVEC_POOL_OFFSET)
 
-#endif /* CONFIG_BLOCK */
-
 /*
  * Operations and flags common to the bio and request structures.
  * We use 8 bits for encoding the operation, and the remaining 24 for flags.
-- 
2.1.4



[PATCH 13/16] arm, arm64: don't include blk_types.h in <asm/io.h>

2016-11-01 Thread Christoph Hellwig
No need for it - we only use struct bio_vec in prototypes and already have
forward declarations for it.

Signed-off-by: Christoph Hellwig 
---
 arch/arm/include/asm/io.h   | 1 -
 arch/arm64/include/asm/io.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
index 021692c..42871fb 100644
--- a/arch/arm/include/asm/io.h
+++ b/arch/arm/include/asm/io.h
@@ -25,7 +25,6 @@
 
 #include 
 #include 
-#include <linux/blk_types.h>
 #include 
 #include 
 #include 
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 0bba427..0c00c87 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,7 +22,6 @@
 #ifdef __KERNEL__
 
 #include 
-#include <linux/blk_types.h>
 
 #include 
 #include 
-- 
2.1.4



[PATCH 15/16] mm: only include blk_types in swap.h if CONFIG_SWAP is enabled

2016-11-01 Thread Christoph Hellwig
It's only needed for the CONFIG_SWAP-only use of bio_end_io_t.

Because CONFIG_SWAP implies CONFIG_BLOCK this will allow us to drop some
ifdefs in blk_types.h.

Instead we'll need to add a few explicit includes that were implicit
before, though.

Signed-off-by: Christoph Hellwig 
---
 drivers/staging/lustre/include/linux/lnet/types.h | 1 +
 drivers/staging/lustre/lustre/llite/rw.c  | 1 +
 fs/ntfs/aops.c| 1 +
 fs/ntfs/mft.c | 1 +
 fs/reiserfs/inode.c   | 1 +
 fs/splice.c   | 1 +
 include/linux/swap.h  | 4 +++-
 7 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/types.h b/drivers/staging/lustre/include/linux/lnet/types.h
index f8be0e2..8ca1e9d 100644
--- a/drivers/staging/lustre/include/linux/lnet/types.h
+++ b/drivers/staging/lustre/include/linux/lnet/types.h
@@ -34,6 +34,7 @@
 #define __LNET_TYPES_H__
 
 #include 
+#include <linux/blk_types.h>
 
 /** \addtogroup lnet
  * @{
diff --git a/drivers/staging/lustre/lustre/llite/rw.c b/drivers/staging/lustre/lustre/llite/rw.c
index 50c0152..76a6836 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -47,6 +47,7 @@
 #include 
 /* current_is_kswapd() */
 #include 
+#include <linux/blk_types.h>
 
 #define DEBUG_SUBSYSTEM S_LLITE
 
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index fe251f1..d0cf6fe 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include "aops.h"
 #include "attrib.h"
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index d3c0096..b6f4021 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 #include "attrib.h"
 #include "aops.h"
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 58b2ded..cfeae9b 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include <linux/blk_types.h>
 
 int reiserfs_commit_write(struct file *f, struct page *page,
  unsigned from, unsigned to);
diff --git a/fs/splice.c b/fs/splice.c
index 153d4f3..51492f2 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -17,6 +17,7 @@
  * Copyright (C) 2006 Ingo Molnar 
  *
  */
+#include <linux/blk_types.h>
 #include 
 #include 
 #include 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3a6aebc..bfee1af 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -11,7 +11,6 @@
 #include 
 #include 
 #include 
-#include <linux/blk_types.h>
 #include 
 
 struct notifier_block;
@@ -352,6 +351,9 @@ extern int kswapd_run(int nid);
 extern void kswapd_stop(int nid);
 
 #ifdef CONFIG_SWAP
+
+#include <linux/blk_types.h> /* for bio_end_io_t */
+
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-- 
2.1.4



[PATCH 14/16] ceph: don't include blk_types.h in messenger.h

2016-11-01 Thread Christoph Hellwig
The file only needs the struct bvec_iter declaration, which is available
from bvec.h.

Signed-off-by: Christoph Hellwig 
---
 include/linux/ceph/messenger.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 8dbd787..67bcef2 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -1,7 +1,7 @@
 #ifndef __FS_CEPH_MESSENGER_H
 #define __FS_CEPH_MESSENGER_H
 
-#include <linux/blk_types.h>
+#include <linux/bvec.h>
 #include 
 #include 
 #include 
-- 
2.1.4



[PATCH 05/16] btrfs: use op_is_sync to check for synchronous requests

2016-11-01 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/volumes.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3a57f99..c8454a8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -930,7 +930,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct inode *inode,
 
	atomic_inc(&fs_info->nr_async_submits);

-	if (bio->bi_opf & REQ_SYNC)
+	if (op_is_sync(bio->bi_opf))
		btrfs_set_work_high_priority(&async->work);

	btrfs_queue_work(fs_info->workers, &async->work);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 71a60cc..deda46cf 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6100,7 +6100,7 @@ static noinline void btrfs_schedule_bio(struct btrfs_root *root,
bio->bi_next = NULL;
 
	spin_lock(&device->io_lock);
-	if (bio->bi_opf & REQ_SYNC)
+	if (op_is_sync(bio->bi_opf))
		pending_bios = &device->pending_sync_bios;
	else
		pending_bios = &device->pending_bios;
-- 
2.1.4



[PATCH 11/16] block, fs: move submit_bio to bio.h

2016-11-01 Thread Christoph Hellwig
This is where all the other bio operations live, so users must include
bio.h anyway.

Signed-off-by: Christoph Hellwig 
---
 include/linux/bio.h | 2 ++
 include/linux/fs.h  | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index fe9a170..5c604b49 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -404,6 +404,8 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
 
 }
 
+extern blk_qc_t submit_bio(struct bio *);
+
 extern void bio_endio(struct bio *);
 
 static inline void bio_io_error(struct bio *bio)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0ad36e0..5b0a9b7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2717,7 +2717,6 @@ static inline void remove_inode_hash(struct inode *inode)
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
-extern blk_qc_t submit_bio(struct bio *);
 extern int bdev_read_only(struct block_device *);
 #endif
 extern int set_blocksize(struct block_device *, int);
-- 
2.1.4



[PATCH 07/16] block: treat REQ_FUA and REQ_PREFLUSH as synchronous

2016-11-01 Thread Christoph Hellwig
Instead of requiring everyone to specify the REQ_SYNC flag as well.

Signed-off-by: Christoph Hellwig 
---
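As a side note, the resulting semantics are easy to model in userspace
(the numeric flag values below are illustrative only, not the kernel's):

#include <assert.h>
#include <stdbool.h>

#define REQ_OP_READ	0u
#define REQ_OP_WRITE	1u
#define REQ_OP_MASK	0xffu		/* low 8 bits encode the op */
#define REQ_SYNC	(1u << 8)
#define REQ_FUA		(1u << 9)
#define REQ_PREFLUSH	(1u << 10)

static bool op_is_sync(unsigned int op)
{
	return (op & REQ_OP_MASK) == REQ_OP_READ ||
		(op & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH));
}

int main(void)
{
	assert(op_is_sync(REQ_OP_READ));		 /* reads: always */
	assert(!op_is_sync(REQ_OP_WRITE));		 /* plain write: no */
	assert(op_is_sync(REQ_OP_WRITE | REQ_SYNC));	 /* explicit opt-in */
	assert(op_is_sync(REQ_OP_WRITE | REQ_FUA));	 /* now implied */
	assert(op_is_sync(REQ_OP_WRITE | REQ_PREFLUSH)); /* now implied */
	return 0;
}
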
 include/linux/blk_types.h | 8 +++-
 include/linux/fs.h| 6 +++---
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3fa62ca..107d23d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -216,9 +216,15 @@ static inline bool op_is_write(unsigned int op)
return (op & 1);
 }
 
+/*
+ * Reads are always treated as synchronous, as are requests with the FUA or
+ * PREFLUSH flag.  Other operations may be marked as synchronous using the
+ * REQ_SYNC flag.
+ */
 static inline bool op_is_sync(unsigned int op)
 {
-   return (op & REQ_OP_MASK) == REQ_OP_READ || (op & REQ_SYNC);
+   return (op & REQ_OP_MASK) == REQ_OP_READ ||
+   (op & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH));
 }
 
 typedef unsigned int blk_qc_t;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5e0078f..ccedccb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -199,9 +199,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define READ_SYNC  0
 #define WRITE_SYNC (REQ_SYNC | REQ_NOIDLE)
 #define WRITE_ODIRECT  REQ_SYNC
-#define WRITE_FLUSH	(REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH)
-#define WRITE_FUA	(REQ_SYNC | REQ_NOIDLE | REQ_FUA)
-#define WRITE_FLUSH_FUA	(REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH | REQ_FUA)
+#define WRITE_FLUSH	(REQ_NOIDLE | REQ_PREFLUSH)
+#define WRITE_FUA	(REQ_NOIDLE | REQ_FUA)
+#define WRITE_FLUSH_FUA	(REQ_NOIDLE | REQ_PREFLUSH | REQ_FUA)
 
 /*
  * Attribute flags.  These should be or-ed together to figure out what
-- 
2.1.4



[PATCH 04/16] bcache: use op_is_sync to check for synchronous requests

2016-11-01 Thread Christoph Hellwig
(and remove one layer of masking for the op_is_write call next to it).

Signed-off-by: Christoph Hellwig 
---
 drivers/md/bcache/request.c   | 4 ++--
 drivers/md/bcache/writeback.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 40ffe5e..e8a2b69 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -404,8 +404,8 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
 
if (!congested &&
mode == CACHE_MODE_WRITEBACK &&
-   op_is_write(bio_op(bio)) &&
-   (bio->bi_opf & REQ_SYNC))
+   op_is_write(bio->bi_opf) &&
+   op_is_sync(bio->bi_opf))
goto rescale;
 
	spin_lock(&dc->io_lock);
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 301eaf5..629bd1a 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -57,8 +57,7 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
if (would_skip)
return false;
 
-   return bio->bi_opf & REQ_SYNC ||
-   in_use <= CUTOFF_WRITEBACK;
+   return op_is_sync(bio->bi_opf) || in_use <= CUTOFF_WRITEBACK;
 }
 
 static inline void bch_writeback_queue(struct cached_dev *dc)
-- 
2.1.4



[PATCH 03/16] umem: use op_is_sync to check for synchronous requests

2016-11-01 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 drivers/block/umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index be90e15..46f4c71 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -535,7 +535,7 @@ static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
*card->biotail = bio;
bio->bi_next = NULL;
	card->biotail = &bio->bi_next;
-   if (bio->bi_opf & REQ_SYNC || !mm_check_plugged(card))
+   if (op_is_sync(bio->bi_opf) || !mm_check_plugged(card))
activate(card);
spin_unlock_irq(>lock);
 
-- 
2.1.4



[PATCH 06/16] block: don't use REQ_SYNC in the READ_SYNC definition

2016-11-01 Thread Christoph Hellwig
Reads are synchronous per definition, don't add another flag for it.

Signed-off-by: Christoph Hellwig 
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3e878f..5e0078f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -196,7 +196,7 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define READ   REQ_OP_READ
 #define WRITE  REQ_OP_WRITE
 
-#define READ_SYNC  REQ_SYNC
+#define READ_SYNC  0
 #define WRITE_SYNC (REQ_SYNC | REQ_NOIDLE)
 #define WRITE_ODIRECT  REQ_SYNC
 #define WRITE_FLUSH	(REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH)
-- 
2.1.4



[PATCH 02/16] blk-cgroup: use op_is_sync to check for synchronous requests

2016-11-01 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 include/linux/blk-cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index ddaf28d..01b62e7 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -599,7 +599,7 @@ static inline void blkg_rwstat_add(struct blkg_rwstat *rwstat,
 
__percpu_counter_add(cnt, val, BLKG_STAT_CPU_BATCH);
 
-   if (op & REQ_SYNC)
+   if (op_is_sync(op))
		cnt = &rwstat->cpu_cnt[BLKG_RWSTAT_SYNC];
	else
		cnt = &rwstat->cpu_cnt[BLKG_RWSTAT_ASYNC];
-- 
2.1.4



[PATCH 01/16] cfq-iosched: use op_is_sync instead of opencoding it

2016-11-01 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 block/cfq-iosched.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c96186a..f28db97 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -912,15 +912,6 @@ static inline struct cfq_data *cic_to_cfqd(struct cfq_io_cq *cic)
 }
 
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
- */
-static inline bool cfq_bio_sync(struct bio *bio)
-{
-   return bio_data_dir(bio) == READ || (bio->bi_opf & REQ_SYNC);
-}
-
-/*
  * scheduler run of queue, if there are requests pending and no one in the
  * driver that will restart queueing
  */
@@ -2490,7 +2481,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
if (!cic)
return NULL;
 
-   cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+   cfqq = cic_to_cfqq(cic, op_is_sync(bio->bi_opf));
if (cfqq)
		return elv_rb_find(&cfqq->sort_list, bio_end_sector(bio));
 
@@ -2604,13 +2595,14 @@ static int cfq_allow_bio_merge(struct request_queue *q, struct request *rq,
   struct bio *bio)
 {
struct cfq_data *cfqd = q->elevator->elevator_data;
+   bool is_sync = op_is_sync(bio->bi_opf);
struct cfq_io_cq *cic;
struct cfq_queue *cfqq;
 
/*
 * Disallow merge of a sync bio into an async request.
 */
-   if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+   if (is_sync && !rq_is_sync(rq))
return false;
 
/*
@@ -2621,7 +2613,7 @@ static int cfq_allow_bio_merge(struct request_queue *q, struct request *rq,
if (!cic)
return false;
 
-   cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+   cfqq = cic_to_cfqq(cic, is_sync);
return cfqq == RQ_CFQQ(rq);
 }
 
-- 
2.1.4



Re: [RFC PATCH 5/6] nvme: Add unlock_from_suspend

2016-11-01 Thread Sagi Grimberg



+struct sed_cb_data {
+   sec_cb  *cb;
+   void*cb_data;
+   struct nvme_command cmd;
+};
+
+static void sec_submit_endio(struct request *req, int error)
+{
+   struct sed_cb_data *sed_data = req->end_io_data;
+
+   if (sed_data->cb)
+   sed_data->cb(error, sed_data->cb_data);
+
+   kfree(sed_data);
+   blk_mq_free_request(req);
+}
+
+static int nvme_insert_rq(struct request_queue *q, struct request *rq,
+ int at_head, rq_end_io_fn *done)
+{
+   WARN_ON(rq->cmd_type == REQ_TYPE_FS);
+
+   rq->end_io = done;
+
+   if (!q->mq_ops)
+   return -EINVAL;
+
+   blk_mq_insert_request(rq, at_head, true, true);
+
+   return 0;
+}


No need for this function... you control the call site...


+
+static int nvme_sec_submit(void *data, u8 opcode, u16 SPSP,
+  u8 SECP, void *buffer, size_t len,
+  sec_cb *cb, void *cb_data)
+{
+   struct request_queue *q;
+   struct request *req;
+   struct sed_cb_data *sed_data;
+   struct nvme_ns *ns;
+   struct nvme_command *cmd;
+   int ret;
+
+   ns = data;//bdev->bd_disk->private_data;


??

you don't even have data anywhere in here...


+
+   sed_data = kzalloc(sizeof(*sed_data), GFP_NOWAIT);
+   if (!sed_data)
+   return -ENOMEM;
+   sed_data->cb = cb;
+   sed_data->cb_data = cb_data;
+	cmd = &sed_data->cmd;
+
+   cmd->common.opcode = opcode;
+   cmd->common.nsid = ns->ns_id;
+   cmd->common.cdw10[0] = SECP << 24 | SPSP << 8;
+   cmd->common.cdw10[1] = len;
+
+   q = ns->ctrl->admin_q;
+
+   req = nvme_alloc_request(q, cmd, 0, NVME_QID_ANY);
+   if (IS_ERR(req)) {
+   ret = PTR_ERR(req);
+   goto err_free;
+   }
+
+   req->timeout = ADMIN_TIMEOUT;
+   req->special = NULL;
+
+   if (buffer && len) {
+   ret = blk_rq_map_kern(q, req, buffer, len, GFP_NOWAIT);
+   if (ret) {
+   blk_mq_free_request(req);
+   goto err_free;
+   }
+   }
+
+   req->end_io_data = sed_data;
+   //req->rq_disk = bdev->bd_disk;


??


+
+   return nvme_insert_rq(q, req, 1, sec_submit_endio);


No need to introduce nvme_insert_rq at all, just call
blk_mq_insert_request (other examples call blk_execute_rq_nowait
but it's pretty much the same...)
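
i.e. the call site can just do (sketch):

	blk_execute_rq_nowait(q, NULL, req, 1, sec_submit_endio);
	return 0;

blk_execute_rq_nowait() takes the completion callback directly, so the
rq->end_io assignment in the wrapper goes away as well.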


@@ -582,6 +583,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
struct nvme_command cmnd;
unsigned map_len;
int ret = BLK_MQ_RQ_QUEUE_OK;
+   unsigned long flags;

/*
 * If formated with metadata, require the block layer provide a buffer
@@ -614,18 +616,18 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
cmnd.common.command_id = req->tag;
blk_mq_start_request(req);

-	spin_lock_irq(&nvmeq->q_lock);
+	spin_lock_irqsave(&nvmeq->q_lock, flags);
	if (unlikely(nvmeq->cq_vector < 0)) {
		if (ns && !test_bit(NVME_NS_DEAD, &ns->flags))
			ret = BLK_MQ_RQ_QUEUE_BUSY;
		else
			ret = BLK_MQ_RQ_QUEUE_ERROR;
-		spin_unlock_irq(&nvmeq->q_lock);
+		spin_unlock_irqrestore(&nvmeq->q_lock, flags);
		goto out;
	}
	__nvme_submit_cmd(nvmeq, &cmnd);
	nvme_process_cq(nvmeq);
-	spin_unlock_irq(&nvmeq->q_lock);
+	spin_unlock_irqrestore(&nvmeq->q_lock, flags);


No documentation why this is needed...


return BLK_MQ_RQ_QUEUE_OK;
 out:
nvme_free_iod(dev, req);
@@ -635,11 +637,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 static void nvme_complete_rq(struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   struct nvme_dev *dev = iod->nvmeq->dev;
+   struct nvme_queue *nvmeq = iod->nvmeq;
+   struct nvme_dev *dev = nvmeq->dev;


This is a cleanup that should go in a different patch...


int error = 0;

nvme_unmap_data(dev, req);
-


Same here...


if (unlikely(req->errors)) {
if (nvme_req_needs_retry(req, req->errors)) {
req->retries++;
@@ -658,7 +660,6 @@ static void nvme_complete_rq(struct request *req)
"completing aborted command with status: %04x\n",
req->errors);
}
-


Here...


blk_mq_end_request(req, error);
 }

@@ -1758,10 +1759,11 @@ static void nvme_reset_work(struct work_struct *work)
 {
struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);
int result = -ENODEV;
-
+   bool was_suspend = false;
if (WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING))
goto out;

+   was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
/*
 * If we're called to reset a live controller first shut it down before
 * moving on.
@@ -1789,6 +1791,9 @@ static void 

Re: [PATCH] Unittest framework based on nvme-cli.

2016-11-01 Thread chaitany kulkarni
Hi,

Introducing an nvme-cli based unit test framework. All the test cases use
commands implemented in nvme-cli.

The goal is to have a simple, lightweight, and easily expandable
framework which we can use to develop various categories of unit tests
based on nvme-cli and improve overall development.

Over the period of time since the release of nvme-cli, various test
cases have been developed which are now an integral part of our device
driver testing.

These test cases have evolved around nvme-cli and can be used for
nvme-cli testing.

I would like to take this opportunity and share the first set of test
cases, which covers the most frequently used generic NVMe features from
the CLI:

1. nvme_attach_detach_ns_test.py
2. nvme_compare_test.py
3. nvme_create_max_ns_test.py
4. nvme_error_log_test.py
5. nvme_flush_test.py
6. nvme_format_test.py
7. nvme_get_features_test.py
8. nvme_read_write_test.py
9. nvme_smart_log_test.py
10. nvme_writeuncor_test.py
11. nvme_writezeros_test.py

Please have a look at the README for an overview and the process of
adding a new test case. The framework also has a sample skeleton which
can be readily used to write a new test case.

Assumptions for the current implementation:
1. nvme-cli is already installed on the system.
2. Only one test case can be executed at any given time.
3. Each test case has logical PRE, RUN and POST sections.
4. It assumes that the driver is loaded and the default namespace
"/dev/nvme0n1" is present.

(It is easy to add driver load/unload in the pre and post sections of
test cases as per requirement.)

I'd like to know what features and test cases most people would want as
part of this framework. Any suggestions are welcome; I'd like to
implement them.

On approval I would like to submit more test cases to enhance the framework.

Regards,
Chaitanya

On Mon, Oct 31, 2016 at 11:01 PM, Chaitanya Kulkarni
 wrote:
>
> From: Chaitanya Kulkarni 
>
> Signed-off-by: Chaitanya Kulkarni 
> ---
>  Makefile|   5 +-
>  tests/Makefile  |  48 +
>  tests/README|  84 
>  tests/TODO  |  14 ++
>  tests/config.json   |   5 +
>  tests/nvme_attach_detach_ns_test.py |  90 
>  tests/nvme_compare_test.py  |  79 
>  tests/nvme_create_max_ns_test.py|  97 +
>  tests/nvme_error_log_test.py|  86 
>  tests/nvme_flush_test.py|  61 ++
>  tests/nvme_format_test.py   | 145 +
>  tests/nvme_get_features_test.py | 103 ++
>  tests/nvme_read_write_test.py   |  72 +++
>  tests/nvme_simple_template_test.py  |  55 +
>  tests/nvme_smart_log_test.py|  86 
>  tests/nvme_test.py  | 395 
> 
>  tests/nvme_test_io.py   |  99 +
>  tests/nvme_test_logger.py   |  52 +
>  tests/nvme_writeuncor_test.py   |  76 +++
>  tests/nvme_writezeros_test.py   | 102 ++
>  20 files changed, 1753 insertions(+), 1 deletion(-)
>  create mode 100644 tests/Makefile
>  create mode 100644 tests/README
>  create mode 100644 tests/TODO
>  create mode 100644 tests/config.json
>  create mode 100644 tests/nvme_attach_detach_ns_test.py
>  create mode 100644 tests/nvme_compare_test.py
>  create mode 100644 tests/nvme_create_max_ns_test.py
>  create mode 100644 tests/nvme_error_log_test.py
>  create mode 100644 tests/nvme_flush_test.py
>  create mode 100644 tests/nvme_format_test.py
>  create mode 100644 tests/nvme_get_features_test.py
>  create mode 100644 tests/nvme_read_write_test.py
>  create mode 100644 tests/nvme_simple_template_test.py
>  create mode 100644 tests/nvme_smart_log_test.py
>  create mode 100644 tests/nvme_test.py
>  create mode 100644 tests/nvme_test_io.py
>  create mode 100644 tests/nvme_test_logger.py
>  create mode 100644 tests/nvme_writeuncor_test.py
>  create mode 100644 tests/nvme_writezeros_test.py
>
> diff --git a/Makefile b/Makefile
> index 117cbbe..33c7190 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -46,6 +46,9 @@ nvme.o: nvme.c nvme.h nvme-print.h nvme-ioctl.h argconfig.h 
> suffix.h nvme-lightn
>  doc: $(NVME)
> $(MAKE) -C Documentation
>
> +test:
> +   $(MAKE) -C tests/ run
> +
>  all: doc
>
>  clean:
> @@ -136,4 +139,4 @@ rpm: dist
> $(RPMBUILD) -ta nvme-$(NVME_VERSION).tar.gz
>
>  .PHONY: default doc all clean clobber install-man install-bin install
> -.PHONY: dist pkg dist-orig deb deb-light rpm FORCE
> +.PHONY: dist pkg dist-orig deb deb-light rpm FORCE test
> diff --git a/tests/Makefile b/tests/Makefile
> new file mode 100644
> index 000..c0f9f31
> --- /dev/null
> +++ b/tests/Makefile
> @@ -0,0 +1,48 @@
> +###
> +#
> +#Makefile : Allows user to run testcases, generate documentation, and
> +#   perform static 

[PATCH] Unittest framework based on nvme-cli.

2016-11-01 Thread Chaitanya Kulkarni
From: Chaitanya Kulkarni 

Signed-off-by: Chaitanya Kulkarni 
---
 Makefile|   5 +-
 tests/Makefile  |  48 +
 tests/README|  84 
 tests/TODO  |  14 ++
 tests/config.json   |   5 +
 tests/nvme_attach_detach_ns_test.py |  90 
 tests/nvme_compare_test.py  |  79 
 tests/nvme_create_max_ns_test.py|  97 +
 tests/nvme_error_log_test.py|  86 
 tests/nvme_flush_test.py|  61 ++
 tests/nvme_format_test.py   | 145 +
 tests/nvme_get_features_test.py | 103 ++
 tests/nvme_read_write_test.py   |  72 +++
 tests/nvme_simple_template_test.py  |  55 +
 tests/nvme_smart_log_test.py|  86 
 tests/nvme_test.py  | 395 
 tests/nvme_test_io.py   |  99 +
 tests/nvme_test_logger.py   |  52 +
 tests/nvme_writeuncor_test.py   |  76 +++
 tests/nvme_writezeros_test.py   | 102 ++
 20 files changed, 1753 insertions(+), 1 deletion(-)
 create mode 100644 tests/Makefile
 create mode 100644 tests/README
 create mode 100644 tests/TODO
 create mode 100644 tests/config.json
 create mode 100644 tests/nvme_attach_detach_ns_test.py
 create mode 100644 tests/nvme_compare_test.py
 create mode 100644 tests/nvme_create_max_ns_test.py
 create mode 100644 tests/nvme_error_log_test.py
 create mode 100644 tests/nvme_flush_test.py
 create mode 100644 tests/nvme_format_test.py
 create mode 100644 tests/nvme_get_features_test.py
 create mode 100644 tests/nvme_read_write_test.py
 create mode 100644 tests/nvme_simple_template_test.py
 create mode 100644 tests/nvme_smart_log_test.py
 create mode 100644 tests/nvme_test.py
 create mode 100644 tests/nvme_test_io.py
 create mode 100644 tests/nvme_test_logger.py
 create mode 100644 tests/nvme_writeuncor_test.py
 create mode 100644 tests/nvme_writezeros_test.py

diff --git a/Makefile b/Makefile
index 117cbbe..33c7190 100644
--- a/Makefile
+++ b/Makefile
@@ -46,6 +46,9 @@ nvme.o: nvme.c nvme.h nvme-print.h nvme-ioctl.h argconfig.h suffix.h nvme-lightn
 doc: $(NVME)
$(MAKE) -C Documentation
 
+test:
+   $(MAKE) -C tests/ run
+
 all: doc
 
 clean:
@@ -136,4 +139,4 @@ rpm: dist
$(RPMBUILD) -ta nvme-$(NVME_VERSION).tar.gz
 
 .PHONY: default doc all clean clobber install-man install-bin install
-.PHONY: dist pkg dist-orig deb deb-light rpm FORCE
+.PHONY: dist pkg dist-orig deb deb-light rpm FORCE test
diff --git a/tests/Makefile b/tests/Makefile
new file mode 100644
index 000..c0f9f31
--- /dev/null
+++ b/tests/Makefile
@@ -0,0 +1,48 @@
+###
+#
+#Makefile : Allows user to run testcases, generate documentation, and
+#   perform static code analysis.
+#
+###
+
+NOSE2_OPTIONS="--verbose"
+
+help: all
+
+all:
+   @echo "Usage:"
+   @echo
+   @echo "  make run - Run all testcases."
+   @echo "  make doc - Generate Documentation."
+   @echo "  make cleanall- removes *pyc, documentation."
+   @echo "  make static_check- runs pep8, flake8, pylint on code."
+
+doc:
+   @epydoc -v --output=Documentation *.py
+
+run:
+   @for i in `ls *.py`; \
+   do \
+   echo "Running $${i}"; \
+   TESTCASE_NAME=`echo $${i} | cut -f 1 -d '.'`; \
+   nose2 ${NOSE2_OPTIONS} $${TESTCASE_NAME}; \
+   done
+
+static_check:
+   @for i in `ls *.py`; \
+   do \
+   echo "Pylint :- " ; \
+   printf "%10s" $${i}; \
+   pylint $${i} 2>&1  | grep "^Your code" |  awk '{print $$7}';\
+   echo "";\
+   pep8 $${i}; \
+   echo "pep8 :- "; \
+   echo "flake8 :- "; \
+   flake8 $${i}; \
+   done
+
+cleanall: clean
+   @rm -fr *.pyc Documentation
+
+clean:
+   @rm -fr *.pyc
diff --git a/tests/README b/tests/README
new file mode 100644
index 000..70686d8
--- /dev/null
+++ b/tests/README
@@ -0,0 +1,84 @@
+nvmetests
+=
+
+This contains the NVMe unit test framework. The purpose of this framework
+is to use nvme-cli to test various supported commands and scenarios for
+an NVMe device.
+
+In the current implementation this framework uses nvme-cli to
+interact with the underlying controller/namespace.
+
+1. Common Package Dependencies
+--
+
+1. Python(>= 2.7.5 or >= 3.3)
+2. nose2(Installation guide http://nose2.readthedocs.io/)
+3. nvme-cli(https://github.com/linux-nvme/nvme-cli.git)
+
+2. Overview
+---
+
+This framework follows a simple class hierarchy. Each test