[PATCH BUGFIX/IMPROVEMENT V2 0/2] block, bfq: improve and refactor throughput-boosting logic

2017-08-03 Thread Paolo Valente
Hi,
these two patches improve the throughput-boosting logic in two
respects. The first patch refactors the parts of the device-idling
logic related to throughput boosting that are still scattered across
the source file bfq-iosched.c, and concentrates all of that logic in
one function. The second patch fixes/improves device idling for
flash-based devices that have no internal queueing of I/O requests.

The first patch was prompted by the second: finding the change made by
the second patch was more difficult than it had to be, because the
logic that decides whether to idle the device was scattered across
three functions.

The second patch provides a significant throughput boost for random
I/O on flash-based non-queueing devices. For example, on a HiKey
board, throughput increases by up to 125%, growing, e.g., from 6.9MB/s
to 15.6MB/s with two or three random readers in parallel.

Thanks,
Paolo

Paolo Valente (2):
  block,bfq: refactor device-idling logic
  block, bfq: boost throughput with flash-based non-queueing devices

 block/bfq-iosched.c | 144 
 block/bfq-iosched.h |  12 ++---
 2 files changed, 85 insertions(+), 71 deletions(-)

--
2.10.0


[PATCH BUGFIX/IMPROVEMENT V2 2/2] block, bfq: boost throughput with flash-based non-queueing devices

2017-08-03 Thread Paolo Valente
When a queue associated with a process remains empty, there are cases
where throughput gets boosted if the device is idled to await the
arrival of a new I/O request for that queue. Currently, BFQ assumes
that one of these cases is when the device has no internal queueing
(regardless of the properties of the I/O being served). Unfortunately,
this condition has proved to be too general. So, this commit refines it
as "the device has no internal queueing and is rotational".

This refinement provides a significant throughput boost with random
I/O, on flash-based storage without internal queueing. For example, on
a HiKey board, throughput increases by up to 125%, growing, e.g., from
6.9MB/s to 15.6MB/s with two or three random readers in parallel.

Signed-off-by: Paolo Valente 
Signed-off-by: Luca Miccio 
---
 block/bfq-iosched.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ccdc9e6..509f399 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3114,7 +3114,10 @@ static bool bfq_may_expire_for_budg_timeout(struct 
bfq_queue *bfqq)
 static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 {
struct bfq_data *bfqd = bfqq->bfqd;
-   bool idling_boosts_thr, idling_boosts_thr_without_issues,
+   bool rot_without_queueing =
+   !blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
+   bfqq_sequential_and_IO_bound,
+   idling_boosts_thr, idling_boosts_thr_without_issues,
idling_needed_for_service_guarantees,
asymmetric_scenario;
 
@@ -3133,28 +3136,34 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
bfq_class_idle(bfqq))
return false;
 
+   bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
+   bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);
+
/*
 * The next variable takes into account the cases where idling
 * boosts the throughput.
 *
 * The value of the variable is computed considering, first, that
 * idling is virtually always beneficial for the throughput if:
-* (a) the device is not NCQ-capable, or
-* (b) regardless of the presence of NCQ, the device is rotational
-* and the request pattern for bfqq is I/O-bound and sequential.
+* (a) the device is not NCQ-capable and rotational, or
+* (b) regardless of the presence of NCQ, the device is rotational and
+* the request pattern for bfqq is I/O-bound and sequential, or
+* (c) regardless of whether it is rotational, the device is
+* not NCQ-capable and the request pattern for bfqq is
+* I/O-bound and sequential.
 *
 * Secondly, and in contrast to the above item (b), idling an
 * NCQ-capable flash-based device would not boost the
 * throughput even with sequential I/O; rather it would lower
 * the throughput in proportion to how fast the device
 * is. Accordingly, the next variable is true if any of the
-* above conditions (a) and (b) is true, and, in particular,
-* happens to be false if bfqd is an NCQ-capable flash-based
-* device.
+* above conditions (a), (b) or (c) is true, and, in
+* particular, happens to be false if bfqd is an NCQ-capable
+* flash-based device.
 */
-   idling_boosts_thr = !bfqd->hw_tag ||
-   (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) &&
-bfq_bfqq_has_short_ttime(bfqq));
+   idling_boosts_thr = rot_without_queueing ||
+   ((!blk_queue_nonrot(bfqd->queue) || !bfqd->hw_tag) &&
+bfqq_sequential_and_IO_bound);
 
/*
 * The value of the next variable,
-- 
2.10.0



[PATCH BUGFIX/IMPROVEMENT V2 1/2] block,bfq: refactor device-idling logic

2017-08-03 Thread Paolo Valente
The logic that decides whether to idle the device is scattered across
three functions. Almost all of the logic is in the function
bfq_bfqq_may_idle, but (1) part of the decision is made in
bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
switch off idling regardless of the output of bfq_bfqq_may_idle. In
addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
their decisions as a function of parameters that are used, for similar
purposes, also in bfq_bfqq_may_idle. This commit addresses these
issues by moving all the logic into bfq_bfqq_may_idle.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 117 +++-
 block/bfq-iosched.h |  12 +++---
 2 files changed, 67 insertions(+), 62 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..ccdc9e6 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -128,7 +128,7 @@ BFQ_BFQQ_FNS(busy);
 BFQ_BFQQ_FNS(wait_request);
 BFQ_BFQQ_FNS(non_blocking_wait_rq);
 BFQ_BFQQ_FNS(fifo_expire);
-BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(has_short_ttime);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(IO_bound);
 BFQ_BFQQ_FNS(in_large_burst);
@@ -731,10 +731,10 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
unsigned int old_wr_coeff = bfqq->wr_coeff;
bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
 
-   if (bic->saved_idle_window)
-   bfq_mark_bfqq_idle_window(bfqq);
+   if (bic->saved_has_short_ttime)
+   bfq_mark_bfqq_has_short_ttime(bfqq);
else
-   bfq_clear_bfqq_idle_window(bfqq);
+   bfq_clear_bfqq_has_short_ttime(bfqq);
 
if (bic->saved_IO_bound)
bfq_mark_bfqq_IO_bound(bfqq);
@@ -2012,7 +2012,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
return;
 
bic->saved_ttime = bfqq->ttime;
-   bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+   bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node);
@@ -3038,8 +3038,8 @@ void bfq_bfqq_expire(struct bfq_data *bfqd,
}
 
bfq_log_bfqq(bfqd, bfqq,
-   "expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
-   slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+   "expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,
+   slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
 
/*
 * Increase, decrease or leave budget unchanged according to
@@ -3122,6 +3122,18 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
return true;
 
/*
+* Idling is performed only if slice_idle > 0. In addition, we
+* do not idle if
+* (a) bfqq is async
+* (b) bfqq is in the idle io prio class: in this case we do
+* not idle because we want to minimize the bandwidth that
+* queues in this class can steal from higher-priority queues
+*/
+   if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
+   bfq_class_idle(bfqq))
+   return false;
+
+   /*
 * The next variable takes into account the cases where idling
 * boosts the throughput.
 *
@@ -3142,7 +3154,7 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 */
idling_boosts_thr = !bfqd->hw_tag ||
(!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) &&
-bfq_bfqq_idle_window(bfqq));
+bfq_bfqq_has_short_ttime(bfqq));
 
/*
 * The value of the next variable,
@@ -3313,16 +3325,13 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq);
 
/*
-* We have now all the components we need to compute the return
-* value of the function, which is true only if both the following
-* conditions hold:
-* 1) bfqq is sync, because idling make sense only for sync queues;
-* 2) idling either boosts the throughput (without issues), or
-*is necessary to preserve service guarantees.
+* We have now all the components we need to compute the
+* return value of the function, which is true only if idling
+* either boosts the throughput (without issues), or is
+* necessary to preserve service guarantees.
 */
-   return bfq_bfqq_sync(bfqq) &&
-   (idling_boosts_thr_without_issues ||
-idling_needed_for_service_guarantees);
+   return idling_boosts_thr_without_issues ||
+   idling_needed_for_service_guarantees;
 }
 
 /*
@@ -3338,10 +3347,7 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
  */
 static 

Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time

2017-08-03 Thread Jens Axboe
On 08/03/2017 03:29 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>>  static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>> @@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx 
>> *hctx,
>>  {
>>  struct mq_inflight *mi = priv;
>>  
>> -if (rq->part == mi->part)
>> -mi->inflight++;
>> +if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>> +return;
> 
> Should the REQ_ATOM_STARTED test perhaps have been introduced in patch 3/4
> instead of in this patch?

It should, moved.

>> +if (rq->part == mi->part1) {
>> +mi->inflight[0]++;
>> +if (mi->part1->partno &&
>> +&part_to_disk(mi->part1)->part0 == mi->part2)
>> +mi->inflight[1]++;
>> +} else if (rq->part == mi->part2)
>> +mi->inflight[1]++;
>>  }
> 
> So mi->part2 may represent part0 but mi->part1 not? Does that deserve a 
> comment?
> 
> Additionally, shouldn't the mi->part2 == part0 test be moved out of the
> if-statement such that all requests are counted for part0 instead of
> storing the same count in inflight[0] and inflight[1] if mi->part2 == part0?

I think I'll just clean up the whole thing and get rid of part1/part2. The two
can only exist together if part1 is a partition, so it will be easier to just
unconditionally sum part+root when part is a partition. Then we only need to
pass in 'part', not part1/2.
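
A rough sketch of that simplification (a guess at the direction, not the final
patch; it keeps the REQ_ATOM_STARTED check from this series and uses
inflight[0] for the partition, inflight[1] for the whole disk):

	struct mq_inflight {
		struct hd_struct *part;
		unsigned int *inflight;
	};

	static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
					  struct request *rq, void *priv,
					  bool reserved)
	{
		struct mq_inflight *mi = priv;

		if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
			return;

		/* inflight[0] counts the partition that was asked for */
		if (rq->part == mi->part)
			mi->inflight[0]++;
		/* every started request also counts for the whole disk,
		 * but only when 'part' is really a partition */
		if (mi->part->partno)
			mi->inflight[1]++;
	}

With that, blk_mq_in_flight() would again take a single 'part' argument.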

>> -unsigned int blk_mq_in_flight(struct request_queue *q,
>> -   struct hd_struct *part)
>> +void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
>> +  struct hd_struct *part2, unsigned int *inflight)
>>  {
> 
> Should inflight be declared as an array to make it clear that it is a
> pointer to an array with two elements?

Sure, I can do that.

-- 
Jens Axboe



Re: [PATCH 3/4] blk-mq: provide internal in-flight variant

2017-08-03 Thread Jens Axboe
On 08/03/2017 03:25 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>> +  struct request *rq, void *priv,
>> +  bool reserved)
>> +{
>> +struct mq_inflight *mi = priv;
>> +
>> +if (rq->part == mi->part)
>> +mi->inflight++;
>> +}
>> [ ... ]
>> -static inline void part_inc_in_flight(struct request_queue *q,
>> -  struct hd_struct *part, int rw)
>> -{
>> -atomic_inc(&part->in_flight[rw]);
>> -if (part->partno)
>> -atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> -}
> 
> Hello Jens,
> 
> The existing part_inc_in_flight() code includes all requests in the in_flight
> statistics for part0, but the new code in blk_mq_check_inflight() does not. Is
> that on purpose? Should the rq->part == mi->part check perhaps be skipped if
> mi->part represents part0?

The existing code always increments for the partition in question, and
for the root if it's a partition. I'll take a look at that logic and
ensure it's all correct.

-- 
Jens Axboe



Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> We only need to iterate the queues and tags once, we if just pass

Hello Jens,

Regarding the patch description: it seems to me like in the text "we if" these
two words appear in the wrong order?

Bart.

Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>  static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> @@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx 
> *hctx,
>  {
>   struct mq_inflight *mi = priv;
>  
> - if (rq->part == mi->part)
> - mi->inflight++;
> + if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> + return;

Should the REQ_ATOM_STARTED test perhaps have been introduced in patch 3/4
instead of in this patch?

> + if (rq->part == mi->part1) {
> + mi->inflight[0]++;
> + if (mi->part1->partno &&
> + &part_to_disk(mi->part1)->part0 == mi->part2)
> + mi->inflight[1]++;
> + } else if (rq->part == mi->part2)
> + mi->inflight[1]++;
>  }

So mi->part2 may represent part0 but mi->part1 not? Does that deserve a comment?

Additionally, shouldn't the mi->part2 == part0 test be moved out of the
if-statement such that all requests are counted for part0 instead of storing
the same count in inflight[0] and inflight[1] if mi->part2 == part0?

> -unsigned int blk_mq_in_flight(struct request_queue *q,
> -struct hd_struct *part)
> +void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
> +   struct hd_struct *part2, unsigned int *inflight)
>  {

Should inflight be declared as an array to make it clear that it is a pointer to
an array with two elements?

Thanks,

Bart.

Re: [PATCH 3/4] blk-mq: provide internal in-flight variant

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> +   struct request *rq, void *priv,
> +   bool reserved)
> +{
> + struct mq_inflight *mi = priv;
> +
> + if (rq->part == mi->part)
> + mi->inflight++;
> +}
> [ ... ]
> -static inline void part_inc_in_flight(struct request_queue *q,
> -   struct hd_struct *part, int rw)
> -{
> - atomic_inc(&part->in_flight[rw]);
> - if (part->partno)
> - atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> -}

Hello Jens,

The existing part_inc_in_flight() code includes all requests in the in_flight
statistics for part0, but the new code in blk_mq_check_inflight() does not. Is
that on purpose? Should the rq->part == mi->part check perhaps be skipped if
mi->part represents part0?

Thanks,

Bart.

Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags

2017-08-03 Thread Jens Axboe
On 08/03/2017 02:50 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:40 -0600, Jens Axboe wrote:
>> On 08/03/2017 02:35 PM, Jens Axboe wrote:
 I agree with what you wrote in the description of this patch.
 However, since I have not yet found the code that clears tags->rqs[],
 would it be possible to show me that code?
>>>
>>> Since it's been a month since I wrote this code, I went and looked
>>> too.  My memory was that we set/clear it dynamically since we added
>>> scheduling, but looks like we don't clear it. The race is still valid
>>> for when someone runs a tag check in parallel with someone allocating
>>> a tag, since there's a window of time where the tag bit is set, but
>>> ->rqs[tag] isn't set yet. That's probably the race I hit, not the
>>> completion race mentioned in the change log.
>>
>> Rewrote the commit message:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight=1908e43118e688e41ac8656edcf3e7a150f3f5081
> 
> Hello Jens,
> 
> This is what I found in the updated commit:
> 
> blk-mq-tag: check for NULL rq when iterating tags
> 
> Since we introduced blk-mq-sched, the tags->rqs[] array has been
> dynamically assigned. So we need to check for NULL when iterating,
> since there's a window of time where the bit is set, but we haven't
> dynamically assigned the tags->rqs[] array position yet.
> 
> This is perfectly safe, since the memory backing of the request is
> never going away while the device is alive.
> 
> Does this mean that blk_mq_tagset_busy_iter() can skip requests that it
> shouldn't skip and also that blk_mq_tagset_busy_iter() can pass a pointer to
> the previous request that was associated with a tag instead of the current
> request to its busy_tag_iter_fn argument? Shouldn't these races be fixed,
> e.g. by swapping the order in which the tag bit is set and tags->rqs[] is
> assigned such that the correct request pointer is passed to the
> busy_tag_iter_fn argument?

We can't swap them. We need to reserve a bit first, which entails
setting that bit. Once that's set, we can assign ->rqs[]. The race isn't
a big deal, and I don't want to add any code to prevent it, since that
would mean locking. Since we have the luxury of the request itself
always being valid memory, we can deal with stale info or having a NULL
because of the ordering. It's a conscious trade off.
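
To illustrate the ordering (pieced together from the existing code paths,
function names approximate; not a patch): the tag bit becomes visible before
->rqs[] is written, and the NULL check added in this patch is what makes that
window harmless.

	/* allocation side, roughly blk_mq_get_tag() + blk_mq_rq_ctx_init(): */
	tag = __blk_mq_get_tag(data, bt);	/* (1) tag bit is now set ...   */
	rq = tags->static_rqs[tag];
	data->hctx->tags->rqs[tag] = rq;	/* (2) ... rq pointer published */

	/* iteration side, bt_iter() with this patch applied: */
	rq = tags->rqs[bitnr];
	if (rq && rq->q == hctx->queue)		/* tolerates the (1)-(2) window */
		iter_data->fn(hctx, rq, iter_data->data, reserved);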

-- 
Jens Axboe



Re: [PATCH 3/4] blk-mq: provide internal in-flight variant

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 14:45 -0600, Jens Axboe wrote:
> The inc/dec goes away for mq, the non-mq path still has to use them. I just
> move them as well. Could be a prep patch, but it's just moving the code out
> of the header and into a normal C file instead.

Hello Jens,

I misread that part of the patch. Now that I have had another look, these
changes look fine to me. Hence:

Reviewed-by: Bart Van Assche 

Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 14:40 -0600, Jens Axboe wrote:
> On 08/03/2017 02:35 PM, Jens Axboe wrote:
> > > I agree with what you wrote in the description of this patch.
> > > However, since I have not yet found the code that clears tags->rqs[],
> > > would it be possible to show me that code?
> > 
> > Since it's been a month since I wrote this code, I went and looked
> > too.  My memory was that we set/clear it dynamically since we added
> > scheduling, but looks like we don't clear it. The race is still valid
> > for when someone runs a tag check in parallel with someone allocating
> > a tag, since there's a window of time where the tag bit is set, but
> > ->rqs[tag] isn't set yet. That's probably the race I hit, not the
> > completion race mentioned in the change log.
> 
> Rewrote the commit message:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight=1908e43118e688e41ac8656edcf3e7a150f3f5081

Hello Jens,

This is what I found in the updated commit:

blk-mq-tag: check for NULL rq when iterating tags

Since we introduced blk-mq-sched, the tags->rqs[] array has been
dynamically assigned. So we need to check for NULL when iterating,
since there's a window of time where the bit is set, but we haven't
dynamically assigned the tags->rqs[] array position yet.

This is perfectly safe, since the memory backing of the request is
never going away while the device is alive.

Does this mean that blk_mq_tagset_busy_iter() can skip requests that it
shouldn't skip and also that blk_mq_tagset_busy_iter() can pass a pointer to
the previous request that was associated with a tag instead of the current
request to its busy_tag_iter_fn argument? Shouldn't these races be fixed,
e.g. by swapping the order in which the tag bit is set and tags->rqs[] is
assigned such that the correct request pointer is passed to the
busy_tag_iter_fn argument?

Thanks,

Bart.

Re: [PATCH 3/4] blk-mq: provide internal in-flight variant

2017-08-03 Thread Jens Axboe
On 08/03/2017 02:41 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> We don't have to inc/dec some counter, since we can just
>> iterate the tags. That makes inc/dec a noop, but means we
>> have to iterate busy tags to get an in-flight count.
>> [ ... ]
>> +unsigned int blk_mq_in_flight(struct request_queue *q,
>> +   struct hd_struct *part)
>> +{
>> +struct mq_inflight mi = { .part = part, .inflight = 0 };
> 
> Hello Jens,
> 
> A minor stylistic comment: since a C compiler is required to
> initialize to zero all member variables that have not been initialized
> explicitly I think ".inflight = 0" can be left out.

It can, I'll kill it.

>> diff --git a/block/genhd.c b/block/genhd.c
>> index f735af67a0c9..ad5dc567d57f 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk);
>>  static void disk_del_events(struct gendisk *disk);
>>  static void disk_release_events(struct gendisk *disk);
>>  
>> +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, 
>> int rw)
>> +{
>> +if (q->mq_ops)
>> +return;
>> +
>> +atomic_inc(&part->in_flight[rw]);
>> +if (part->partno)
>> +atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> +}
>> [ ... ]
>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>> index 7f7427e00f9c..f2c5096b3a7e 100644
>> --- a/include/linux/genhd.h
>> +++ b/include/linux/genhd.h
>> @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct 
>> *part)
>>  #define part_stat_sub(cpu, gendiskp, field, subnd)  \
>>  part_stat_add(cpu, gendiskp, field, -subnd)
>>  
>> -static inline void part_inc_in_flight(struct request_queue *q,
>> -  struct hd_struct *part, int rw)
>> -{
>> -atomic_inc(&part->in_flight[rw]);
>> -if (part->partno)
>> -atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> -}
>> [ ... ]
> 
> Sorry but to me it seems like this part of the patch does not match
> the patch description? The patch description mentions that inc and dec
> become a noop, but it seems to me that these functions have been
> uninlined instead of being made a noop?

The inc/dec goes away for mq, the non-mq path still has to use them. I just
move them as well. Could be a prep patch, but it's just moving the code out
of the header and into a normal C file instead.

-- 
Jens Axboe



Re: [PATCH 3/4] blk-mq: provide internal in-flight variant

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> We don't have to inc/dec some counter, since we can just
> iterate the tags. That makes inc/dec a noop, but means we
> have to iterate busy tags to get an in-flight count.
> [ ... ]
> +unsigned int blk_mq_in_flight(struct request_queue *q,
> +struct hd_struct *part)
> +{
> + struct mq_inflight mi = { .part = part, .inflight = 0 };

Hello Jens,

A minor stylistic comment: since a C compiler is required to initialize to zero
all member variables that have not been initialized explicitly I think
".inflight = 0" can be left out.

> diff --git a/block/genhd.c b/block/genhd.c
> index f735af67a0c9..ad5dc567d57f 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk);
>  static void disk_del_events(struct gendisk *disk);
>  static void disk_release_events(struct gendisk *disk);
>  
> +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int 
> rw)
> +{
> + if (q->mq_ops)
> + return;
> +
> + atomic_inc(&part->in_flight[rw]);
> + if (part->partno)
> + atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> +}
> [ ... ]
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 7f7427e00f9c..f2c5096b3a7e 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct 
> *part)
>  #define part_stat_sub(cpu, gendiskp, field, subnd)   \
>   part_stat_add(cpu, gendiskp, field, -subnd)
>  
> -static inline void part_inc_in_flight(struct request_queue *q,
> -   struct hd_struct *part, int rw)
> -{
> - atomic_inc(&part->in_flight[rw]);
> - if (part->partno)
> - atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> -}
> [ ... ]

Sorry but to me it seems like this part of the patch does not match the patch
description? The patch description mentions that inc and dec become a noop, but
it seems to me that these functions have been uninlined instead of being made
a noop?

Bart.

Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags

2017-08-03 Thread Jens Axboe
On 08/03/2017 02:35 PM, Jens Axboe wrote:
>> I agree with what you wrote in the description of this patch.
>> However, since I have not yet found the code that clears tags->rqs[],
>> would it be possible to show me that code?
> 
> Since it's been a month since I wrote this code, I went and looked
> too.  My memory was that we set/clear it dynamically since we added
> scheduling, but looks like we don't clear it. The race is still valid
> for when someone runs a tag check in parallel with someone allocating
> a tag, since there's a window of time where the tag bit is set, but
> ->rqs[tag] isn't set yet. That's probably the race I hit, not the
> completion race mentioned in the change log.

Rewrote the commit message:

http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight=1908e43118e688e41ac8656edcf3e7a150f3f5081

-- 
Jens Axboe



Re: [PATCH 2/4] block: pass in queue to inflight accounting

2017-08-03 Thread Jens Axboe
On 08/03/2017 02:35 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> No functional change in this patch, just in preparation for
>> basing the inflight mechanism on the queue in question.
>>
>> Signed-off-by: Jens Axboe 
>> ---
>>  block/bio.c   | 16 
>>  block/blk-core.c  | 22 --
>>  block/blk-merge.c |  4 ++--
>>  block/genhd.c |  4 ++--
>>  block/partition-generic.c |  5 +++--
>>  drivers/block/drbd/drbd_req.c | 10 +++---
>>  drivers/block/rsxx/dev.c  |  6 +++---
>>  drivers/block/zram/zram_drv.c |  5 +++--
>>  drivers/md/bcache/request.c   |  7 ---
>>  drivers/md/dm.c   |  6 +++---
>>  drivers/nvdimm/nd.h   |  5 +++--
>>  include/linux/bio.h   |  9 +
>>  include/linux/genhd.h | 14 +-
>>  13 files changed, 64 insertions(+), 49 deletions(-)
> 
> Hello Jens,
> 
> Should the DRBD, rsxx, zram, md and nvdimm driver maintainers be CC-ed?

Yeah, they should have been; I forgot to do so. I'll make sure they see it.
Since this is the initial posting, I'm assuming I'll have to do a v2 anyway.

-- 
Jens Axboe



Re: [PATCH 2/4] block: pass in queue to inflight accounting

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> No functional change in this patch, just in preparation for
> basing the inflight mechanism on the queue in question.
> 
> Signed-off-by: Jens Axboe 
> ---
>  block/bio.c   | 16 
>  block/blk-core.c  | 22 --
>  block/blk-merge.c |  4 ++--
>  block/genhd.c |  4 ++--
>  block/partition-generic.c |  5 +++--
>  drivers/block/drbd/drbd_req.c | 10 +++---
>  drivers/block/rsxx/dev.c  |  6 +++---
>  drivers/block/zram/zram_drv.c |  5 +++--
>  drivers/md/bcache/request.c   |  7 ---
>  drivers/md/dm.c   |  6 +++---
>  drivers/nvdimm/nd.h   |  5 +++--
>  include/linux/bio.h   |  9 +
>  include/linux/genhd.h | 14 +-
>  13 files changed, 64 insertions(+), 49 deletions(-)

Hello Jens,

Should the DRBD, rsxx, zram, md and nvdimm driver maintainers be CC-ed?

Anyway:

Reviewed-by: Bart Van Assche 

Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags

2017-08-03 Thread Jens Axboe
On 08/03/2017 02:29 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> Since we introduced blk-mq-sched, the tags->rqs[] array has been
>> dynamically assigned. So we need to check for NULL when iterating,
>> since we could be racing with completion.
>>
>> This is perfectly safe, since the memory backing of the request is
>> never going away while the device is alive. Only the pointer in
>> ->rqs[] may be reset.
>>
>> Signed-off-by: Jens Axboe 
>> ---
>>  block/blk-mq-tag.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d0be72ccb091..b856b2827157 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int 
>> bitnr, void *data)
>>  bitnr += tags->nr_reserved_tags;
>>  rq = tags->rqs[bitnr];
>>  
>> -if (rq->q == hctx->queue)
>> +if (rq && rq->q == hctx->queue)
>>  iter_data->fn(hctx, rq, iter_data->data, reserved);
>>  return true;
>>  }
>> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, 
>> unsigned int bitnr, void *data)
>>  if (!reserved)
>>  bitnr += tags->nr_reserved_tags;
>>  rq = tags->rqs[bitnr];
>> -
>> -iter_data->fn(rq, iter_data->data, reserved);
>> +if (rq)
>> +iter_data->fn(rq, iter_data->data, reserved);
>>  return true;
>>  }
> 
> Hello Jens,
> 
> I agree with what you wrote in the description of this patch. However, since
> I have not yet found the code that clears tags->rqs[], would it be possible
> to show me that code?

Since it's been a month since I wrote this code, I went and looked too.
My memory was that we set/clear it dynamically since we added
scheduling, but looks like we don't clear it. The race is still valid
for when someone runs a tag check in parallel with someone allocating a
tag, since there's a window of time where the tag bit is set, but
->rqs[tag] isn't set yet. That's probably the race I hit, not the
completion race mentioned in the change log.

-- 
Jens Axboe



Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> Since we introduced blk-mq-sched, the tags->rqs[] array has been
> dynamically assigned. So we need to check for NULL when iterating,
> since we could be racing with completion.
> 
> This is perfectly safe, since the memory backing of the request is
> never going away while the device is alive. Only the pointer in
> ->rqs[] may be reset.
> 
> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq-tag.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d0be72ccb091..b856b2827157 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int 
> bitnr, void *data)
>   bitnr += tags->nr_reserved_tags;
>   rq = tags->rqs[bitnr];
>  
> - if (rq->q == hctx->queue)
> + if (rq && rq->q == hctx->queue)
>   iter_data->fn(hctx, rq, iter_data->data, reserved);
>   return true;
>  }
> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
> int bitnr, void *data)
>   if (!reserved)
>   bitnr += tags->nr_reserved_tags;
>   rq = tags->rqs[bitnr];
> -
> - iter_data->fn(rq, iter_data->data, reserved);
> + if (rq)
> + iter_data->fn(rq, iter_data->data, reserved);
>   return true;
>  }

Hello Jens,

I agree with what you wrote in the description of this patch. However, since
I have not yet found the code that clears tags->rqs[], would it be possible
to show me that code?

Thanks,

Bart.


[PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time

2017-08-03 Thread Jens Axboe
We only need to iterate the queues and tags once, we if just pass
in both partition structs. Add part_in_flight_double() for
returning this value.

Signed-off-by: Jens Axboe 
---
 block/blk-core.c  | 34 +++---
 block/blk-mq.c| 25 +
 block/blk-mq.h|  3 ++-
 block/genhd.c | 23 +--
 include/linux/genhd.h |  2 ++
 5 files changed, 65 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ba88d71b442f..d111a8f5bdf1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1470,14 +1470,9 @@ static void add_acct_request(struct request_queue *q, 
struct request *rq,
 }
 
 static void part_round_stats_single(struct request_queue *q, int cpu,
-   struct hd_struct *part, unsigned long now)
+   struct hd_struct *part, unsigned long now,
+   unsigned int inflight)
 {
-   int inflight;
-
-   if (now == part->stamp)
-   return;
-
-   inflight = part_in_flight(q, part);
if (inflight) {
__part_stat_add(cpu, part, time_in_queue,
inflight * (now - part->stamp));
@@ -1505,12 +1500,29 @@ static void part_round_stats_single(struct 
request_queue *q, int cpu,
  */
 void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part)
 {
+   struct hd_struct *part2 = NULL;
unsigned long now = jiffies;
+   unsigned int inflight[2];
+   int stats = 0;
+
+   if (part->stamp != now)
+   stats |= 1;
+
+   if (part->partno) {
+   part2 = &part_to_disk(part)->part0;
+   if (part2->stamp != now)
+   stats |= 2;
+   }
+
+   if (!stats)
+   return;
+
+   part_in_flight_double(q, part, part2, inflight);
 
-   if (part->partno)
-   part_round_stats_single(q, cpu, &part_to_disk(part)->part0,
-   now);
-   part_round_stats_single(q, cpu, part, now);
+   if (stats & 2)
+   part_round_stats_single(q, cpu, part2, now, inflight[1]);
+   if (stats & 1)
+   part_round_stats_single(q, cpu, part, now, inflight[0]);
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 37035891e120..03d164917160 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -87,8 +87,9 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx 
*hctx,
 }
 
 struct mq_inflight {
-   struct hd_struct *part;
-   unsigned int inflight;
+   struct hd_struct *part1;
+   struct hd_struct *part2;
+   unsigned int *inflight;
 };
 
 static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
@@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx 
*hctx,
 {
struct mq_inflight *mi = priv;
 
-   if (rq->part == mi->part)
-   mi->inflight++;
+   if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+   return;
+   if (rq->part == mi->part1) {
+   mi->inflight[0]++;
+   if (mi->part1->partno &&
+   &part_to_disk(mi->part1)->part0 == mi->part2)
+   mi->inflight[1]++;
+   } else if (rq->part == mi->part2)
+   mi->inflight[1]++;
 }
 
-unsigned int blk_mq_in_flight(struct request_queue *q,
-  struct hd_struct *part)
+void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
+ struct hd_struct *part2, unsigned int *inflight)
 {
-   struct mq_inflight mi = { .part = part, .inflight = 0 };
+   struct mq_inflight mi = { .part1 = part1, .part2 = part2,
+ .inflight = inflight };
 
+   inflight[0] = inflight[1] = 0;
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, );
-   return mi.inflight;
 }
 
 void blk_freeze_queue_start(struct request_queue *q)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cade1a512a01..c5264a64fb7c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -138,6 +138,7 @@ static inline bool blk_mq_hw_queue_mapped(struct 
blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
 }
 
-unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part);
+void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
+   struct hd_struct *part2, unsigned int *inflight);
 
 #endif
diff --git a/block/genhd.c b/block/genhd.c
index ad5dc567d57f..6faacccd8453 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -67,13 +67,32 @@ void part_dec_in_flight(struct request_queue *q, struct 
hd_struct *part, int rw)
 
 int part_in_flight(struct request_queue *q, struct hd_struct *part)
 {
-   if (q->mq_ops)
-   return blk_mq_in_flight(q, part);
+   if (q->mq_ops) {
+   unsigned int inflight[2];
+

[PATCH 0/4] block: more scalable inflight tracking

2017-08-03 Thread Jens Axboe
One scaling issue we currently have in the block code is the
inflight accounting. It's based on a per-device atomic count
for reads and writes, which means that even for an mq device
with lots of hardware queues, we end up dirtying a per-device
cacheline for each IO. The issue can easily be observed by
using null_blk:

modprobe null_blk submit_queues=48 queue_mode=2

and running a fio job that has 32 jobs doing sync reads on
the device (average of 3 runs, though deviation is low):

stats   IOPS    usr     sys
---------------------------
on       2.6M    5.4%   94.6%
off     21.0M   33.7%   67.3%

which shows a 10x slowdown with stats enabled. If we look at
the profile for stats on, the top entries are:

37.38%  fio  [kernel.vmlinux]  [k] blk_account_io_done
14.65%  fio  [kernel.vmlinux]  [k] blk_account_io_start
14.29%  fio  [kernel.vmlinux]  [k] part_round_stats_single
11.81%  fio  [kernel.vmlinux]  [k] blk_account_io_completion
 3.62%  fio  [kernel.vmlinux]  [k] part_round_stats
 0.81%  fio  [kernel.vmlinux]  [k] __blkdev_direct_IO_simple

which shows the system time being dominated by the stats accounting.

This patch series replaces the atomic counter with the hamming weight
of the tag maps for tracking in-flight counts. This means we don't have
to do anything when IO starts or completes, and for reading the value
we just have to check how many bits we have set in the tag maps on
the queues. This part is limited to 1000 reads per second (for HZ=1000).
Using this approach, running the same test again results in:

stats   IOPS    usr     sys
---------------------------
on      20.4M   30.9%   69.0%
off     21.4M   32.4%   67.4%

and doing a profiled run with stats on, the top of stats reporting is
now:

1.23%  fio  [kernel.vmlinux]  [k] blk_account_io_done
0.83%  fio  [kernel.vmlinux]  [k] blk_account_io_start
0.55%  fio  [kernel.vmlinux]  [k] blk_account_io_completion

which is a lot more reasonable. The difference between stats on and
off is now also negligible.
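
For reference, a minimal sketch of where the once-per-jiffy limit comes from
(this is a simplification of part_round_stats() in patch 4/4, single partition
only, not the actual code): the busy-tag walk is skipped entirely when
part->stamp already matches the current jiffy, and part_round_stats_single()
advances the stamp.

	static void round_stats_rate_limited(struct request_queue *q, int cpu,
					     struct hd_struct *part)
	{
		unsigned long now = jiffies;
		unsigned int inflight[2];

		if (part->stamp == now)	/* already walked the tags this jiffy */
			return;

		/* one walk over the busy tags, so at most HZ walks per second;
		 * NULL second partition matches the whole-disk case in 4/4 */
		blk_mq_in_flight(q, part, NULL, inflight);
		part_round_stats_single(q, cpu, part, now, inflight[0]);
	}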

-- 
Jens Axboe




[PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags

2017-08-03 Thread Jens Axboe
Since we introduced blk-mq-sched, the tags->rqs[] array has been
dynamically assigned. So we need to check for NULL when iterating,
since we could be racing with completion.

This is perfectly safe, since the memory backing of the request is
never going away while the device is alive. Only the pointer in
->rqs[] may be reset.

Signed-off-by: Jens Axboe 
---
 block/blk-mq-tag.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d0be72ccb091..b856b2827157 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int 
bitnr, void *data)
bitnr += tags->nr_reserved_tags;
rq = tags->rqs[bitnr];
 
-   if (rq->q == hctx->queue)
+   if (rq && rq->q == hctx->queue)
iter_data->fn(hctx, rq, iter_data->data, reserved);
return true;
 }
@@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
if (!reserved)
bitnr += tags->nr_reserved_tags;
rq = tags->rqs[bitnr];
-
-   iter_data->fn(rq, iter_data->data, reserved);
+   if (rq)
+   iter_data->fn(rq, iter_data->data, reserved);
return true;
 }
 
-- 
2.7.4



[PATCH 2/4] block: pass in queue to inflight accounting

2017-08-03 Thread Jens Axboe
No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.

Signed-off-by: Jens Axboe 
---
 block/bio.c   | 16 
 block/blk-core.c  | 22 --
 block/blk-merge.c |  4 ++--
 block/genhd.c |  4 ++--
 block/partition-generic.c |  5 +++--
 drivers/block/drbd/drbd_req.c | 10 +++---
 drivers/block/rsxx/dev.c  |  6 +++---
 drivers/block/zram/zram_drv.c |  5 +++--
 drivers/md/bcache/request.c   |  7 ---
 drivers/md/dm.c   |  6 +++---
 drivers/nvdimm/nd.h   |  5 +++--
 include/linux/bio.h   |  9 +
 include/linux/genhd.h | 14 +-
 13 files changed, 64 insertions(+), 49 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 9cf98b29588a..4d751a5b56f0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1729,29 +1729,29 @@ void bio_check_pages_dirty(struct bio *bio)
}
 }
 
-void generic_start_io_acct(int rw, unsigned long sectors,
-  struct hd_struct *part)
+void generic_start_io_acct(struct request_queue *q, int rw,
+  unsigned long sectors, struct hd_struct *part)
 {
int cpu = part_stat_lock();
 
-   part_round_stats(cpu, part);
+   part_round_stats(q, cpu, part);
part_stat_inc(cpu, part, ios[rw]);
part_stat_add(cpu, part, sectors[rw], sectors);
-   part_inc_in_flight(part, rw);
+   part_inc_in_flight(q, part, rw);
 
part_stat_unlock();
 }
 EXPORT_SYMBOL(generic_start_io_acct);
 
-void generic_end_io_acct(int rw, struct hd_struct *part,
-unsigned long start_time)
+void generic_end_io_acct(struct request_queue *q, int rw,
+struct hd_struct *part, unsigned long start_time)
 {
unsigned long duration = jiffies - start_time;
int cpu = part_stat_lock();
 
part_stat_add(cpu, part, ticks[rw], duration);
-   part_round_stats(cpu, part);
-   part_dec_in_flight(part, rw);
+   part_round_stats(q, cpu, part);
+   part_dec_in_flight(q, part, rw);
 
part_stat_unlock();
 }
diff --git a/block/blk-core.c b/block/blk-core.c
index af393d5a9680..ba88d71b442f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1469,15 +1469,15 @@ static void add_acct_request(struct request_queue *q, 
struct request *rq,
__elv_add_request(q, rq, where);
 }
 
-static void part_round_stats_single(int cpu, struct hd_struct *part,
-   unsigned long now)
+static void part_round_stats_single(struct request_queue *q, int cpu,
+   struct hd_struct *part, unsigned long now)
 {
int inflight;
 
if (now == part->stamp)
return;
 
-   inflight = part_in_flight(part);
+   inflight = part_in_flight(q, part);
if (inflight) {
__part_stat_add(cpu, part, time_in_queue,
inflight * (now - part->stamp));
@@ -1488,6 +1488,7 @@ static void part_round_stats_single(int cpu, struct 
hd_struct *part,
 
 /**
  * part_round_stats() - Round off the performance stats on a struct disk_stats.
+ * @q: target block queue
  * @cpu: cpu number for stats access
  * @part: target partition
  *
@@ -1502,13 +1503,14 @@ static void part_round_stats_single(int cpu, struct 
hd_struct *part,
  * /proc/diskstats.  This accounts immediately for all queue usage up to
  * the current jiffies and restarts the counters again.
  */
-void part_round_stats(int cpu, struct hd_struct *part)
+void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part)
 {
unsigned long now = jiffies;
 
if (part->partno)
-   part_round_stats_single(cpu, &part_to_disk(part)->part0, now);
-   part_round_stats_single(cpu, part, now);
+   part_round_stats_single(q, cpu, &part_to_disk(part)->part0,
+   now);
+   part_round_stats_single(q, cpu, part, now);
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
@@ -2434,8 +2436,8 @@ void blk_account_io_done(struct request *req)
 
part_stat_inc(cpu, part, ios[rw]);
part_stat_add(cpu, part, ticks[rw], duration);
-   part_round_stats(cpu, part);
-   part_dec_in_flight(part, rw);
+   part_round_stats(req->q, cpu, part);
+   part_dec_in_flight(req->q, part, rw);
 
hd_struct_put(part);
part_stat_unlock();
@@ -2492,8 +2494,8 @@ void blk_account_io_start(struct request *rq, bool new_io)
part = &rq->rq_disk->part0;
hd_struct_get(part);
}
-   part_round_stats(cpu, part);
-   part_inc_in_flight(part, rw);
+   part_round_stats(rq->q, cpu, part);
+   part_inc_in_flight(rq->q, part, rw);

[PATCH 3/4] blk-mq: provide internal in-flight variant

2017-08-03 Thread Jens Axboe
We don't have to inc/dec some counter, since we can just
iterate the tags. That makes inc/dec a noop, but means we
have to iterate busy tags to get an in-flight count.

Signed-off-by: Jens Axboe 
---
 block/blk-mq.c| 24 
 block/blk-mq.h|  2 ++
 block/genhd.c | 29 +
 include/linux/genhd.h | 25 +++--
 4 files changed, 58 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 05dfa3f270ae..37035891e120 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx 
*hctx,
sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
 }
 
+struct mq_inflight {
+   struct hd_struct *part;
+   unsigned int inflight;
+};
+
+static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
+ struct request *rq, void *priv,
+ bool reserved)
+{
+   struct mq_inflight *mi = priv;
+
+   if (rq->part == mi->part)
+   mi->inflight++;
+}
+
+unsigned int blk_mq_in_flight(struct request_queue *q,
+  struct hd_struct *part)
+{
+   struct mq_inflight mi = { .part = part, .inflight = 0 };
+
+   blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+   return mi.inflight;
+}
+
 void blk_freeze_queue_start(struct request_queue *q)
 {
int freeze_depth;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1a06fdf9fd4d..cade1a512a01 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -138,4 +138,6 @@ static inline bool blk_mq_hw_queue_mapped(struct 
blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
 }
 
+unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part);
+
 #endif
diff --git a/block/genhd.c b/block/genhd.c
index f735af67a0c9..ad5dc567d57f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk);
 static void disk_del_events(struct gendisk *disk);
 static void disk_release_events(struct gendisk *disk);
 
+void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int 
rw)
+{
+   if (q->mq_ops)
+   return;
+
+   atomic_inc(&part->in_flight[rw]);
+   if (part->partno)
+   atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+}
+
+void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int 
rw)
+{
+   if (q->mq_ops)
+   return;
+
+   atomic_dec(&part->in_flight[rw]);
+   if (part->partno)
+   atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+}
+
+int part_in_flight(struct request_queue *q, struct hd_struct *part)
+{
+   if (q->mq_ops)
+   return blk_mq_in_flight(q, part);
+
+   return atomic_read(&part->in_flight[0]) +
+   atomic_read(&part->in_flight[1]);
+}
+
 /**
  * disk_get_part - get partition
  * @disk: disk to look partition from
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7f7427e00f9c..f2c5096b3a7e 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part)
 #define part_stat_sub(cpu, gendiskp, field, subnd) \
part_stat_add(cpu, gendiskp, field, -subnd)
 
-static inline void part_inc_in_flight(struct request_queue *q,
- struct hd_struct *part, int rw)
-{
-   atomic_inc(&part->in_flight[rw]);
-   if (part->partno)
-   atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
-}
-
-static inline void part_dec_in_flight(struct request_queue *q,
- struct hd_struct *part, int rw)
-{
-   atomic_dec(&part->in_flight[rw]);
-   if (part->partno)
-   atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
-}
-
-static inline int part_in_flight(struct request_queue *q,
-struct hd_struct *part)
-{
-   return atomic_read(&part->in_flight[0]) +
-   atomic_read(&part->in_flight[1]);
-}
+int part_in_flight(struct request_queue *q, struct hd_struct *part);
+void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int 
rw);
+void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int 
rw);
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
 {
-- 
2.7.4



Re: [PATCH 04/14] blk-mq-sched: improve dispatching from sw queue

2017-08-03 Thread Bart Van Assche
On Thu, 2017-08-03 at 11:13 +0800, Ming Lei wrote:
> On Thu, Aug 03, 2017 at 01:35:29AM +, Bart Van Assche wrote:
> > On Wed, 2017-08-02 at 11:31 +0800, Ming Lei wrote:
> > > On Tue, Aug 01, 2017 at 03:11:42PM +, Bart Van Assche wrote:
> > > > On Tue, 2017-08-01 at 18:50 +0800, Ming Lei wrote:
> > > > > On Tue, Aug 01, 2017 at 06:17:18PM +0800, Ming Lei wrote:
> > > > > > How can we get the accurate 'number of requests in progress' 
> > > > > > efficiently?
> > > > 
> > > > Hello Ming,
> > > > 
> > > > How about counting the number of bits that have been set in the tag set?
> > > > I am aware that these bits can be set and/or cleared concurrently with
> > > > the dispatch code but that count is probably a good starting point.
> > > 
> > > It has to be an atomic_t, which is too heavy for us; please see the
> > > report:
> > > 
> > >   http://marc.info/?t=14986844843=1=2
> > > 
> > > Both Jens and I want to kill hd_struct.in_flight, but it looks like
> > > there is still no good way.
> > 
> > Hello Ming,
> > 
> > Sorry but I disagree that a new atomic variable should be added to keep
> > track of the number of busy requests. Counting the number of bits that are
> > set in the tag set should be good enough in this context.
> 
> That won't work because the tag set is host wide and shared by all LUNs.

Hello Ming,

Are you aware that the SCSI core already keeps track of the number of busy
requests per LUN? See also the device_busy member of struct scsi_device. How
about giving the block layer core access in some way to that counter?
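
A purely hypothetical sketch of that idea (the busy_count callback and
scsi_mq_busy_count() below are made-up names, not existing interfaces;
struct scsi_device::device_busy and q->queuedata pointing at the scsi_device
are real):

	/* hypothetical hook: let the driver report its own busy count */
	struct blk_busy_ops {
		unsigned int (*busy_count)(struct request_queue *q);
	};

	/* a possible SCSI implementation reusing the existing per-LUN counter */
	static unsigned int scsi_mq_busy_count(struct request_queue *q)
	{
		struct scsi_device *sdev = q->queuedata;

		return atomic_read(&sdev->device_busy);
	}

That would avoid adding a new atomic in the block layer for SCSI, though it
would only help drivers that already maintain such a counter.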

Bart.

Re: [PATCH BUGFIX/IMPROVEMENT] block, bfq: improve and refactor throughput-boosting logic

2017-08-03 Thread Jens Axboe
On 08/03/2017 10:48 AM, Paolo Valente wrote:
> When a queue associated with a process remains empty, there are cases
> where throughput gets boosted if the device is idled to await the
> arrival of a new I/O request for that queue. Currently, BFQ assumes
> that one of these cases is when the device has no internal queueing
> (regardless of the properties of the I/O being served). Unfortunately,
> this condition has proved to be too general. So, this commit refines it
> as "the device has no internal queueing and is rotational".
> 
> This refinement provides a significant throughput boost with random
> I/O, on flash-based storage without internal queueing. For example, on
> a HiKey board, throughput increases by up to 125%, growing, e.g., from
> 6.9MB/s to 15.6MB/s with two or three random readers in parallel.
> 
> This commit also refactors the code related to device idling, for the
> following reason. Finding the change that provides the above large
> improvement has been slightly more difficult than it had to be,
> because the logic that decides whether to idle the device is still
> scattered across three functions. Almost all of the logic is in the
> function bfq_bfqq_may_idle, but (1) part of the decision is made in
> bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
> switch off idling regardless of the output of bfq_bfqq_may_idle. In
> addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
> their decisions as a function of parameters that are used, for similar
> purposes, also in bfq_bfqq_may_idle. This commit addresses this issue
> by moving all the logic into bfq_bfqq_may_idle.

This should be split into two patches - do one refactor patch that
doesn't change anything, then your functional change on top of that.

-- 
Jens Axboe



[PATCH BUGFIX/IMPROVEMENT] block, bfq: improve and refactor throughput-boosting logic

2017-08-03 Thread Paolo Valente
When a queue associated with a process remains empty, there are cases
where throughput gets boosted if the device is idled to await the
arrival of a new I/O request for that queue. Currently, BFQ assumes
that one of these cases is when the device has no internal queueing
(regardless of the properties of the I/O being served). Unfortunately,
this condition has proved to be too general. So, this commit refines it
as "the device has no internal queueing and is rotational".

This refinement provides a significant throughput boost with random
I/O, on flash-based storage without internal queueing. For example, on
a HiKey board, throughput increases by up to 125%, growing, e.g., from
6.9MB/s to 15.6MB/s with two or three random readers in parallel.

This commit also refactors the code related to device idling, for the
following reason. Finding the change that provides the above large
improvement has been slightly more difficult than it had to be,
because the logic that decides whether to idle the device is still
scattered across three functions. Almost all of the logic is in the
function bfq_bfqq_may_idle, but (1) part of the decision is made in
bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
switch off idling regardless of the output of bfq_bfqq_may_idle. In
addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
their decisions as a function of parameters that are used, for similar
purposes, also in bfq_bfqq_may_idle. This commit addresses this issue
by moving all the logic into bfq_bfqq_may_idle.

Signed-off-by: Paolo Valente 
Signed-off-by: Luca Miccio 
---
 block/bfq-iosched.c | 144 
 block/bfq-iosched.h |  12 ++---
 2 files changed, 85 insertions(+), 71 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..509f399 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -128,7 +128,7 @@ BFQ_BFQQ_FNS(busy);
 BFQ_BFQQ_FNS(wait_request);
 BFQ_BFQQ_FNS(non_blocking_wait_rq);
 BFQ_BFQQ_FNS(fifo_expire);
-BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(has_short_ttime);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(IO_bound);
 BFQ_BFQQ_FNS(in_large_burst);
@@ -731,10 +731,10 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
unsigned int old_wr_coeff = bfqq->wr_coeff;
bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
 
-   if (bic->saved_idle_window)
-   bfq_mark_bfqq_idle_window(bfqq);
+   if (bic->saved_has_short_ttime)
+   bfq_mark_bfqq_has_short_ttime(bfqq);
else
-   bfq_clear_bfqq_idle_window(bfqq);
+   bfq_clear_bfqq_has_short_ttime(bfqq);
 
if (bic->saved_IO_bound)
bfq_mark_bfqq_IO_bound(bfqq);
@@ -2012,7 +2012,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
return;
 
bic->saved_ttime = bfqq->ttime;
-   bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+   bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
bic->was_in_burst_list = !hlist_unhashed(>burst_list_node);
@@ -3038,8 +3038,8 @@ void bfq_bfqq_expire(struct bfq_data *bfqd,
}
 
bfq_log_bfqq(bfqd, bfqq,
-   "expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
-   slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+   "expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,
+   slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
 
/*
 * Increase, decrease or leave budget unchanged according to
@@ -3114,7 +3114,10 @@ static bool bfq_may_expire_for_budg_timeout(struct 
bfq_queue *bfqq)
 static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 {
struct bfq_data *bfqd = bfqq->bfqd;
-   bool idling_boosts_thr, idling_boosts_thr_without_issues,
+   bool rot_without_queueing =
+   !blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
+   bfqq_sequential_and_IO_bound,
+   idling_boosts_thr, idling_boosts_thr_without_issues,
idling_needed_for_service_guarantees,
asymmetric_scenario;
 
@@ -3122,27 +3125,45 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
return true;
 
/*
+* Idling is performed only if slice_idle > 0. In addition, we
+* do not idle if
+* (a) bfqq is async
+* (b) bfqq is in the idle io prio class: in this case we do
+* not idle because we want to minimize the bandwidth that
+* queues in this class can steal from higher-priority queues
+*/
+   if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
+   bfq_class_idle(bfqq))
+   return false;
+
+   bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
+   

Re: [PATCH 17/19] bcache: fix for gc and write-back race

2017-08-03 Thread Coly Li
On 2017/7/1 at 4:43 AM, bca...@lists.ewheeler.net wrote:
> From: Tang Junhui 
> 
> gc and write-back race with each other (see the email "bcache get stucked"
> I sent before):
>
> gc thread                            write-back thread
> |                                    |bch_writeback_thread()
> |bch_gc_thread()                     |
> |                                    |==>read_dirty()
> |==>bch_btree_gc()                   |
> |==>btree_root() //get btree root    |
> |                //node write lock   |
> |==>bch_btree_gc_root()              |
> |                                    |==>read_dirty_submit()
> |                                    |==>write_dirty()
> |                                    |==>continue_at(cl, write_dirty_finish, system_wq);
> |                                    |==>write_dirty_finish() //executes in system_wq
> |                                    |==>bch_btree_insert()
> |                                    |==>bch_btree_map_leaf_nodes()
> |                                    |==>__bch_btree_map_nodes()
> |                                    |==>btree_root //try to get btree root
> |                                    |              //node read lock
> |                                    |              --stuck here
> |==>bch_btree_set_root()             |
> |==>bch_journal_meta()               |
> |==>bch_journal()                    |
> |==>journal_try_write()              |
> |==>journal_write_unlocked() //journal_full(&c->journal) condition satisfied
> |==>continue_at(cl, journal_write, system_wq); //try to execute journal_write
> |                                    //in system_wq, but the work queue is
> |                                    //executing write_dirty_finish()
> |==>closure_sync(); //wait for journal_write to finish and wake up gc
> |                   --stuck here
> |==>release root node write lock
>
> This patch allocates a separate workqueue for the write-back thread to
> avoid such a race.
> 
> Signed-off-by: Tang Junhui 
> Cc: sta...@vger.kernel.org

Adding a per-cached-device workqueue is a good idea; it's OK to me.

Acked-by: Coly Li 

Thanks.

Coly

> ---
>  drivers/md/bcache/bcache.h| 1 +
>  drivers/md/bcache/super.c | 2 ++
>  drivers/md/bcache/writeback.c | 8 ++--
>  3 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 44123e4..deb0a6c 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -333,6 +333,7 @@ struct cached_dev {
>   /* Limit number of writeback bios in flight */
>   struct semaphore        in_flight;
>   struct task_struct  *writeback_thread;
> + struct workqueue_struct *writeback_write_wq;
>  
>   struct keybuf   writeback_keys;
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index e06641e..24cb9b7 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1063,6 +1063,8 @@ static void cached_dev_free(struct closure *cl)
>   cancel_delayed_work_sync(&dc->writeback_rate_update);
>   if (!IS_ERR_OR_NULL(dc->writeback_thread))
>   kthread_stop(dc->writeback_thread);
> + if (dc->writeback_write_wq)
> + destroy_workqueue(dc->writeback_write_wq);
>  
>   mutex_lock(&bch_register_lock);
>  
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 4104eaa..4bc5daa 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -189,7 +189,7 @@ static void write_dirty(struct closure *cl)
>  
>   closure_bio_submit(&io->bio, cl);
>  
> - continue_at(cl, write_dirty_finish, system_wq);
> + continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
>  }
>  
>  static void read_dirty_endio(struct bio *bio)
> @@ -209,7 +209,7 @@ static void read_dirty_submit(struct closure *cl)
>  
>   closure_bio_submit(&io->bio, cl);
>  
> - continue_at(cl, write_dirty, system_wq);
> + continue_at(cl, write_dirty, io->dc->writeback_write_wq);
>  }
>  
>  static void read_dirty(struct cached_dev *dc)
> @@ -527,6 +527,10 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>  
>  int bch_cached_dev_writeback_start(struct cached_dev *dc)
>  {
> + dc->writeback_write_wq = alloc_workqueue("bcache_writeback_wq",
> +                                          WQ_MEM_RECLAIM, 0);
> + if (!dc->writeback_write_wq)
> + return -ENOMEM;
> +
>   dc->writeback_thread = kthread_create(bch_writeback_thread, dc,
> "bcache_writeback");
>   if (IS_ERR(dc->writeback_thread))
> 
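Pulling the three hunks of the quoted patch into one view, the lifecycle of the new per-device workqueue is short. The following is only a sketch under the same names as the patch (error paths and unrelated fields trimmed), not the bcache code itself:

/* Sketch of the per-cached-device workqueue lifecycle added above. */
#include <linux/workqueue.h>

struct cached_dev_sketch {
	struct workqueue_struct *writeback_write_wq;
};

/* On writeback start: give writeback completions their own workqueue so
 * they no longer share system_wq with the journal work they may block. */
static int writeback_start_sketch(struct cached_dev_sketch *dc)
{
	dc->writeback_write_wq = alloc_workqueue("bcache_writeback_wq",
						 WQ_MEM_RECLAIM, 0);
	if (!dc->writeback_write_wq)
		return -ENOMEM;
	return 0;
}

/* On device teardown: drain and free the dedicated workqueue. */
static void device_free_sketch(struct cached_dev_sketch *dc)
{
	if (dc->writeback_write_wq)
		destroy_workqueue(dc->writeback_write_wq);
}

The dirty-I/O closures are then continued on dc->writeback_write_wq instead of system_wq, as the writeback.c hunks above show.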



[PATCH] bio-integrity: revert "stop abusing bi_end_io"

2017-08-03 Thread Mikulas Patocka


On Wed, 2 Aug 2017, Christoph Hellwig wrote:

> On Wed, Aug 02, 2017 at 02:27:50PM +0200, Milan Broz wrote:
> > In dm-integrity target we register an integrity profile that has
> > both generate_fn and verify_fn callbacks set to NULL.
> > 
> > This is used if dm-integrity is stacked under a dm-crypt device
> > for authenticated encryption (integrity payload contains authentication
> > tag and IV seed).
> > 
> > In this case the verification is done through dm-crypt's own crypto API
> > processing; the integrity profile is only the holder
> > of these data. (And the memory is owned by dm-crypt as well.)
> 
> Maybe that's where the problem lies?  You're abusing the integrity
> payload for something that is not end to end data integrity at all
> and then wonder why it breaks?  Also the commit that introduced your
> code had absolutely no review by Martin or any of the core block
> folks.
> 
> The right fix is to revert the dm-crypt commit.

The dm-crypt commit that uses the bio integrity payload came 3 months before
7c20f11680a441df09de7235206f70115fbf6290, and it was already present in
4.12.

The patch 7c20f11680a441df09de7235206f70115fbf6290 not only breaks 
dm-crypt and dm-integrity. It also breaks regular integrity payload usage 
when stacking drivers are used.

Suppose that a bio goes through the device mapper stack. At each level
a clone bio is allocated and forwarded to a lower level. It eventually
reaches the request-based physical disk driver.

In the kernel 4.12, when the bio reaches the request-based driver, the
function blk_queue_bio is called; it calls bio_integrity_prep, which
saves bio->bi_end_io in the structure bio_integrity_payload and sets
bio->bi_end_io = bio_integrity_endio. When the bio is finished,
bio_integrity_endio is called; it performs the verification (a call to
bio_integrity_verify_fn), restores bio->bi_end_io and calls it.

The patch 7c20f11680a441df09de7235206f70115fbf6290 changes it so that 
bio->bi_end_io is not changed and bio_integrity_endio is called directly 
from bio_endio if the bio has integrity payload. The unintended 
consequence of this change is that when a request with integrity payload 
is finished and bio_endio is called for each cloned bio, the verification 
routine bio_integrity_verify_fn is called for every level in the stack (it 
used to be called only for the lowest level in 4.12). At different levels 
in the stack, the bio may have a different bi_sector value, so the repeated
verification can't succeed.

I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be 
reverted and it should be reworked for the next merge window, so that it 
calls bio_integrity_verify_fn only for the lowest level.
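
To make the mechanism concrete, here is a user-space sketch (simplified structures, no cloning; the names merely echo the kernel ones and this is not the kernel code) of the 4.12-style handoff described above, where bio_integrity_prep saves the current completion handler, substitutes its own, and restores it after verification, so verification runs exactly once at the level that installed the hook:

/* User-space sketch of the save/substitute/restore completion handoff. */
#include <stdio.h>

struct bio_sketch;
typedef void (*end_io_fn)(struct bio_sketch *);

struct bip_sketch {			/* stands in for bio_integrity_payload */
	end_io_fn bip_end_io;		/* saved original completion handler   */
};

struct bio_sketch {			/* stands in for struct bio            */
	end_io_fn bi_end_io;
	struct bip_sketch bip;
};

static void driver_end_io(struct bio_sketch *bio)
{
	printf("original completion runs\n");
}

static void integrity_end_io(struct bio_sketch *bio)
{
	printf("verify integrity once, at the level that installed the hook\n");
	bio->bi_end_io = bio->bip.bip_end_io;	/* restore original handler */
	bio->bi_end_io(bio);			/* and complete the bio     */
}

static void integrity_prep(struct bio_sketch *bio)
{
	bio->bip.bip_end_io = bio->bi_end_io;	/* save                      */
	bio->bi_end_io = integrity_end_io;	/* swap in verification hook */
}

int main(void)
{
	struct bio_sketch bio = { .bi_end_io = driver_end_io };

	integrity_prep(&bio);	/* done by the request-based driver only   */
	bio.bi_end_io(&bio);	/* completion: verify once, then hand off  */
	return 0;
}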

Mikulas



From: Mikulas Patocka 

The patch 7c20f11680a441df09de7235206f70115fbf6290 ("bio-integrity: stop
abusing bi_end_io") changes the code so that it no longer replaces
bi_end_io with bio_integrity_endio, and instead calls bio_integrity_endio
directly from bio_endio.

The unintended consequence of this change is that when a bio with 
integrity payload is passed through the device stack, bio_integrity_endio 
is called for each level of the stack (it used to be called only for the 
lowest level before).

This breaks dm-integrity and dm-crypt and it also causes verification 
errors when a bio with regular integrity payload is passed through the 
device mapper stack.

Signed-off-by: Mikulas Patocka 

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f733beab6ca2..8df4eb103ba9 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -102,7 +102,7 @@ EXPORT_SYMBOL(bio_integrity_alloc);
  * Description: Used to free the integrity portion of a bio. Usually
  * called from bio_free().
  */
-static void bio_integrity_free(struct bio *bio)
+void bio_integrity_free(struct bio *bio)
 {
struct bio_integrity_payload *bip = bio_integrity(bio);
struct bio_set *bs = bio->bi_pool;
@@ -120,8 +120,8 @@ static void bio_integrity_free(struct bio *bio)
}
 
bio->bi_integrity = NULL;
-   bio->bi_opf &= ~REQ_INTEGRITY;
 }
+EXPORT_SYMBOL(bio_integrity_free);
 
 /**
  * bio_integrity_add_page - Attach integrity metadata
@@ -326,6 +326,12 @@ bool bio_integrity_prep(struct bio *bio)
offset = 0;
}
 
+   /* Install custom I/O completion handler if read verify is enabled */
+   if (bio_data_dir(bio) == READ) {
+   bip->bip_end_io = bio->bi_end_io;
+   bio->bi_end_io = bio_integrity_endio;
+   }
+
/* Auto-generate integrity metadata if this is a write */
if (bio_data_dir(bio) == WRITE) {
bio_integrity_process(bio, &bio->bi_iter,
@@ -369,12 +375,13 @@ static void bio_integrity_verify_fn(struct work_struct *work)
bio->bi_status = BLK_STS_IOERR;
}
 
-   bio_integrity_free(bio);
+   /* Restore original bio completion handler */
+   bio->bi_end_io = 

Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Ming Lei
On Thu, Aug 3, 2017 at 6:47 PM, Mel Gorman  wrote:
> On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
>> On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman  
>> wrote:
>> > On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> >> Hi Mel Gorman,
>> >>
>> >> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman  
>> >> wrote:
>> >> > Hi Christoph,
>> >> >
>> >> > I know the reasons for switching to MQ by default but just be aware 
>> >> > that it's
>> >> > not without hazards albeit it the biggest issues I've seen are switching
>> >> > CFQ to BFQ. On my home grid, there is some experimental automatic 
>> >> > testing
>> >> > running every few weeks searching for regressions. Yesterday, it noticed
>> >> > that creating some work files for a postgres simulator called pgioperf
>> >> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> >> > linearly writing two files for testing on another benchmark and is not
>> >> > remarkable. The relevant part of the report is
>> >>
>> >> We saw some SCSI-MQ performance issue too, please see if the following
>> >> patchset fixes your issue:
>> >>
>> >> http://marc.info/?l=linux-block=150151989915776=2
>> >>
>> >
>> > That series is dealing with problems with legacy-deadline vs mq-none where
>> > as the bulk of the problems reported in this mail are related to
>> > legacy-CFQ vs mq-BFQ.
>>
>> The serials deals with none and all mq schedulers, and you can see
>> the improvement on mq-deadline in cover letter, :-)
>>
>
> Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ
> that was not observed on other schedulers?

Actually, if you look at the cover letter, you will see this patchset
increases sequential I/O IOPS on mq-deadline by more than 10X, so it would
be reasonable to expect it to help with a 2x to 4x BFQ slowdown, but I
didn't test BFQ.

Thanks,
Ming Lei


Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Mel Gorman
On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote:
> > For Paolo, if you want to try preemptively dealing with regression reports
> > before 4.13 releases then all the tests in question can be reproduced with
> > https://github.com/gormanm/mmtests . The most relevant test configurations
> > I've seen so far are
> > 
> > configs/config-global-dhp__io-dbench4-async
> > configs/config-global-dhp__io-fio-randread-async-randwrite
> > configs/config-global-dhp__io-fio-randread-async-seqwrite
> > configs/config-global-dhp__io-fio-randread-sync-heavywrite
> > configs/config-global-dhp__io-fio-randread-sync-randwrite
> > configs/config-global-dhp__pgioperf
> > 
> 
> Hi Mel,
> as already happened with the latest Phoronix benchmark article (and
> with other test results reported several months ago on this list), bad
> results may be caused (also) by the fact that the low-latency, default
> configuration of BFQ is being used. 

I took that into account: BFQ with low-latency was also tested, and the
impact was not a universal improvement, although it can be a noticeable
improvement. From the same machine:

dbench4 Loadfile Execution Time
                              4.12.0             4.12.0             4.12.0
                          legacy-cfq             mq-bfq        mq-bfq-tput
Amean     1       80.67 (   0.00%)   83.68 (  -3.74%)   84.70 (  -5.00%)
Amean     2       92.87 (   0.00%)  121.63 ( -30.96%)   88.74 (   4.45%)
Amean     4      102.72 (   0.00%)  474.33 (-361.77%)  113.97 ( -10.95%)
Amean     32    2543.93 (   0.00%) 1927.65 (  24.23%) 2038.74 (  19.86%)

However, it's not a universal gain and there are also fairness issues.
For example, this is a fio configuration with a single random reader and
a single random writer on the same machine

fio Throughput
                                       4.12.0             4.12.0             4.12.0
                                   legacy-cfq             mq-bfq        mq-bfq-tput
Hmean     kb/sec-writer-write     398.15 (   0.00%) 4659.18 (1070.21%) 4934.52 (1139.37%)
Hmean     kb/sec-reader-read      507.00 (   0.00%)   66.36 ( -86.91%)   14.68 ( -97.10%)

With CFQ, there is some fairness between the readers and writers; with
BFQ, there is a strong preference for writers. Again, this is not
universal. It'll be a mix and sometimes it'll be classed as a gain and
sometimes a regression.

While I accept that BFQ can be tuned, tuning IO schedulers is not something
that normal users get right and they'll only look at "out of box" performance
which, right now, will trigger bug reports. This is neither good nor bad,
it simply is.
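
For reference, the "tuning" in question amounts to two sysfs writes. The snippet below is illustrative only (it assumes a device named sda and a kernel where BFQ is available for blk-mq); as I read the tables above, the mq-bfq-tput columns correspond to low_latency being turned off:

/* Illustrative only: select BFQ and disable its low-latency heuristics. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* Pick BFQ for the device (assumed here to be sda) ... */
	write_sysfs("/sys/block/sda/queue/scheduler", "bfq");
	/* ... and disable the default low-latency heuristics ("tput" mode). */
	write_sysfs("/sys/block/sda/queue/iosched/low_latency", "0");
	return 0;
}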

> This configuration is the default
> one because the motivation for yet-another-scheduler as BFQ is that it
> drastically reduces latency for interactive and soft real-time tasks
> (e.g., opening an app or playing/streaming a video), when there is
> some background I/O.  Low-latency heuristics are willing to sacrifice
> throughput when this provides a large benefit in terms of the above
> latency.
> 

I had seen this assertion, so one of the fio configurations had multiple
heavy writers in the background and a random reader of small files to
simulate that scenario. The intent was to simulate heavy IO in the presence
of application startup.

                                       4.12.0             4.12.0             4.12.0
                                   legacy-cfq             mq-bfq        mq-bfq-tput
Hmean     kb/sec-writer-write    1997.75 (   0.00%) 2035.65 (   1.90%) 2014.50 (   0.84%)
Hmean     kb/sec-reader-read      128.50 (   0.00%)   79.46 ( -38.16%)   12.78 ( -90.06%)

Write throughput is steady-ish across each IO scheduler, but readers get
starved badly, which I expect would slow application startup, and disabling
low_latency makes it much worse. The mmtests configuration in question
is global-dhp__io-fio-randread-sync-heavywrite, albeit edited to create
a fresh XFS filesystem on a test partition.

This is not exactly equivalent to real application startup but that can
be difficult to quantify properly.

> Of course, BFQ may not be optimal for every workload, even if
> low-latency mode is switched off.  In addition, there may still be
> some bug.  I'll repeat your tests on a machine of mine ASAP.
> 

The intent here is not to rag on BFQ because I know it's going to have some
wins and some losses and will take time to fix up. The primary intent was
to flag that 4.13 might have some "blah blah blah is slower on 4.13" reports
due to the switching of defaults that will bisect to a misleading commit.

-- 
Mel Gorman
SUSE Labs


Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Mel Gorman
On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
> On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman  
> wrote:
> > On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> >> Hi Mel Gorman,
> >>
> >> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman  
> >> wrote:
> >> > Hi Christoph,
> >> >
> >> > I know the reasons for switching to MQ by default but just be aware that 
> >> > it's
> >> > not without hazards albeit it the biggest issues I've seen are switching
> >> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
> >> > running every few weeks searching for regressions. Yesterday, it noticed
> >> > that creating some work files for a postgres simulator called pgioperf
> >> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> >> > linearly writing two files for testing on another benchmark and is not
> >> > remarkable. The relevant part of the report is
> >>
> >> We saw some SCSI-MQ performance issue too, please see if the following
> >> patchset fixes your issue:
> >>
> >> http://marc.info/?l=linux-block=150151989915776=2
> >>
> >
> > That series is dealing with problems with legacy-deadline vs mq-none where
> > as the bulk of the problems reported in this mail are related to
> > legacy-CFQ vs mq-BFQ.
> 
> The serials deals with none and all mq schedulers, and you can see
> the improvement on mq-deadline in cover letter, :-)
> 

Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ
that was not observed on other schedulers?

-- 
Mel Gorman
SUSE Labs


Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Mel Gorman
On Thu, Aug 03, 2017 at 11:44:06AM +0200, Paolo Valente wrote:
> > That series is dealing with problems with legacy-deadline vs mq-none where
> > as the bulk of the problems reported in this mail are related to
> > legacy-CFQ vs mq-BFQ.
> > 
> 
> Out-of-curiosity: you get no regression with mq-none or mq-deadline?
> 

I didn't test mq-none as the underlying storage was not fast enough to
make a legacy-noop vs mq-none comparison meaningful. legacy-deadline vs
mq-deadline did show small regressions on some workloads, but nothing as
dramatic, and small enough that it could go unnoticed in some cases.

-- 
Mel Gorman
SUSE Labs


Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Ming Lei
On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman  wrote:
> On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> Hi Mel Gorman,
>>
>> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman  
>> wrote:
>> > Hi Christoph,
>> >
>> > I know the reasons for switching to MQ by default but just be aware that 
>> > it's
>> > not without hazards albeit it the biggest issues I've seen are switching
>> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> > running every few weeks searching for regressions. Yesterday, it noticed
>> > that creating some work files for a postgres simulator called pgioperf
>> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> > linearly writing two files for testing on another benchmark and is not
>> > remarkable. The relevant part of the report is
>>
>> We saw some SCSI-MQ performance issue too, please see if the following
>> patchset fixes your issue:
>>
>> http://marc.info/?l=linux-block=150151989915776=2
>>
>
> That series is dealing with problems with legacy-deadline vs mq-none where
> as the bulk of the problems reported in this mail are related to
> legacy-CFQ vs mq-BFQ.

The series deals with none and with all mq schedulers, and you can see
the improvement on mq-deadline in the cover letter, :-)

Thanks,
Ming Lei


Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Mel Gorman
On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> Hi Mel Gorman,
> 
> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman  
> wrote:
> > Hi Christoph,
> >
> > I know the reasons for switching to MQ by default but just be aware that 
> > it's
> > not without hazards albeit it the biggest issues I've seen are switching
> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
> > running every few weeks searching for regressions. Yesterday, it noticed
> > that creating some work files for a postgres simulator called pgioperf
> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> > linearly writing two files for testing on another benchmark and is not
> > remarkable. The relevant part of the report is
> 
> We saw some SCSI-MQ performance issue too, please see if the following
> patchset fixes your issue:
> 
> http://marc.info/?l=linux-block=150151989915776=2
> 

That series is dealing with problems with legacy-deadline vs mq-none where
as the bulk of the problems reported in this mail are related to
legacy-CFQ vs mq-BFQ.

-- 
Mel Gorman
SUSE Labs


Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Ming Lei
On Thu, Aug 3, 2017 at 5:17 PM, Ming Lei  wrote:
> Hi Mel Gorman,
>
> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman  
> wrote:
>> Hi Christoph,
>>
>> I know the reasons for switching to MQ by default but just be aware that it's
>> not without hazards albeit it the biggest issues I've seen are switching
>> CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> running every few weeks searching for regressions. Yesterday, it noticed
>> that creating some work files for a postgres simulator called pgioperf
>> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> linearly writing two files for testing on another benchmark and is not
>> remarkable. The relevant part of the report is
>
> We saw some SCSI-MQ performance issue too, please see if the following
> patchset fixes your issue:
>
> http://marc.info/?l=linux-block=150151989915776=2

BTW, the above patches(V1) can be found in the following tree:

  https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V1

V2 has already been done but not posted out yet, because the performance test
on SRP isn't completed:

  https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V2


Thanks,
Ming Lei


Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Paolo Valente

> On 3 Aug 2017, at 10:51, Mel Gorman  wrote:
> 
> Hi Christoph,
> 
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards albeit it the biggest issues I've seen are switching
> CFQ to BFQ. On my home grid, there is some experimental automatic testing
> running every few weeks searching for regressions. Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> linearly writing two files for testing on another benchmark and is not
> remarkable. The relevant part of the report is
> 
> Last good/First bad commit
> ==
> Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
> First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa
> From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Fri, 16 Jun 2017 10:27:55 +0200
> Subject: [PATCH] scsi: default to scsi-mq
> Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
> path now that we had plenty of testing, and have I/O schedulers for
> blk-mq.  The module option to disable the blk-mq path is kept around for
> now.
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Martin K. Petersen 
> drivers/scsi/Kconfig | 11 ---
> drivers/scsi/scsi.c  |  4 
> 2 files changed, 15 deletions(-)
> 
> Comparison
> ==
>                             initial              initial                 last                penup                first
>                          good-v4.12     bad-16f73eb02d7e        good-6d311fa7        good-d06c587d         bad-5c279bd9
> User    min          0.06 (   0.00%)      0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
> User    mean         0.06 (   0.00%)      0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
> User    stddev       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> User    coeffvar     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> User    max          0.06 (   0.00%)      0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
> System  min         10.04 (   0.00%)     10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
> System  mean        10.04 (   0.00%)     10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
> System  stddev       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> System  coeffvar     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> System  max         10.04 (   0.00%)     10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
> Elapsed min        251.53 (   0.00%)    351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
> Elapsed mean       251.53 (   0.00%)    351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
> Elapsed stddev       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> Elapsed coeffvar     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> Elapsed max        251.53 (   0.00%)    351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
> CPU     min          4.00 (   0.00%)      3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)
> CPU     mean         4.00 (   0.00%)      3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)
> CPU     stddev       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> CPU     coeffvar     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
> CPU     max          4.00 (   0.00%)      3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)
> 
> The "Elapsed mean" line is what the testing and auto-bisection was paying
> attention to. Commit 16f73eb02d7e is simply the head commit at the time
> the continuous testing started. The first "bad commit" is the last column.
> 
> It's not the only slowdown that has been observed from other testing when
> examining whether it's ok to switch to MQ by default. The biggest slowdown
> observed was with a modified version of 

Re: Switching to MQ by default may generate some bug reports

2017-08-03 Thread Ming Lei
Hi Mel Gorman,

On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman  wrote:
> Hi Christoph,
>
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards albeit it the biggest issues I've seen are switching
> CFQ to BFQ. On my home grid, there is some experimental automatic testing
> running every few weeks searching for regressions. Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> linearly writing two files for testing on another benchmark and is not
> remarkable. The relevant part of the report is

We saw some SCSI-MQ performance issue too, please see if the following
patchset fixes your issue:

http://marc.info/?l=linux-block=150151989915776=2

Thanks,
Ming


Switching to MQ by default may generate some bug reports

2017-08-03 Thread Mel Gorman
Hi Christoph,

I know the reasons for switching to MQ by default but just be aware that it's
not without hazards, albeit the biggest issues I've seen are in switching
from CFQ to BFQ. On my home grid, there is some experimental automatic testing
running every few weeks searching for regressions. Yesterday, it noticed
that creating some work files for a postgres simulator called pgioperf
was 38.33% slower and it auto-bisected to the switch to MQ. This is just
linearly writing two files for testing on another benchmark and is not
remarkable. The relevant part of the report is

Last good/First bad commit
==
Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa
From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Fri, 16 Jun 2017 10:27:55 +0200
Subject: [PATCH] scsi: default to scsi-mq
Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
path now that we had plenty of testing, and have I/O schedulers for
blk-mq.  The module option to disable the blk-mq path is kept around for
now.
Signed-off-by: Christoph Hellwig 
Signed-off-by: Martin K. Petersen 
 drivers/scsi/Kconfig | 11 ---
 drivers/scsi/scsi.c  |  4 
 2 files changed, 15 deletions(-)

Comparison
==
                            initial              initial                 last                penup                first
                         good-v4.12     bad-16f73eb02d7e        good-6d311fa7        good-d06c587d         bad-5c279bd9
User    min          0.06 (   0.00%)      0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
User    mean         0.06 (   0.00%)      0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
User    stddev       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
User    coeffvar     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
User    max          0.06 (   0.00%)      0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
System  min         10.04 (   0.00%)     10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
System  mean        10.04 (   0.00%)     10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
System  stddev       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
System  coeffvar     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
System  max         10.04 (   0.00%)     10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
Elapsed min        251.53 (   0.00%)    351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
Elapsed mean       251.53 (   0.00%)    351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
Elapsed stddev       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Elapsed coeffvar     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Elapsed max        251.53 (   0.00%)    351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
CPU     min          4.00 (   0.00%)      3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)
CPU     mean         4.00 (   0.00%)      3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)
CPU     stddev       0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
CPU     coeffvar     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
CPU     max          4.00 (   0.00%)      3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)

The "Elapsed mean" line is what the testing and auto-bisection was paying
attention to. Commit 16f73eb02d7e is simply the head commit at the time
the continuous testing started. The first "bad commit" is the last column.

It's not the only slowdown that has been observed from other testing when
examining whether it's ok to switch to MQ by default. The biggest slowdown
observed was with a modified version of dbench4 -- the modifications use
shorter, but representative, load files to avoid timing artifacts and report
the time to complete a load file instead of throughput, as throughput
is kind of meaningless for dbench4.

dbench4 Loadfile Execution Time