[PATCH BUGFIX/IMPROVEMENT V2 0/2] block, bfq: improve and refactor throughput-boosting logic
Hi,
these two patches improve the throughput-boosting logic in two respects.

The first patch refactors the parts of the device-idling logic, related
to throughput boosting, that are still scattered across the source file
bfq-iosched.c. The patch concentrates all the logic in one function.

The second patch fixes/improves device idling for flash-based devices
that have no internal queueing of I/O requests. The contribution in the
first patch has been triggered by that in the second patch: finding the
change made by the second patch has been more difficult than it had to
be, because the logic that decides whether to idle the device is
scattered across three functions.

The second patch provides a significant throughput boost for random I/O
with flash-based non-queueing devices. For example, on a HiKey board,
throughput increases by up to 125%, growing, e.g., from 6.9MB/s to
15.6MB/s with two or three random readers in parallel.

Thanks,
Paolo

Paolo Valente (2):
  block,bfq: refactor device-idling logic
  block, bfq: boost throughput with flash-based non-queueing devices

 block/bfq-iosched.c | 144
 block/bfq-iosched.h |  12 ++---
 2 files changed, 85 insertions(+), 71 deletions(-)

-- 
2.10.0
[PATCH BUGFIX/IMPROVEMENT V2 2/2] block, bfq: boost throughput with flash-based non-queueing devices
When a queue associated with a process remains empty, there are cases
where throughput gets boosted if the device is idled to await the
arrival of a new I/O request for that queue. Currently, BFQ assumes
that one of these cases is when the device has no internal queueing
(regardless of the properties of the I/O being served). Unfortunately,
this condition has proved to be too general. So, this commit refines it
as "the device has no internal queueing and is rotational".

This refinement provides a significant throughput boost with random
I/O, on flash-based storage without internal queueing. For example, on
a HiKey board, throughput increases by up to 125%, growing, e.g., from
6.9MB/s to 15.6MB/s with two or three random readers in parallel.

Signed-off-by: Paolo Valente
Signed-off-by: Luca Miccio
---
 block/bfq-iosched.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ccdc9e6..509f399 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3114,7 +3114,10 @@ static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
-	bool idling_boosts_thr, idling_boosts_thr_without_issues,
+	bool rot_without_queueing =
+		!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
+		bfqq_sequential_and_IO_bound,
+		idling_boosts_thr, idling_boosts_thr_without_issues,
 		idling_needed_for_service_guarantees,
 		asymmetric_scenario;

@@ -3133,28 +3136,34 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 	    bfq_class_idle(bfqq))
 		return false;

+	bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
+		bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);
+
 	/*
 	 * The next variable takes into account the cases where idling
 	 * boosts the throughput.
 	 *
 	 * The value of the variable is computed considering, first, that
 	 * idling is virtually always beneficial for the throughput if:
-	 * (a) the device is not NCQ-capable, or
-	 * (b) regardless of the presence of NCQ, the device is rotational
-	 *     and the request pattern for bfqq is I/O-bound and sequential.
+	 * (a) the device is not NCQ-capable and rotational, or
+	 * (b) regardless of the presence of NCQ, the device is rotational and
+	 *     the request pattern for bfqq is I/O-bound and sequential, or
+	 * (c) regardless of whether it is rotational, the device is
+	 *     not NCQ-capable and the request pattern for bfqq is
+	 *     I/O-bound and sequential.
 	 *
 	 * Secondly, and in contrast to the above item (b), idling an
 	 * NCQ-capable flash-based device would not boost the
 	 * throughput even with sequential I/O; rather it would lower
 	 * the throughput in proportion to how fast the device
 	 * is. Accordingly, the next variable is true if any of the
-	 * above conditions (a) and (b) is true, and, in particular,
-	 * happens to be false if bfqd is an NCQ-capable flash-based
-	 * device.
+	 * above conditions (a), (b) or (c) is true, and, in
+	 * particular, happens to be false if bfqd is an NCQ-capable
+	 * flash-based device.
 	 */
-	idling_boosts_thr = !bfqd->hw_tag ||
-		(!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) &&
-		 bfq_bfqq_has_short_ttime(bfqq));
+	idling_boosts_thr = rot_without_queueing ||
+		((!blk_queue_nonrot(bfqd->queue) || !bfqd->hw_tag) &&
+		 bfqq_sequential_and_IO_bound);

 	/*
 	 * The value of the next variable,
-- 
2.10.0
[PATCH BUGFIX/IMPROVEMENT V2 1/2] block,bfq: refactor device-idling logic
The logic that decides whether to idle the device is scattered across
three functions. Almost all of the logic is in the function
bfq_bfqq_may_idle, but (1) part of the decision is made in
bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
switch off idling regardless of the output of bfq_bfqq_may_idle. In
addition, both bfq_update_idle_window and bfq_bfqq_must_idle make their
decisions as a function of parameters that are used, for similar
purposes, also in bfq_bfqq_may_idle. This commit addresses these issues
by moving all the logic into bfq_bfqq_may_idle.

Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 117 +++-
 block/bfq-iosched.h |  12 +++---
 2 files changed, 67 insertions(+), 62 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..ccdc9e6 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -128,7 +128,7 @@ BFQ_BFQQ_FNS(busy);
 BFQ_BFQQ_FNS(wait_request);
 BFQ_BFQQ_FNS(non_blocking_wait_rq);
 BFQ_BFQQ_FNS(fifo_expire);
-BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(has_short_ttime);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(IO_bound);
 BFQ_BFQQ_FNS(in_large_burst);
@@ -731,10 +731,10 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
 	unsigned int old_wr_coeff = bfqq->wr_coeff;
 	bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);

-	if (bic->saved_idle_window)
-		bfq_mark_bfqq_idle_window(bfqq);
+	if (bic->saved_has_short_ttime)
+		bfq_mark_bfqq_has_short_ttime(bfqq);
 	else
-		bfq_clear_bfqq_idle_window(bfqq);
+		bfq_clear_bfqq_has_short_ttime(bfqq);

 	if (bic->saved_IO_bound)
 		bfq_mark_bfqq_IO_bound(bfqq);
@@ -2012,7 +2012,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
 		return;

 	bic->saved_ttime = bfqq->ttime;
-	bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+	bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
 	bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
 	bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
 	bic->was_in_burst_list = !hlist_unhashed(&bic->burst_list_node);
@@ -3038,8 +3038,8 @@ void bfq_bfqq_expire(struct bfq_data *bfqd,
 	}

 	bfq_log_bfqq(bfqd, bfqq,
-		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
-		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+		"expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,
+		slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));

 	/*
 	 * Increase, decrease or leave budget unchanged according to
@@ -3122,6 +3122,18 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 		return true;

 	/*
+	 * Idling is performed only if slice_idle > 0. In addition, we
+	 * do not idle if
+	 * (a) bfqq is async
+	 * (b) bfqq is in the idle io prio class: in this case we do
+	 * not idle because we want to minimize the bandwidth that
+	 * queues in this class can steal to higher-priority queues
+	 */
+	if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
+	    bfq_class_idle(bfqq))
+		return false;
+
+	/*
 	 * The next variable takes into account the cases where idling
 	 * boosts the throughput.
 	 *
@@ -3142,7 +3154,7 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 	 */
 	idling_boosts_thr = !bfqd->hw_tag ||
 		(!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) &&
-		 bfq_bfqq_idle_window(bfqq));
+		 bfq_bfqq_has_short_ttime(bfqq));

 	/*
 	 * The value of the next variable,
@@ -3313,16 +3325,13 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 		asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq);

 	/*
-	 * We have now all the components we need to compute the return
-	 * value of the function, which is true only if both the following
-	 * conditions hold:
-	 * 1) bfqq is sync, because idling make sense only for sync queues;
-	 * 2) idling either boosts the throughput (without issues), or
-	 *    is necessary to preserve service guarantees.
+	 * We have now all the components we need to compute the
+	 * return value of the function, which is true only if idling
+	 * either boosts the throughput (without issues), or is
+	 * necessary to preserve service guarantees.
 	 */
-	return bfq_bfqq_sync(bfqq) &&
-		(idling_boosts_thr_without_issues ||
-		 idling_needed_for_service_guarantees);
+	return idling_boosts_thr_without_issues ||
+		idling_needed_for_service_guarantees;
 }

 /*
@@ -3338,10 +3347,7 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 */
 static
Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time
On 08/03/2017 03:29 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>>  static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>> @@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>>  {
>>  	struct mq_inflight *mi = priv;
>>
>> -	if (rq->part == mi->part)
>> -		mi->inflight++;
>> +	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>> +		return;
>
> Should the REQ_ATOM_STARTED test perhaps have been introduced in patch 3/4
> instead of in this patch?

It should, moved.

>> +	if (rq->part == mi->part1) {
>> +		mi->inflight[0]++;
>> +		if (mi->part1->partno &&
>> +		    &part_to_disk(mi->part1)->part0 == mi->part2)
>> +			mi->inflight[1]++;
>> +	} else if (rq->part == mi->part2)
>> +		mi->inflight[1]++;
>>  }
>
> So mi->part2 may represent part0 but mi->part1 not? Does that deserve a
> comment?
>
> Additionally, shouldn't the mi->part2 == part0 test be moved out of the
> if-statement such that all requests are counted for part0 instead of
> storing the same count in inflight[0] and inflight[1] if mi->part2 == part0?

I think I'll just clean up the whole thing and get rid of part1/part2.
The two can only exist together, if part1 is a partition. So will be
easier to just unconditionally sum part+root, if part is a partition.
Then we only need to pass in 'part', not part1/2.

>> -unsigned int blk_mq_in_flight(struct request_queue *q,
>> -			      struct hd_struct *part)
>> +void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
>> +		      struct hd_struct *part2, unsigned int *inflight)
>>  {
>
> Should inflight be declared as an array to make it clear that it is a
> pointer to an array with two elements?

Sure, I can do that.

-- 
Jens Axboe
Re: [PATCH 3/4] blk-mq: provide internal in-flight variant
On 08/03/2017 03:25 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>> +				  struct request *rq, void *priv,
>> +				  bool reserved)
>> +{
>> +	struct mq_inflight *mi = priv;
>> +
>> +	if (rq->part == mi->part)
>> +		mi->inflight++;
>> +}
>> [ ... ]
>> -static inline void part_inc_in_flight(struct request_queue *q,
>> -				      struct hd_struct *part, int rw)
>> -{
>> -	atomic_inc(&part->in_flight[rw]);
>> -	if (part->partno)
>> -		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> -}
>
> Hello Jens,
>
> The existing part_inc_in_flight() code includes all requests in the
> in_flight statistics for part0 but the new code in blk_mq_check_inflight()
> not. Is that on purpose? Should the rq->part == mi->part check perhaps be
> skipped if mi->part represents part0?

The existing code increments always for the partition in question, and
for the root if it's a partition. I'll take a look at that logic, and
ensure it's all correct.

-- 
Jens Axboe
Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote: > We only need to iterate the queues and tags once, we if just pass Hello Jens, Regarding the patch description: it seems to me like in the text "we if" these two words appear in the wrong order? Bart.
Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>  static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> @@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>  {
>  	struct mq_inflight *mi = priv;
>
> -	if (rq->part == mi->part)
> -		mi->inflight++;
> +	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> +		return;

Should the REQ_ATOM_STARTED test perhaps have been introduced in patch
3/4 instead of in this patch?

> +	if (rq->part == mi->part1) {
> +		mi->inflight[0]++;
> +		if (mi->part1->partno &&
> +		    &part_to_disk(mi->part1)->part0 == mi->part2)
> +			mi->inflight[1]++;
> +	} else if (rq->part == mi->part2)
> +		mi->inflight[1]++;
>  }

So mi->part2 may represent part0 but mi->part1 not? Does that deserve a
comment?

Additionally, shouldn't the mi->part2 == part0 test be moved out of the
if-statement such that all requests are counted for part0 instead of
storing the same count in inflight[0] and inflight[1] if mi->part2 ==
part0?

> -unsigned int blk_mq_in_flight(struct request_queue *q,
> -			      struct hd_struct *part)
> +void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
> +		      struct hd_struct *part2, unsigned int *inflight)
>  {

Should inflight be declared as an array to make it clear that it is a
pointer to an array with two elements?

Thanks,

Bart.
Re: [PATCH 3/4] blk-mq: provide internal in-flight variant
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> +				  struct request *rq, void *priv,
> +				  bool reserved)
> +{
> +	struct mq_inflight *mi = priv;
> +
> +	if (rq->part == mi->part)
> +		mi->inflight++;
> +}
> [ ... ]
> -static inline void part_inc_in_flight(struct request_queue *q,
> -				      struct hd_struct *part, int rw)
> -{
> -	atomic_inc(&part->in_flight[rw]);
> -	if (part->partno)
> -		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> -}

Hello Jens,

The existing part_inc_in_flight() code includes all requests in the
in_flight statistics for part0 but the new code in
blk_mq_check_inflight() not. Is that on purpose? Should the rq->part ==
mi->part check perhaps be skipped if mi->part represents part0?

Thanks,

Bart.
Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
On 08/03/2017 02:50 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:40 -0600, Jens Axboe wrote:
>> On 08/03/2017 02:35 PM, Jens Axboe wrote:
>>>> I agree with what you wrote in the description of this patch.
>>>> However, since I have not yet found the code that clears tags->rqs[],
>>>> would it be possible to show me that code?
>>>
>>> Since it's been a month since I wrote this code, I went and looked
>>> too. My memory was that we set/clear it dynamically since we added
>>> scheduling, but looks like we don't clear it. The race is still valid
>>> for when someone runs a tag check in parallel with someone allocating
>>> a tag, since there's a window of time where the tag bit is set, but
>>> ->rqs[tag] isn't set yet. That's probably the race I hit, not the
>>> completion race mentioned in the change log.
>>
>> Rewrote the commit message:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight&id=1908e43118e688e41ac8656edcf3e7a150f3f5081
>
> Hello Jens,
>
> This is what I found in the updated commit:
>
>     blk-mq-tag: check for NULL rq when iterating tags
>
>     Since we introduced blk-mq-sched, the tags->rqs[] array has been
>     dynamically assigned. So we need to check for NULL when iterating,
>     since there's a window of time where the bit is set, but we haven't
>     dynamically assigned the tags->rqs[] array position yet.
>
>     This is perfectly safe, since the memory backing of the request is
>     never going away while the device is alive.
>
> Does this mean that blk_mq_tagset_busy_iter() can skip requests that it
> shouldn't skip and also that blk_mq_tagset_busy_iter() can pass a pointer
> to the previous request that was associated with a tag instead of the
> current request to its busy_tag_iter_fn argument? Shouldn't these races
> be fixed, e.g. by swapping the order in which the tags are set and
> tags->rqs[] are assigned such that the correct request pointer is passed
> to the busy_tag_iter_fn argument?

We can't swap them. We need to reserve a bit first, which entails
setting that bit. Once that's set, we can assign ->rqs[]. The race isn't
a big deal, and I don't want to add any code to prevent it, since that
would mean locking.

Since we have the luxury of the request itself always being valid
memory, we can deal with stale info or having a NULL because of the
ordering. It's a conscious trade off.

-- 
Jens Axboe
Re: [PATCH 3/4] blk-mq: provide internal in-flight variant
On Thu, 2017-08-03 at 14:45 -0600, Jens Axboe wrote: > The inc/dec goes away for mq, the non-mq path still has to use them. I just > move them as well. Could be a prep patch, but it's just moving the code out > of the header and into a normal C file instead. Hello Jens, I misread that part of the patch. Now I have had another look these changes look fine to me. Hence: Reviewed-by: Bart Van Assche
Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
On Thu, 2017-08-03 at 14:40 -0600, Jens Axboe wrote:
> On 08/03/2017 02:35 PM, Jens Axboe wrote:
> > > I agree with what you wrote in the description of this patch.
> > > However, since I have not yet found the code that clears tags->rqs[],
> > > would it be possible to show me that code?
> >
> > Since it's been a month since I wrote this code, I went and looked
> > too. My memory was that we set/clear it dynamically since we added
> > scheduling, but looks like we don't clear it. The race is still valid
> > for when someone runs a tag check in parallel with someone allocating
> > a tag, since there's a window of time where the tag bit is set, but
> > ->rqs[tag] isn't set yet. That's probably the race I hit, not the
> > completion race mentioned in the change log.
>
> Rewrote the commit message:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight&id=1908e43118e688e41ac8656edcf3e7a150f3f5081

Hello Jens,

This is what I found in the updated commit:

    blk-mq-tag: check for NULL rq when iterating tags

    Since we introduced blk-mq-sched, the tags->rqs[] array has been
    dynamically assigned. So we need to check for NULL when iterating,
    since there's a window of time where the bit is set, but we haven't
    dynamically assigned the tags->rqs[] array position yet.

    This is perfectly safe, since the memory backing of the request is
    never going away while the device is alive.

Does this mean that blk_mq_tagset_busy_iter() can skip requests that it
shouldn't skip and also that blk_mq_tagset_busy_iter() can pass a
pointer to the previous request that was associated with a tag instead
of the current request to its busy_tag_iter_fn argument? Shouldn't these
races be fixed, e.g. by swapping the order in which the tags are set and
tags->rqs[] are assigned such that the correct request pointer is passed
to the busy_tag_iter_fn argument?

Thanks,

Bart.
Re: [PATCH 3/4] blk-mq: provide internal in-flight variant
On 08/03/2017 02:41 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> We don't have to inc/dec some counter, since we can just
>> iterate the tags. That makes inc/dec a noop, but means we
>> have to iterate busy tags to get an in-flight count.
>> [ ... ]
>> +unsigned int blk_mq_in_flight(struct request_queue *q,
>> +			      struct hd_struct *part)
>> +{
>> +	struct mq_inflight mi = { .part = part, .inflight = 0 };
>
> Hello Jens,
>
> A minor stylistic comment: since a C compiler is required to
> initialize to zero all member variables that have not been initialized
> explicitly I think ".inflight = 0" can be left out.

It can, I'll kill it.

>> diff --git a/block/genhd.c b/block/genhd.c
>> index f735af67a0c9..ad5dc567d57f 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk);
>>  static void disk_del_events(struct gendisk *disk);
>>  static void disk_release_events(struct gendisk *disk);
>>
>> +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
>> +{
>> +	if (q->mq_ops)
>> +		return;
>> +
>> +	atomic_inc(&part->in_flight[rw]);
>> +	if (part->partno)
>> +		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> +}
>> [ ... ]
>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>> index 7f7427e00f9c..f2c5096b3a7e 100644
>> --- a/include/linux/genhd.h
>> +++ b/include/linux/genhd.h
>> @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part)
>>  #define part_stat_sub(cpu, gendiskp, field, subnd)	\
>>  	part_stat_add(cpu, gendiskp, field, -subnd)
>>
>> -static inline void part_inc_in_flight(struct request_queue *q,
>> -				      struct hd_struct *part, int rw)
>> -{
>> -	atomic_inc(&part->in_flight[rw]);
>> -	if (part->partno)
>> -		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>> -}
>> [ ... ]
>
> Sorry but to me it seems like this part of the patch does not match the
> patch description? The patch description mentions that inc and dec
> become a noop but it seems to me that these functions have been
> uninlined instead of making these a noop?

The inc/dec goes away for mq, the non-mq path still has to use them. I
just move them as well. Could be a prep patch, but it's just moving the
code out of the header and into a normal C file instead.

-- 
Jens Axboe
Re: [PATCH 3/4] blk-mq: provide internal in-flight variant
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> We don't have to inc/dec some counter, since we can just
> iterate the tags. That makes inc/dec a noop, but means we
> have to iterate busy tags to get an in-flight count.
> [ ... ]
> +unsigned int blk_mq_in_flight(struct request_queue *q,
> +			      struct hd_struct *part)
> +{
> +	struct mq_inflight mi = { .part = part, .inflight = 0 };

Hello Jens,

A minor stylistic comment: since a C compiler is required to initialize
to zero all member variables that have not been initialized explicitly I
think ".inflight = 0" can be left out.

> diff --git a/block/genhd.c b/block/genhd.c
> index f735af67a0c9..ad5dc567d57f 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk);
>  static void disk_del_events(struct gendisk *disk);
>  static void disk_release_events(struct gendisk *disk);
>
> +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
> +{
> +	if (q->mq_ops)
> +		return;
> +
> +	atomic_inc(&part->in_flight[rw]);
> +	if (part->partno)
> +		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> +}
> [ ... ]
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 7f7427e00f9c..f2c5096b3a7e 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part)
>  #define part_stat_sub(cpu, gendiskp, field, subnd)	\
>  	part_stat_add(cpu, gendiskp, field, -subnd)
>
> -static inline void part_inc_in_flight(struct request_queue *q,
> -				      struct hd_struct *part, int rw)
> -{
> -	atomic_inc(&part->in_flight[rw]);
> -	if (part->partno)
> -		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
> -}
> [ ... ]

Sorry but to me it seems like this part of the patch does not match the
patch description? The patch description mentions that inc and dec
become a noop but it seems to me that these functions have been
uninlined instead of making these a noop?

Bart.
Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
On 08/03/2017 02:35 PM, Jens Axboe wrote:
>> I agree with what you wrote in the description of this patch.
>> However, since I have not yet found the code that clears tags->rqs[],
>> would it be possible to show me that code?
>
> Since it's been a month since I wrote this code, I went and looked
> too. My memory was that we set/clear it dynamically since we added
> scheduling, but looks like we don't clear it. The race is still valid
> for when someone runs a tag check in parallel with someone allocating
> a tag, since there's a window of time where the tag bit is set, but
> ->rqs[tag] isn't set yet. That's probably the race I hit, not the
> completion race mentioned in the change log.

Rewrote the commit message:

http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight&id=1908e43118e688e41ac8656edcf3e7a150f3f5081

-- 
Jens Axboe
Re: [PATCH 2/4] block: pass in queue to inflight accounting
On 08/03/2017 02:35 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> No functional change in this patch, just in preparation for
>> basing the inflight mechanism on the queue in question.
>>
>> Signed-off-by: Jens Axboe
>> ---
>>  block/bio.c                   | 16
>>  block/blk-core.c              | 22 --
>>  block/blk-merge.c             |  4 ++--
>>  block/genhd.c                 |  4 ++--
>>  block/partition-generic.c     |  5 +++--
>>  drivers/block/drbd/drbd_req.c | 10 +++---
>>  drivers/block/rsxx/dev.c      |  6 +++---
>>  drivers/block/zram/zram_drv.c |  5 +++--
>>  drivers/md/bcache/request.c   |  7 ---
>>  drivers/md/dm.c               |  6 +++---
>>  drivers/nvdimm/nd.h           |  5 +++--
>>  include/linux/bio.h           |  9 +
>>  include/linux/genhd.h         | 14 +-
>>  13 files changed, 64 insertions(+), 49 deletions(-)
>
> Hello Jens,
>
> Should the DRBD, rsxx, zram, md and nvdimm driver maintainers be CC-ed?

Yeah, they should, forgot to do so. I'll make sure they see it, since
this is the initial posting, I'm assuming I'll have to do a v2 anyway.

-- 
Jens Axboe
Re: [PATCH 2/4] block: pass in queue to inflight accounting
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> No functional change in this patch, just in preparation for
> basing the inflight mechanism on the queue in question.
>
> Signed-off-by: Jens Axboe
> ---
>  block/bio.c                   | 16
>  block/blk-core.c              | 22 --
>  block/blk-merge.c             |  4 ++--
>  block/genhd.c                 |  4 ++--
>  block/partition-generic.c     |  5 +++--
>  drivers/block/drbd/drbd_req.c | 10 +++---
>  drivers/block/rsxx/dev.c      |  6 +++---
>  drivers/block/zram/zram_drv.c |  5 +++--
>  drivers/md/bcache/request.c   |  7 ---
>  drivers/md/dm.c               |  6 +++---
>  drivers/nvdimm/nd.h           |  5 +++--
>  include/linux/bio.h           |  9 +
>  include/linux/genhd.h         | 14 +-
>  13 files changed, 64 insertions(+), 49 deletions(-)

Hello Jens,

Should the DRBD, rsxx, zram, md and nvdimm driver maintainers be CC-ed?
Anyway:

Reviewed-by: Bart Van Assche
Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
On 08/03/2017 02:29 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> Since we introduced blk-mq-sched, the tags->rqs[] array has been
>> dynamically assigned. So we need to check for NULL when iterating,
>> since we could be racing with completion.
>>
>> This is perfectly safe, since the memory backing of the request is
>> never going away while the device is alive. Only the pointer in
>> ->rqs[] may be reset.
>>
>> Signed-off-by: Jens Axboe
>> ---
>>  block/blk-mq-tag.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d0be72ccb091..b856b2827157 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  		bitnr += tags->nr_reserved_tags;
>>  	rq = tags->rqs[bitnr];
>>
>> -	if (rq->q == hctx->queue)
>> +	if (rq && rq->q == hctx->queue)
>>  		iter_data->fn(hctx, rq, iter_data->data, reserved);
>>  	return true;
>>  }
>> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  	if (!reserved)
>>  		bitnr += tags->nr_reserved_tags;
>>  	rq = tags->rqs[bitnr];
>> -
>> -	iter_data->fn(rq, iter_data->data, reserved);
>> +	if (rq)
>> +		iter_data->fn(rq, iter_data->data, reserved);
>>  	return true;
>>  }
>
> Hello Jens,
>
> I agree with what you wrote in the description of this patch. However,
> since I have not yet found the code that clears tags->rqs[], would it
> be possible to show me that code?

Since it's been a month since I wrote this code, I went and looked
too. My memory was that we set/clear it dynamically since we added
scheduling, but looks like we don't clear it. The race is still valid
for when someone runs a tag check in parallel with someone allocating
a tag, since there's a window of time where the tag bit is set, but
->rqs[tag] isn't set yet. That's probably the race I hit, not the
completion race mentioned in the change log.

-- 
Jens Axboe
Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> Since we introduced blk-mq-sched, the tags->rqs[] array has been
> dynamically assigned. So we need to check for NULL when iterating,
> since we could be racing with completion.
>
> This is perfectly safe, since the memory backing of the request is
> never going away while the device is alive. Only the pointer in
> ->rqs[] may be reset.
>
> Signed-off-by: Jens Axboe
> ---
>  block/blk-mq-tag.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d0be72ccb091..b856b2827157 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  		bitnr += tags->nr_reserved_tags;
>  	rq = tags->rqs[bitnr];
>
> -	if (rq->q == hctx->queue)
> +	if (rq && rq->q == hctx->queue)
>  		iter_data->fn(hctx, rq, iter_data->data, reserved);
>  	return true;
>  }
> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	if (!reserved)
>  		bitnr += tags->nr_reserved_tags;
>  	rq = tags->rqs[bitnr];
> -
> -	iter_data->fn(rq, iter_data->data, reserved);
> +	if (rq)
> +		iter_data->fn(rq, iter_data->data, reserved);
>  	return true;
>  }

Hello Jens,

I agree with what you wrote in the description of this patch. However,
since I have not yet found the code that clears tags->rqs[], would it be
possible to show me that code?

Thanks,

Bart.
[PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time
We only need to iterate the queues and tags once, if we just pass in
both partition structs. Add part_in_flight_double() for returning this
value.

Signed-off-by: Jens Axboe
---
 block/blk-core.c      | 34 +++---
 block/blk-mq.c        | 25 +
 block/blk-mq.h        |  3 ++-
 block/genhd.c         | 23 +--
 include/linux/genhd.h |  2 ++
 5 files changed, 65 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ba88d71b442f..d111a8f5bdf1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1470,14 +1470,9 @@ static void add_acct_request(struct request_queue *q, struct request *rq,
 }
 
 static void part_round_stats_single(struct request_queue *q, int cpu,
-				    struct hd_struct *part, unsigned long now)
+				    struct hd_struct *part, unsigned long now,
+				    unsigned int inflight)
 {
-	int inflight;
-
-	if (now == part->stamp)
-		return;
-
-	inflight = part_in_flight(q, part);
 	if (inflight) {
 		__part_stat_add(cpu, part, time_in_queue,
 				inflight * (now - part->stamp));
@@ -1505,12 +1500,29 @@ static void part_round_stats_single(struct request_queue *q, int cpu,
  */
 void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part)
 {
+	struct hd_struct *part2 = NULL;
 	unsigned long now = jiffies;
+	unsigned int inflight[2];
+	int stats = 0;
+
+	if (part->stamp != now)
+		stats |= 1;
+
+	if (part->partno) {
+		part2 = &part_to_disk(part)->part0;
+		if (part2->stamp != now)
+			stats |= 2;
+	}
+
+	if (!stats)
+		return;
+
+	part_in_flight_double(q, part, part2, inflight);
 
-	if (part->partno)
-		part_round_stats_single(q, cpu, &part_to_disk(part)->part0,
-					now);
-	part_round_stats_single(q, cpu, part, now);
+	if (stats & 2)
+		part_round_stats_single(q, cpu, part2, now, inflight[1]);
+	if (stats & 1)
+		part_round_stats_single(q, cpu, part, now, inflight[0]);
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 37035891e120..03d164917160 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -87,8 +87,9 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 }
 
 struct mq_inflight {
-	struct hd_struct *part;
-	unsigned int inflight;
+	struct hd_struct *part1;
+	struct hd_struct *part2;
+	unsigned int *inflight;
 };
 
 static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
@@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 {
 	struct mq_inflight *mi = priv;
 
-	if (rq->part == mi->part)
-		mi->inflight++;
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+		return;
+	if (rq->part == mi->part1) {
+		mi->inflight[0]++;
+		if (mi->part1->partno &&
+		    &part_to_disk(mi->part1)->part0 == mi->part2)
+			mi->inflight[1]++;
+	} else if (rq->part == mi->part2)
+		mi->inflight[1]++;
 }
 
-unsigned int blk_mq_in_flight(struct request_queue *q,
-			      struct hd_struct *part)
+void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
+		      struct hd_struct *part2, unsigned int *inflight)
 {
-	struct mq_inflight mi = { .part = part, .inflight = 0 };
+	struct mq_inflight mi = { .part1 = part1, .part2 = part2,
+				  .inflight = inflight };
 
+	inflight[0] = inflight[1] = 0;
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
-	return mi.inflight;
 }
 
 void blk_freeze_queue_start(struct request_queue *q)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cade1a512a01..c5264a64fb7c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -138,6 +138,7 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
-unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part);
+void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
+		      struct hd_struct *part2, unsigned int *inflight);
 
 #endif
diff --git a/block/genhd.c b/block/genhd.c
index ad5dc567d57f..6faacccd8453 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -67,13 +67,32 @@ void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
 
 int part_in_flight(struct request_queue *q, struct hd_struct *part)
 {
-	if (q->mq_ops)
-		return blk_mq_in_flight(q, part);
+	if (q->mq_ops) {
+		unsigned int inflight[2];
+
[PATCH 0/4] block: more scalable inflight tracking
One scaling issue we currently have in the block code is the inflight
accounting. It's based on a per-device atomic count for reads and
writes, which means that even for an mq device with lots of hardware
queues, we end up dirtying a per-device cacheline for each IO. The
issue can easily be observed by using null_blk:

modprobe null_blk submit_queues=48 queue_mode=2

and running a fio job that has 32 jobs doing sync reads on the device
(average of 3 runs, though deviation is low):

stats   IOPS    usr     sys
--
on      2.6M    5.4%    94.6%
off     21.0M   33.7%   67.3%

which shows a 10x slowdown with stats enabled. If we look at the
profile for stats on, the top entries are:

    37.38%  fio  [kernel.vmlinux]  [k] blk_account_io_done
    14.65%  fio  [kernel.vmlinux]  [k] blk_account_io_start
    14.29%  fio  [kernel.vmlinux]  [k] part_round_stats_single
    11.81%  fio  [kernel.vmlinux]  [k] blk_account_io_completion
     3.62%  fio  [kernel.vmlinux]  [k] part_round_stats
     0.81%  fio  [kernel.vmlinux]  [k] __blkdev_direct_IO_simple

which shows the system time being dominated by the stats accounting.

This patch series replaces the atomic counter with using the tags
Hamming weight for tracking inflight counts. This means we don't have
to do anything when IO starts or completes, and for reading the value
we just have to check how many bits we have set in the tag maps on the
queues. This part is limited to 1000 per second (for HZ=1000).

Using this approach, running the same test again results in:

stats   IOPS    usr     sys
--
on      20.4M   30.9%   69.0%
off     21.4M   32.4%   67.4%

and doing a profiled run with stats on, the top of stats reporting is
now:

     1.23%  fio  [kernel.vmlinux]  [k] blk_account_io_done
     0.83%  fio  [kernel.vmlinux]  [k] blk_account_io_start
     0.55%  fio  [kernel.vmlinux]  [k] blk_account_io_completion

which is a lot more reasonable. The difference between stats on and
off is now also negligible.

-- 
Jens Axboe
[PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
Since we introduced blk-mq-sched, the tags->rqs[] array has been
dynamically assigned. So we need to check for NULL when iterating,
since we could be racing with completion.

This is perfectly safe, since the memory backing of the request is
never going away while the device is alive. Only the pointer in
->rqs[] may be reset.

Signed-off-by: Jens Axboe
---
 block/blk-mq-tag.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d0be72ccb091..b856b2827157 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 		bitnr += tags->nr_reserved_tags;
 	rq = tags->rqs[bitnr];
 
-	if (rq->q == hctx->queue)
+	if (rq && rq->q == hctx->queue)
 		iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
 	rq = tags->rqs[bitnr];
-
-	iter_data->fn(rq, iter_data->data, reserved);
+	if (rq)
+		iter_data->fn(rq, iter_data->data, reserved);
 	return true;
 }
-- 
2.7.4
[PATCH 2/4] block: pass in queue to inflight accounting
No functional change in this patch, just in preparation for basing the inflight mechanism on the queue in question. Signed-off-by: Jens Axboe--- block/bio.c | 16 block/blk-core.c | 22 -- block/blk-merge.c | 4 ++-- block/genhd.c | 4 ++-- block/partition-generic.c | 5 +++-- drivers/block/drbd/drbd_req.c | 10 +++--- drivers/block/rsxx/dev.c | 6 +++--- drivers/block/zram/zram_drv.c | 5 +++-- drivers/md/bcache/request.c | 7 --- drivers/md/dm.c | 6 +++--- drivers/nvdimm/nd.h | 5 +++-- include/linux/bio.h | 9 + include/linux/genhd.h | 14 +- 13 files changed, 64 insertions(+), 49 deletions(-) diff --git a/block/bio.c b/block/bio.c index 9cf98b29588a..4d751a5b56f0 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1729,29 +1729,29 @@ void bio_check_pages_dirty(struct bio *bio) } } -void generic_start_io_acct(int rw, unsigned long sectors, - struct hd_struct *part) +void generic_start_io_acct(struct request_queue *q, int rw, + unsigned long sectors, struct hd_struct *part) { int cpu = part_stat_lock(); - part_round_stats(cpu, part); + part_round_stats(q, cpu, part); part_stat_inc(cpu, part, ios[rw]); part_stat_add(cpu, part, sectors[rw], sectors); - part_inc_in_flight(part, rw); + part_inc_in_flight(q, part, rw); part_stat_unlock(); } EXPORT_SYMBOL(generic_start_io_acct); -void generic_end_io_acct(int rw, struct hd_struct *part, -unsigned long start_time) +void generic_end_io_acct(struct request_queue *q, int rw, +struct hd_struct *part, unsigned long start_time) { unsigned long duration = jiffies - start_time; int cpu = part_stat_lock(); part_stat_add(cpu, part, ticks[rw], duration); - part_round_stats(cpu, part); - part_dec_in_flight(part, rw); + part_round_stats(q, cpu, part); + part_dec_in_flight(q, part, rw); part_stat_unlock(); } diff --git a/block/blk-core.c b/block/blk-core.c index af393d5a9680..ba88d71b442f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1469,15 +1469,15 @@ static void add_acct_request(struct request_queue *q, struct request *rq, 
__elv_add_request(q, rq, where);
 }
 
-static void part_round_stats_single(int cpu, struct hd_struct *part,
-				    unsigned long now)
+static void part_round_stats_single(struct request_queue *q, int cpu,
+				    struct hd_struct *part, unsigned long now)
 {
 	int inflight;
 
 	if (now == part->stamp)
 		return;
 
-	inflight = part_in_flight(part);
+	inflight = part_in_flight(q, part);
 	if (inflight) {
 		__part_stat_add(cpu, part, time_in_queue,
 				inflight * (now - part->stamp));
@@ -1488,6 +1488,7 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
 
 /**
  * part_round_stats() - Round off the performance stats on a struct disk_stats.
+ * @q: target block queue
  * @cpu: cpu number for stats access
  * @part: target partition
  *
@@ -1502,13 +1503,14 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
  * /proc/diskstats. This accounts immediately for all queue usage up to
  * the current jiffies and restarts the counters again.
  */
-void part_round_stats(int cpu, struct hd_struct *part)
+void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part)
 {
 	unsigned long now = jiffies;
 
 	if (part->partno)
-		part_round_stats_single(cpu, &part_to_disk(part)->part0, now);
-	part_round_stats_single(cpu, part, now);
+		part_round_stats_single(q, cpu, &part_to_disk(part)->part0,
+					now);
+	part_round_stats_single(q, cpu, part, now);
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
@@ -2434,8 +2436,8 @@ void blk_account_io_done(struct request *req)
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
-		part_round_stats(cpu, part);
-		part_dec_in_flight(part, rw);
+		part_round_stats(req->q, cpu, part);
+		part_dec_in_flight(req->q, part, rw);
 
 		hd_struct_put(part);
 		part_stat_unlock();
@@ -2492,8 +2494,8 @@ void blk_account_io_start(struct request *rq, bool new_io)
 			part = &rq->rq_disk->part0;
 			hd_struct_get(part);
 		}
-		part_round_stats(cpu, part);
-		part_inc_in_flight(part, rw);
+		part_round_stats(rq->q, cpu, part);
+		part_inc_in_flight(rq->q, part, rw);
[PATCH 3/4] blk-mq: provide internal in-flight variant
We don't have to inc/dec some counter, since we can just iterate the
tags. That makes inc/dec a noop, but means we have to iterate busy
tags to get an in-flight count.

Signed-off-by: Jens Axboe
---
 block/blk-mq.c        | 24
 block/blk-mq.h        |  2 ++
 block/genhd.c         | 29 +
 include/linux/genhd.h | 25 +++--
 4 files changed, 58 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 05dfa3f270ae..37035891e120 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
 }
 
+struct mq_inflight {
+	struct hd_struct *part;
+	unsigned int inflight;
+};
+
+static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
+				  struct request *rq, void *priv,
+				  bool reserved)
+{
+	struct mq_inflight *mi = priv;
+
+	if (rq->part == mi->part)
+		mi->inflight++;
+}
+
+unsigned int blk_mq_in_flight(struct request_queue *q,
+			      struct hd_struct *part)
+{
+	struct mq_inflight mi = { .part = part, .inflight = 0 };
+
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+	return mi.inflight;
+}
+
 void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1a06fdf9fd4d..cade1a512a01 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -138,4 +138,6 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
+unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part);
+
 #endif
diff --git a/block/genhd.c b/block/genhd.c
index f735af67a0c9..ad5dc567d57f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk);
 static void disk_del_events(struct gendisk *disk);
 static void disk_release_events(struct gendisk *disk);
 
+void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+{
+	if (q->mq_ops)
+		return;
+
+	atomic_inc(&part->in_flight[rw]);
+	if (part->partno)
+		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+}
+
+void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+{
+	if (q->mq_ops)
+		return;
+
+	atomic_dec(&part->in_flight[rw]);
+	if (part->partno)
+		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+}
+
+int part_in_flight(struct request_queue *q, struct hd_struct *part)
+{
+	if (q->mq_ops)
+		return blk_mq_in_flight(q, part);
+
+	return atomic_read(&part->in_flight[0]) +
+		atomic_read(&part->in_flight[1]);
+}
+
 /**
  * disk_get_part - get partition
  * @disk: disk to look partition from
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7f7427e00f9c..f2c5096b3a7e 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part)
 #define part_stat_sub(cpu, gendiskp, field, subnd)	\
 	part_stat_add(cpu, gendiskp, field, -subnd)
 
-static inline void part_inc_in_flight(struct request_queue *q,
-				      struct hd_struct *part, int rw)
-{
-	atomic_inc(&part->in_flight[rw]);
-	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
-}
-
-static inline void part_dec_in_flight(struct request_queue *q,
-				      struct hd_struct *part, int rw)
-{
-	atomic_dec(&part->in_flight[rw]);
-	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
-}
-
-static inline int part_in_flight(struct request_queue *q,
-				 struct hd_struct *part)
-{
-	return atomic_read(&part->in_flight[0]) +
-		atomic_read(&part->in_flight[1]);
-}
+int part_in_flight(struct request_queue *q, struct hd_struct *part);
+void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw);
+void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw);
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
 {
-- 
2.7.4
Re: [PATCH 04/14] blk-mq-sched: improve dispatching from sw queue
On Thu, 2017-08-03 at 11:13 +0800, Ming Lei wrote:
> On Thu, Aug 03, 2017 at 01:35:29AM +, Bart Van Assche wrote:
> > On Wed, 2017-08-02 at 11:31 +0800, Ming Lei wrote:
> > > On Tue, Aug 01, 2017 at 03:11:42PM +, Bart Van Assche wrote:
> > > > On Tue, 2017-08-01 at 18:50 +0800, Ming Lei wrote:
> > > > > On Tue, Aug 01, 2017 at 06:17:18PM +0800, Ming Lei wrote:
> > > > > > How can we get the accurate 'number of requests in progress'
> > > > > > efficiently?
> > > >
> > > > Hello Ming,
> > > >
> > > > How about counting the number of bits that have been set in the tag set?
> > > > I am aware that these bits can be set and/or cleared concurrently with the
> > > > dispatch code but that count is probably a good starting point.
> > >
> > > It has to be atomic_t, which is too heavy for us, please see the report:
> > >
> > > http://marc.info/?t=14986844843=1=2
> > >
> > > Both Jens and I want to kill hd_struct.in_flight, but looks still no
> > > good way.
> >
> > Hello Ming,
> >
> > Sorry but I disagree that a new atomic variable should be added to keep track
> > of the number of busy requests. Counting the number of bits that are set in
> > the tag set should be good enough in this context.
>
> That won't work because the tag set is host wide and shared by all LUNs.

Hello Ming,

Are you aware that the SCSI core already keeps track of the number of
busy requests per LUN? See also the device_busy member of struct
scsi_device. How about giving the block layer core access in some way
to that counter?

Bart.
Re: [PATCH BUGFIX/IMPROVEMENT] block, bfq: improve and refactor throughput-boosting logic
On 08/03/2017 10:48 AM, Paolo Valente wrote: > When a queue associated with a process remains empty, there are cases > where throughput gets boosted if the device is idled to await the > arrival of a new I/O request for that queue. Currently, BFQ assumes > that one of these cases is when the device has no internal queueing > (regardless of the properties of the I/O being served). Unfortunately, > this condition has proved to be too general. So, this commit refines it > as "the device has no internal queueing and is rotational". > > This refinement provides a significant throughput boost with random > I/O, on flash-based storage without internal queueing. For example, on > a HiKey board, throughput increases by up to 125%, growing, e.g., from > 6.9MB/s to 15.6MB/s with two or three random readers in parallel. > > This commit also refactors the code related to device idling, for the > following reason. Finding the change that provides the above large > improvement has been slightly more difficult than it had to be, > because the logic that decides whether to idle the device is still > scattered across three functions. Almost all of the logic is in the > function bfq_bfqq_may_idle, but (1) part of the decision is made in > bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may > switch off idling regardless of the output of bfq_bfqq_may_idle. In > addition, both bfq_update_idle_window and bfq_bfqq_must_idle make > their decisions as a function of parameters that are used, for similar > purposes, also in bfq_bfqq_may_idle. This commit addresses this issue > by moving all the logic into bfq_bfqq_may_idle. This should be split into two patches - do one refactor patch that doesn't change anything, then your functional change on top of that. -- Jens Axboe
[PATCH BUGFIX/IMPROVEMENT] block, bfq: improve and refactor throughput-boosting logic
When a queue associated with a process remains empty, there are cases where throughput gets boosted if the device is idled to await the arrival of a new I/O request for that queue. Currently, BFQ assumes that one of these cases is when the device has no internal queueing (regardless of the properties of the I/O being served). Unfortunately, this condition has proved to be too general. So, this commit refines it as "the device has no internal queueing and is rotational". This refinement provides a significant throughput boost with random I/O, on flash-based storage without internal queueing. For example, on a HiKey board, throughput increases by up to 125%, growing, e.g., from 6.9MB/s to 15.6MB/s with two or three random readers in parallel. This commit also refactors the code related to device idling, for the following reason. Finding the change that provides the above large improvement has been slightly more difficult than it had to be, because the logic that decides whether to idle the device is still scattered across three functions. Almost all of the logic is in the function bfq_bfqq_may_idle, but (1) part of the decision is made in bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may switch off idling regardless of the output of bfq_bfqq_may_idle. In addition, both bfq_update_idle_window and bfq_bfqq_must_idle make their decisions as a function of parameters that are used, for similar purposes, also in bfq_bfqq_may_idle. This commit addresses this issue by moving all the logic into bfq_bfqq_may_idle. 
Signed-off-by: Paolo Valente
Signed-off-by: Luca Miccio
---
 block/bfq-iosched.c | 144
 block/bfq-iosched.h | 12 ++---
 2 files changed, 85 insertions(+), 71 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..509f399 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -128,7 +128,7 @@ BFQ_BFQQ_FNS(busy);
 BFQ_BFQQ_FNS(wait_request);
 BFQ_BFQQ_FNS(non_blocking_wait_rq);
 BFQ_BFQQ_FNS(fifo_expire);
-BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(has_short_ttime);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(IO_bound);
 BFQ_BFQQ_FNS(in_large_burst);
@@ -731,10 +731,10 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
 	unsigned int old_wr_coeff = bfqq->wr_coeff;
 	bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
 
-	if (bic->saved_idle_window)
-		bfq_mark_bfqq_idle_window(bfqq);
+	if (bic->saved_has_short_ttime)
+		bfq_mark_bfqq_has_short_ttime(bfqq);
 	else
-		bfq_clear_bfqq_idle_window(bfqq);
+		bfq_clear_bfqq_has_short_ttime(bfqq);
 
 	if (bic->saved_IO_bound)
 		bfq_mark_bfqq_IO_bound(bfqq);
@@ -2012,7 +2012,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
 		return;
 
 	bic->saved_ttime = bfqq->ttime;
-	bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+	bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
 	bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
 	bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
 	bic->was_in_burst_list = !hlist_unhashed(&bic->burst_list_node);
@@ -3038,8 +3038,8 @@ void bfq_bfqq_expire(struct bfq_data *bfqd,
 	}
 
 	bfq_log_bfqq(bfqd, bfqq,
-		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
-		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+		"expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,
+		slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
 
 	/*
 	 * Increase, decrease or leave budget unchanged according to
@@ -3114,7 +3114,10 @@ static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
-	bool idling_boosts_thr, idling_boosts_thr_without_issues,
+	bool rot_without_queueing =
+		!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
+		bfqq_sequential_and_IO_bound,
+		idling_boosts_thr, idling_boosts_thr_without_issues,
 	     idling_needed_for_service_guarantees,
 	     asymmetric_scenario;
 
@@ -3122,27 +3125,45 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 		return true;
 
 	/*
+	 * Idling is performed only if slice_idle > 0. In addition, we
+	 * do not idle if
+	 * (a) bfqq is async
+	 * (b) bfqq is in the idle io prio class: in this case we do
+	 * not idle because we want to minimize the bandwidth that
+	 * queues in this class can steal to higher-priority queues
+	 */
+	if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
+	    bfq_class_idle(bfqq))
+		return false;
+
+	bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
+
Re: [PATCH 17/19] bcache: fix for gc and write-back race
On 2017/7/1 at 4:43 AM, bca...@lists.ewheeler.net wrote:
> From: Tang Junhui
>
> gc and write-back race with each other (see the email "bcache get stucked"
> I sent before):
>
> gc thread                           write-back thread
> |                                   |bch_writeback_thread()
> |bch_gc_thread()                    |
> |                                   |==>read_dirty()
> |==>bch_btree_gc()                  |
> |==>btree_root() //get btree root   |
> |   node write locker               |
> |==>bch_btree_gc_root()             |
> |                                   |==>read_dirty_submit()
> |                                   |==>write_dirty()
> |                                   |==>continue_at(cl,
> |                                      write_dirty_finish, system_wq);
> |                                   |==>write_dirty_finish()//excute
> |                                   |   in system_wq
> |                                   |==>bch_btree_insert()
> |                                   |==>bch_btree_map_leaf_nodes()
> |                                   |==>__bch_btree_map_nodes()
> |                                   |==>btree_root //try to get btree
> |                                   |   root node read lock
> |                                   |-stuck here
> |==>bch_btree_set_root()            |
> |==>bch_journal_meta()              |
> |==>bch_journal()                   |
> |==>journal_try_write()             |
> |==>journal_write_unlocked() //journal_full(&c->journal) condition satisfied
> |==>continue_at(cl, journal_write, system_wq); //try to excute journal_write
> |   in system_wq
> |   //but work queue is excuting write_dirty_finish()
> |==>closure_sync(); //wait journal_write execute over and wake up gc,
> |   --stuck here
> |==>release root node write locker
>
> This patch allocs a separate work-queue for the write-back thread to avoid
> such a race.
>
> Signed-off-by: Tang Junhui
> Cc: sta...@vger.kernel.org

Adding a per-cached-device work queue is a good idea, it's OK to me.

Acked-by: Coly Li

Thanks.

Coly

> ---
>  drivers/md/bcache/bcache.h    | 1 +
>  drivers/md/bcache/super.c     | 2 ++
>  drivers/md/bcache/writeback.c | 8 ++--
>  3 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 44123e4..deb0a6c 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -333,6 +333,7 @@ struct cached_dev {
>  	/* Limit number of writeback bios in flight */
>  	struct semaphore	in_flight;
>  	struct task_struct	*writeback_thread;
> +	struct workqueue_struct	*writeback_write_wq;
> 
>  	struct keybuf		writeback_keys;
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index e06641e..24cb9b7 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1063,6 +1063,8 @@ static void cached_dev_free(struct closure *cl)
>  	cancel_delayed_work_sync(&dc->writeback_rate_update);
>  	if (!IS_ERR_OR_NULL(dc->writeback_thread))
>  		kthread_stop(dc->writeback_thread);
> +	if (dc->writeback_write_wq)
> +		destroy_workqueue(dc->writeback_write_wq);
> 
>  	mutex_lock(&bch_register_lock);
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 4104eaa..4bc5daa 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -189,7 +189,7 @@ static void write_dirty(struct closure *cl)
> 
>  	closure_bio_submit(&io->bio, cl);
> 
> -	continue_at(cl, write_dirty_finish, system_wq);
> +	continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
>  }
> 
>  static void read_dirty_endio(struct bio *bio)
> @@ -209,7 +209,7 @@ static void read_dirty_submit(struct closure *cl)
> 
>  	closure_bio_submit(&io->bio, cl);
> 
> -	continue_at(cl, write_dirty, io->dc->writeback_write_wq);
>  }
> 
>  static void read_dirty(struct cached_dev *dc)
> @@ -527,6 +527,10 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
> 
>  int bch_cached_dev_writeback_start(struct cached_dev *dc)
>  {
> +	dc->writeback_write_wq = alloc_workqueue("bcache_writeback_wq",
> +						 WQ_MEM_RECLAIM, 0);
> +	if (!dc->writeback_write_wq)
> +		return -ENOMEM;
> +
>  	dc->writeback_thread = kthread_create(bch_writeback_thread, dc,
>  					      "bcache_writeback");
>  	if (IS_ERR(dc->writeback_thread))
>
[PATCH] bio-integrity: revert "stop abusing bi_end_io"
On Wed, 2 Aug 2017, Christoph Hellwig wrote:
> On Wed, Aug 02, 2017 at 02:27:50PM +0200, Milan Broz wrote:
> > In dm-integrity target we register an integrity profile that has
> > both generate_fn and verify_fn callbacks set to NULL.
> >
> > This is used if dm-integrity is stacked under a dm-crypt device
> > for authenticated encryption (integrity payload contains authentication
> > tag and IV seed).
> >
> > In this case the verification is done through own crypto API
> > processing inside dm-crypt; integrity profile is only holder
> > of these data. (And memory is owned by dm-crypt as well.)
>
> Maybe that's where the problem lies? You're abusing the integrity
> payload for something that is not end to end data integrity at all
> and then wonder why it breaks? Also the commit that introduced your
> code had absolutely no review by Martin or any of the core block
> folks.
>
> The right fix is to revert the dm-crypt commit.

That dm-crypt commit that uses bio integrity payload came 3 months
before 7c20f11680a441df09de7235206f70115fbf6290 and it was already
present in 4.12.

The patch 7c20f11680a441df09de7235206f70115fbf6290 not only breaks
dm-crypt and dm-integrity. It also breaks regular integrity payload
usage when stacking drivers are used.

Suppose that a bio goes through the device mapper stack. At each level
a clone bio is allocated and forwarded to a lower level. It eventually
reaches the request-based physical disk driver.

In the kernel 4.12, when the bio reaches the request-based driver, the
function blk_queue_bio is called, it calls bio_integrity_prep,
bio_integrity_prep saves the bio->bi_end_io in the structure
bio_integrity_payload and sets bio->bi_end_io = bio_integrity_endio.
When the bio is finished, bio_integrity_endio is called, it performs
the verification (a call to bio_integrity_verify_fn), restores
bio->bi_end_io and calls it.
The patch 7c20f11680a441df09de7235206f70115fbf6290 changes it so that
bio->bi_end_io is not changed and bio_integrity_endio is called
directly from bio_endio if the bio has integrity payload. The
unintended consequence of this change is that when a request with
integrity payload is finished and bio_endio is called for each cloned
bio, the verification routine bio_integrity_verify_fn is called for
every level in the stack (it used to be called only for the lowest
level in 4.12). At different levels in the stack, the bio may have
different bi_sector value, so the repeated verification can't succeed.

I think the patch 7c20f11680a441df09de7235206f70115fbf6290 should be
reverted and it should be reworked for the next merge window, so that
it calls bio_integrity_verify_fn only for the lowest level.

Mikulas

From: Mikulas Patocka

The patch 7c20f11680a441df09de7235206f70115fbf6290 ("bio-integrity:
stop abusing bi_end_io") changes the code so that it doesn't replace
bio_end_io with bio_integrity_endio and calls bio_integrity_endio
directly from bio_endio. The unintended consequence of this change is
that when a bio with integrity payload is passed through the device
stack, bio_integrity_endio is called for each level of the stack (it
used to be called only for the lowest level before).

This breaks dm-integrity and dm-crypt and it also causes verification
errors when a bio with regular integrity payload is passed through the
device mapper stack.

Signed-off-by: Mikulas Patocka

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f733beab6ca2..8df4eb103ba9 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -102,7 +102,7 @@ EXPORT_SYMBOL(bio_integrity_alloc);
  * Description: Used to free the integrity portion of a bio. Usually
  * called from bio_free().
  */
-static void bio_integrity_free(struct bio *bio)
+void bio_integrity_free(struct bio *bio)
 {
 	struct bio_integrity_payload *bip = bio_integrity(bio);
 	struct bio_set *bs = bio->bi_pool;
@@ -120,8 +120,8 @@ static void bio_integrity_free(struct bio *bio)
 	}
 
 	bio->bi_integrity = NULL;
-	bio->bi_opf &= ~REQ_INTEGRITY;
 }
+EXPORT_SYMBOL(bio_integrity_free);
 
 /**
  * bio_integrity_add_page - Attach integrity metadata
@@ -326,6 +326,12 @@ bool bio_integrity_prep(struct bio *bio)
 		offset = 0;
 	}
 
+	/* Install custom I/O completion handler if read verify is enabled */
+	if (bio_data_dir(bio) == READ) {
+		bip->bip_end_io = bio->bi_end_io;
+		bio->bi_end_io = bio_integrity_endio;
+	}
+
 	/* Auto-generate integrity metadata if this is a write */
 	if (bio_data_dir(bio) == WRITE) {
 		bio_integrity_process(bio, &bio->bi_iter,
@@ -369,12 +375,13 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 		bio->bi_status = BLK_STS_IOERR;
 	}
 
-	bio_integrity_free(bio);
+	/* Restore original bio completion handler */
+	bio->bi_end_io =
Re: Switching to MQ by default may generate some bug reports
On Thu, Aug 3, 2017 at 6:47 PM, Mel Gorman wrote:
> On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
>> On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman wrote:
>> > On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> >> Hi Mel Gorman,
>> >>
>> >> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman wrote:
>> >> > Hi Christoph,
>> >> >
>> >> > I know the reasons for switching to MQ by default but just be aware
>> >> > that it's not without hazards albeit it the biggest issues I've seen
>> >> > are switching CFQ to BFQ. On my home grid, there is some experimental
>> >> > automatic testing running every few weeks searching for regressions.
>> >> > Yesterday, it noticed that creating some work files for a postgres
>> >> > simulator called pgioperf was 38.33% slower and it auto-bisected to
>> >> > the switch to MQ. This is just linearly writing two files for testing
>> >> > on another benchmark and is not remarkable. The relevant part of the
>> >> > report is
>> >>
>> >> We saw some SCSI-MQ performance issue too, please see if the following
>> >> patchset fixes your issue:
>> >>
>> >> http://marc.info/?l=linux-block=150151989915776=2
>> >>
>> >
>> > That series is dealing with problems with legacy-deadline vs mq-none
>> > where as the bulk of the problems reported in this mail are related to
>> > legacy-CFQ vs mq-BFQ.
>>
>> The series deals with none and all mq schedulers, and you can see
>> the improvement on mq-deadline in cover letter, :-)
>>
>
> Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ
> that was not observed on other schedulers?

Actually if you look at the cover letter, you will see this patchset
increases by > 10X sequential I/O IOPS on mq-deadline, so it would be
reasonable to see 2x to 4x BFQ slowdown, but I didn't test BFQ.

Thanks,
Ming Lei
Re: Switching to MQ by default may generate some bug reports
On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote:
> > For Paulo, if you want to try preemptively dealing with regression reports
> > before 4.13 releases then all the tests in question can be reproduced with
> > https://github.com/gormanm/mmtests . The most relevant test configurations
> > I've seen so far are
> >
> > configs/config-global-dhp__io-dbench4-async
> > configs/config-global-dhp__io-fio-randread-async-randwrite
> > configs/config-global-dhp__io-fio-randread-async-seqwrite
> > configs/config-global-dhp__io-fio-randread-sync-heavywrite
> > configs/config-global-dhp__io-fio-randread-sync-randwrite
> > configs/config-global-dhp__pgioperf
> >
> Hi Mel,
> as it already happened with the latest Phoronix benchmark article (and
> with other test results reported several months ago on this list), bad
> results may be caused (also) by the fact that the low-latency, default
> configuration of BFQ is being used.

I took that into account: BFQ with low-latency was also tested and the
impact was not a universal improvement although it can be a noticeable
improvement. From the same machine;

dbench4 Loadfile Execution Time
                        4.12.0             4.12.0             4.12.0
                    legacy-cfq             mq-bfq        mq-bfq-tput
Amean     1       80.67 (  0.00%)    83.68 (  -3.74%)    84.70 (  -5.00%)
Amean     2       92.87 (  0.00%)   121.63 ( -30.96%)    88.74 (   4.45%)
Amean     4      102.72 (  0.00%)   474.33 (-361.77%)   113.97 ( -10.95%)
Amean     32    2543.93 (  0.00%)  1927.65 (  24.23%)  2038.74 (  19.86%)

However, it's not a universal gain and there are also fairness issues.
For example, this is a fio configuration with a single random reader
and a single random writer on the same machine

fio Throughput
                                  4.12.0             4.12.0             4.12.0
                              legacy-cfq             mq-bfq        mq-bfq-tput
Hmean     kb/sec-writer-write   398.15 (  0.00%)  4659.18 (1070.21%)  4934.52 (1139.37%)
Hmean     kb/sec-reader-read    507.00 (  0.00%)    66.36 ( -86.91%)    14.68 ( -97.10%)

With CFQ, there is some fairness between the readers and writers and
with BFQ, there is a strong preference to writers. Again, this is not
universal.
It'll be a mix and sometimes it'll be classed as a gain and sometimes a
regression. While I accept that BFQ can be tuned, tuning IO schedulers
is not something that normal users get right and they'll only look at
"out of box" performance which, right now, will trigger bug reports.
This is neither good nor bad, it simply is.

> This configuration is the default
> one because the motivation for yet-another-scheduler as BFQ is that it
> drastically reduces latency for interactive and soft real-time tasks
> (e.g., opening an app or playing/streaming a video), when there is
> some background I/O. Low-latency heuristics are willing to sacrifice
> throughput when this provides a large benefit in terms of the above
> latency.
>

I had seen this assertion so one of the fio configurations had multiple
heavy writers in the background and a random reader of small files to
simulate that scenario. The intent was to simulate heavy IO in the
presence of application startup

                                      4.12.0             4.12.0             4.12.0
                                  legacy-cfq             mq-bfq        mq-bfq-tput
Hmean     kb/sec-writer-write    1997.75 (   0.00%)    2035.65 (   1.90%)    2014.50 (   0.84%)
Hmean     kb/sec-reader-read      128.50 (   0.00%)      79.46 ( -38.16%)      12.78 ( -90.06%)

Write throughput is steady-ish across each IO scheduler but readers get
starved badly which I expect would slow application startup and
disabling low_latency makes it much worse. The mmtests configuration in
question is global-dhp__io-fio-randread-sync-heavywrite, albeit edited
to create a fresh XFS filesystem on a test partition. This is not
exactly equivalent to real application startup but that can be
difficult to quantify properly.

> Of course, BFQ may not be optimal for every workload, even if
> low-latency mode is switched off. In addition, there may still be
> some bug. I'll repeat your tests on a machine of mine ASAP.
>

The intent here is not to rag on BFQ because I know it's going to have
some wins and some losses and will take time to fix up.
The primary intent was to flag that 4.13 might have some "blah blah blah
is slower on 4.13" reports due to the switching of defaults that will
bisect to a misleading commit.

-- 
Mel Gorman
SUSE Labs
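(Aside: the "out of box" versus tuned comparison above comes down to two sysfs knobs: the scheduler selection and BFQ's low_latency tunable. The sketch below shows the idea; "sda" is a placeholder device, paths are as in 4.13-era kernels, and the sysfs writes are commented out so nothing here claims to be the exact mmtests tuning.)

```shell
#!/bin/sh
# Sketch: select BFQ and toggle its low-latency heuristics via sysfs.
# Assumptions: blk-mq is in use and BFQ is built; "sda" is a placeholder.
DEV=sda

# The scheduler file brackets the active scheduler, e.g.
# "mq-deadline kyber [bfq] none". A small helper to pull it out:
active_sched() {
    echo "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Uncomment on a real system (needs root):
# echo bfq > /sys/block/$DEV/queue/scheduler
# echo 0   > /sys/block/$DEV/queue/iosched/low_latency   # "mq-bfq-tput" style
# active_sched "$(cat /sys/block/$DEV/queue/scheduler)"

active_sched "mq-deadline kyber [bfq] none"   # prints "bfq"
```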
Re: Switching to MQ by default may generate some bug reports
On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
> On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman wrote:
> > On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> >> Hi Mel Gorman,
> >>
> >> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman wrote:
> >> > Hi Christoph,
> >> >
> >> > I know the reasons for switching to MQ by default but just be aware that it's
> >> > not without hazards albeit the biggest issues I've seen are switching
> >> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
> >> > running every few weeks searching for regressions. Yesterday, it noticed
> >> > that creating some work files for a postgres simulator called pgioperf
> >> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> >> > linearly writing two files for testing on another benchmark and is not
> >> > remarkable. The relevant part of the report is
> >>
> >> We saw some SCSI-MQ performance issue too, please see if the following
> >> patchset fixes your issue:
> >>
> >> http://marc.info/?l=linux-block=150151989915776=2
> >>
> >
> > That series is dealing with problems with legacy-deadline vs mq-none
> > whereas the bulk of the problems reported in this mail are related to
> > legacy-CFQ vs mq-BFQ.
>
> The series deals with none and all mq schedulers, and you can see
> the improvement on mq-deadline in the cover letter, :-)
>

Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ
that was not observed on other schedulers?

-- 
Mel Gorman
SUSE Labs
Re: Switching to MQ by default may generate some bug reports
On Thu, Aug 03, 2017 at 11:44:06AM +0200, Paolo Valente wrote:
> > That series is dealing with problems with legacy-deadline vs mq-none
> > whereas the bulk of the problems reported in this mail are related to
> > legacy-CFQ vs mq-BFQ.
>
> Out of curiosity: do you get no regression with mq-none or mq-deadline?
>

I didn't test mq-none as the underlying storage was not fast enough to
make a legacy-noop vs mq-none comparison meaningful. legacy-deadline vs
mq-deadline did show small regressions on some workloads, but they were
not as dramatic and small enough that they would go unnoticed in some
cases.

-- 
Mel Gorman
SUSE Labs
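(For these legacy-vs-mq comparisons the I/O path has to be chosen at boot. A rough sketch of how one might check and flip it; the module parameter is the option the scsi-mq commit says is "kept around for now", but treat the exact paths and scheduler lists as assumptions about 4.12/4.13-era kernels.)

```shell
#!/bin/sh
# Sketch: report which SCSI I/O path a machine is using and which
# scheduler family that implies. Names/paths assumed from 4.12/4.13.

io_path() {
    # Map the scsi_mod.use_blk_mq value to the scheduler family it selects
    case "$1" in
        Y|y|1) echo "blk-mq (mq-deadline, bfq, kyber, none)" ;;
        *)     echo "legacy (cfq, deadline, noop)" ;;
    esac
}

# On a real system:
# io_path "$(cat /sys/module/scsi_mod/parameters/use_blk_mq)"
# To test the legacy path, boot with scsi_mod.use_blk_mq=0 on the kernel
# command line and re-run the same benchmark configuration.

io_path Y
```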
Re: Switching to MQ by default may generate some bug reports
On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman wrote:
> On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> Hi Mel Gorman,
>>
>> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman wrote:
>> > Hi Christoph,
>> >
>> > I know the reasons for switching to MQ by default but just be aware that it's
>> > not without hazards albeit the biggest issues I've seen are switching
>> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> > running every few weeks searching for regressions. Yesterday, it noticed
>> > that creating some work files for a postgres simulator called pgioperf
>> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> > linearly writing two files for testing on another benchmark and is not
>> > remarkable. The relevant part of the report is
>>
>> We saw some SCSI-MQ performance issue too, please see if the following
>> patchset fixes your issue:
>>
>> http://marc.info/?l=linux-block=150151989915776=2
>>
>
> That series is dealing with problems with legacy-deadline vs mq-none
> whereas the bulk of the problems reported in this mail are related to
> legacy-CFQ vs mq-BFQ.

The series deals with none and all mq schedulers, and you can see
the improvement on mq-deadline in the cover letter, :-)

Thanks,
Ming Lei
Re: Switching to MQ by default may generate some bug reports
On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> Hi Mel Gorman,
>
> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman wrote:
> > Hi Christoph,
> >
> > I know the reasons for switching to MQ by default but just be aware that it's
> > not without hazards albeit the biggest issues I've seen are switching
> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
> > running every few weeks searching for regressions. Yesterday, it noticed
> > that creating some work files for a postgres simulator called pgioperf
> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> > linearly writing two files for testing on another benchmark and is not
> > remarkable. The relevant part of the report is
>
> We saw some SCSI-MQ performance issue too, please see if the following
> patchset fixes your issue:
>
> http://marc.info/?l=linux-block=150151989915776=2
>

That series is dealing with problems with legacy-deadline vs mq-none
whereas the bulk of the problems reported in this mail are related to
legacy-CFQ vs mq-BFQ.

-- 
Mel Gorman
SUSE Labs
Re: Switching to MQ by default may generate some bug reports
On Thu, Aug 3, 2017 at 5:17 PM, Ming Lei wrote:
> Hi Mel Gorman,
>
> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman wrote:
>> Hi Christoph,
>>
>> I know the reasons for switching to MQ by default but just be aware that it's
>> not without hazards albeit the biggest issues I've seen are switching
>> CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> running every few weeks searching for regressions. Yesterday, it noticed
>> that creating some work files for a postgres simulator called pgioperf
>> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> linearly writing two files for testing on another benchmark and is not
>> remarkable. The relevant part of the report is
>
> We saw some SCSI-MQ performance issue too, please see if the following
> patchset fixes your issue:
>
> http://marc.info/?l=linux-block=150151989915776=2

BTW, the above patches (V1) can be found in the following tree:

https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V1

V2 has already been done but not posted out yet, because the performance
test on SRP isn't completed:

https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V2

Thanks,
Ming Lei
Re: Switching to MQ by default may generate some bug reports
> On 3 Aug 2017, at 10:51, Mel Gorman wrote:
>
> Hi Christoph,
>
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards albeit the biggest issues I've seen are switching
> CFQ to BFQ. On my home grid, there is some experimental automatic testing
> running every few weeks searching for regressions. Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> linearly writing two files for testing on another benchmark and is not
> remarkable. The relevant part of the report is
>
> Last good/First bad commit
> ==========================
> Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
> First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa
>
> From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig
> Date: Fri, 16 Jun 2017 10:27:55 +0200
> Subject: [PATCH] scsi: default to scsi-mq
>
> Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
> path now that we had plenty of testing, and have I/O schedulers for
> blk-mq. The module option to disable the blk-mq path is kept around for
> now.
>
> Signed-off-by: Christoph Hellwig
> Signed-off-by: Martin K. Petersen
>
>  drivers/scsi/Kconfig | 11 ---
>  drivers/scsi/scsi.c  |  4
>  2 files changed, 15 deletions(-)
>
> Comparison
> ==========
>                        initial              initial                 last                penup                first
>                     good-v4.12     bad-16f73eb02d7e        good-6d311fa7        good-d06c587d         bad-5c279bd9
> User    min           0.06 (   0.00%)       0.14 (-133.33%)       0.14 (-133.33%)       0.06 (   0.00%)       0.19 (-216.67%)
> User    mean          0.06 (   0.00%)       0.14 (-133.33%)       0.14 (-133.33%)       0.06 (   0.00%)       0.19 (-216.67%)
> User    stddev        0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
> User    coeffvar      0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
> User    max           0.06 (   0.00%)       0.14 (-133.33%)       0.14 (-133.33%)       0.06 (   0.00%)       0.19 (-216.67%)
> System  min          10.04 (   0.00%)      10.75 (  -7.07%)      10.05 (  -0.10%)      10.16 (  -1.20%)      10.73 (  -6.87%)
> System  mean         10.04 (   0.00%)      10.75 (  -7.07%)      10.05 (  -0.10%)      10.16 (  -1.20%)      10.73 (  -6.87%)
> System  stddev        0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
> System  coeffvar      0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
> System  max          10.04 (   0.00%)      10.75 (  -7.07%)      10.05 (  -0.10%)      10.16 (  -1.20%)      10.73 (  -6.87%)
> Elapsed min         251.53 (   0.00%)     351.05 ( -39.57%)     252.83 (  -0.52%)     252.96 (  -0.57%)     347.93 ( -38.33%)
> Elapsed mean        251.53 (   0.00%)     351.05 ( -39.57%)     252.83 (  -0.52%)     252.96 (  -0.57%)     347.93 ( -38.33%)
> Elapsed stddev        0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
> Elapsed coeffvar      0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
> Elapsed max         251.53 (   0.00%)     351.05 ( -39.57%)     252.83 (  -0.52%)     252.96 (  -0.57%)     347.93 ( -38.33%)
> CPU     min           4.00 (   0.00%)       3.00 (  25.00%)       4.00 (   0.00%)       4.00 (   0.00%)       3.00 (  25.00%)
> CPU     mean          4.00 (   0.00%)       3.00 (  25.00%)       4.00 (   0.00%)       4.00 (   0.00%)       3.00 (  25.00%)
> CPU     stddev        0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
> CPU     coeffvar      0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
> CPU     max           4.00 (   0.00%)       3.00 (  25.00%)       4.00 (   0.00%)       4.00 (   0.00%)       3.00 (  25.00%)
>
> The "Elapsed mean" line is what the testing and auto-bisection was paying
> attention to. Commit 16f73eb02d7e is simply the head commit at the time
> the continuous testing started. The first "bad commit" is the last column.
>
> It's not the only slowdown that has been observed from other testing when
> examining whether it's ok to switch to MQ by default. The biggest slowdown
> observed was with a modified version of
Re: Switching to MQ by default may generate some bug reports
Hi Mel Gorman,

On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman wrote:
> Hi Christoph,
>
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards albeit the biggest issues I've seen are switching
> CFQ to BFQ. On my home grid, there is some experimental automatic testing
> running every few weeks searching for regressions. Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> linearly writing two files for testing on another benchmark and is not
> remarkable. The relevant part of the report is

We saw some SCSI-MQ performance issue too, please see if the following
patchset fixes your issue:

http://marc.info/?l=linux-block=150151989915776=2

Thanks,
Ming
Switching to MQ by default may generate some bug reports
Hi Christoph,

I know the reasons for switching to MQ by default but just be aware that it's
not without hazards albeit the biggest issues I've seen are switching
CFQ to BFQ. On my home grid, there is some experimental automatic testing
running every few weeks searching for regressions. Yesterday, it noticed
that creating some work files for a postgres simulator called pgioperf
was 38.33% slower and it auto-bisected to the switch to MQ. This is just
linearly writing two files for testing on another benchmark and is not
remarkable. The relevant part of the report is

Last good/First bad commit
==========================
Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa

From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Fri, 16 Jun 2017 10:27:55 +0200
Subject: [PATCH] scsi: default to scsi-mq

Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
path now that we had plenty of testing, and have I/O schedulers for
blk-mq. The module option to disable the blk-mq path is kept around for
now.

Signed-off-by: Christoph Hellwig
Signed-off-by: Martin K. Petersen

 drivers/scsi/Kconfig | 11 ---
 drivers/scsi/scsi.c  |  4
 2 files changed, 15 deletions(-)

Comparison
==========
                       initial              initial                 last                penup                first
                    good-v4.12     bad-16f73eb02d7e        good-6d311fa7        good-d06c587d         bad-5c279bd9
User    min           0.06 (   0.00%)       0.14 (-133.33%)       0.14 (-133.33%)       0.06 (   0.00%)       0.19 (-216.67%)
User    mean          0.06 (   0.00%)       0.14 (-133.33%)       0.14 (-133.33%)       0.06 (   0.00%)       0.19 (-216.67%)
User    stddev        0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
User    coeffvar      0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
User    max           0.06 (   0.00%)       0.14 (-133.33%)       0.14 (-133.33%)       0.06 (   0.00%)       0.19 (-216.67%)
System  min          10.04 (   0.00%)      10.75 (  -7.07%)      10.05 (  -0.10%)      10.16 (  -1.20%)      10.73 (  -6.87%)
System  mean         10.04 (   0.00%)      10.75 (  -7.07%)      10.05 (  -0.10%)      10.16 (  -1.20%)      10.73 (  -6.87%)
System  stddev        0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
System  coeffvar      0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
System  max          10.04 (   0.00%)      10.75 (  -7.07%)      10.05 (  -0.10%)      10.16 (  -1.20%)      10.73 (  -6.87%)
Elapsed min         251.53 (   0.00%)     351.05 ( -39.57%)     252.83 (  -0.52%)     252.96 (  -0.57%)     347.93 ( -38.33%)
Elapsed mean        251.53 (   0.00%)     351.05 ( -39.57%)     252.83 (  -0.52%)     252.96 (  -0.57%)     347.93 ( -38.33%)
Elapsed stddev        0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
Elapsed coeffvar      0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
Elapsed max         251.53 (   0.00%)     351.05 ( -39.57%)     252.83 (  -0.52%)     252.96 (  -0.57%)     347.93 ( -38.33%)
CPU     min           4.00 (   0.00%)       3.00 (  25.00%)       4.00 (   0.00%)       4.00 (   0.00%)       3.00 (  25.00%)
CPU     mean          4.00 (   0.00%)       3.00 (  25.00%)       4.00 (   0.00%)       4.00 (   0.00%)       3.00 (  25.00%)
CPU     stddev        0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
CPU     coeffvar      0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)       0.00 (   0.00%)
CPU     max           4.00 (   0.00%)       3.00 (  25.00%)       4.00 (   0.00%)       4.00 (   0.00%)       3.00 (  25.00%)

The "Elapsed mean" line is what the testing and auto-bisection was paying
attention to. Commit 16f73eb02d7e is simply the head commit at the time
the continuous testing started. The first "bad commit" is the last column.

It's not the only slowdown that has been observed from other testing when
examining whether it's ok to switch to MQ by default. The biggest slowdown
observed was with a modified version of dbench4 -- the modifications use
shorter, but representative, load files to avoid timing artifacts and
reports time to complete a load file instead of throughput as throughput
is kind of meaningless for dbench4

dbench4 Loadfile Execution Time