[PATCH] bcache: writeback rate shouldn't artificially clamp

2017-10-07 Thread Michael Lyle
The previous code artificially limited writeback rate to 1000000
blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast
hardware.  The rate limiting code works fine (though with decreased
precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC.

Additionally, ensure that uint32_t is used as a type for rate throughout
the rate management so that type checking/clamp_t can work properly.

bch_next_delay should be rewritten for increased precision and better
handling of high rates and long sleep periods, but this is adequate for
now.

Signed-off-by: Michael Lyle 
Reported-by: Coly Li 
---
 drivers/md/bcache/util.h      | 4 ++--
 drivers/md/bcache/writeback.c | 7 ++++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index cb8d2ccbb6c6..8f509290bb02 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -441,10 +441,10 @@ struct bch_ratelimit {
 	uint64_t		next;
 
 	/*
-	 * Rate at which we want to do work, in units per nanosecond
+	 * Rate at which we want to do work, in units per second
 	 * The units here correspond to the units passed to bch_next_delay()
 	 */
-	unsigned		rate;
+	uint32_t		rate;
 };
 
 static inline void bch_ratelimit_reset(struct bch_ratelimit *d)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 3f8c4c6bee03..a1608c45bb12 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -46,7 +46,8 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	int64_t error = dirty - target;
 	int64_t proportional_scaled =
 		div_s64(error, dc->writeback_rate_p_term_inverse);
-	int64_t integral_scaled, new_rate;
+	int64_t integral_scaled;
+	uint32_t new_rate;
 
 	if ((error < 0 && dc->writeback_rate_integral > 0) ||
 	    time_before64(local_clock(),
@@ -67,8 +68,8 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	integral_scaled = div_s64(dc->writeback_rate_integral,
 				  dc->writeback_rate_i_term_inverse);
 
-	new_rate = clamp_t(int64_t, (proportional_scaled + integral_scaled),
-			   dc->writeback_rate_minimum, NSEC_PER_MSEC);
+	new_rate = clamp_t(uint32_t, (proportional_scaled + integral_scaled),
+			   dc->writeback_rate_minimum, NSEC_PER_SEC);
 
 	dc->writeback_rate_proportional = proportional_scaled;
 	dc->writeback_rate_integral_scaled = integral_scaled;
-- 
2.11.0



Re: [PATCH v3 2/5] bcache: implement PI controller for writeback rate

2017-10-07 Thread Coly Li
On 2017/10/8 12:57 PM, Michael Lyle wrote:
> Coly--
> 
> 
> On 10/07/2017 09:22 PM, Coly Li wrote:
> [snip]
>> rate:    488.2M/sec
>> dirty:    91.7G
>> target:    152.3G
>> proportional:    -1.5G
>> integral:    10.9G
>> change:    0.0k/sec
>> next io:    0ms
> [snip]
> 
>> The backing cached device size is 7.2TB, cache device is 1.4TB, block
>> size is 8kB only. I write 700G (50% of cache device size) dirty data
>> onto the cache device, and start writeback by echo 1 to
>> writeback_running file.
>>
>> In my test, writeback spent 89 minutes to decrease dirty number from
>> 700G to 147G (dirty target number is 152G). At this moment writeback
>> rate was still displayed as 488.2M/sec. And after 22 minutes writeback
>> rate jumped to 4.0k/sec. During the 22 minutes, (147-15.8=) 131.2G dirty
>> data written out.
> 
> I see it-- if we can write faster than 488MB/sec, we inappropriately
> clamp the write rate to 488MB/sec-- this is from the old code.  In turn,
> if we're keeping up at that speed, the integral term can wind up.  I
> will fix this and clean up a couple related things.
> 
>> Is it as expected ?
> 
> It is supposed to overshoot the target, but not by this much.
> 
> I think the implementation must have changed at some point in the past
> for bch_ratelimit, because the clamping doesn't match. bch_ratelimit
> really needs a rewrite for other reasons, but I'll send the minimal fix
> now.

Hi Mike,

Copied, I will continue my test. And when the new version comes, I will
run the same workload again to confirm.

Thanks.


-- 
Coly Li


Re: [PATCH v3 2/5] bcache: implement PI controller for writeback rate

2017-10-07 Thread Michael Lyle

Coly--


On 10/07/2017 09:22 PM, Coly Li wrote:
> [snip]
>
> rate:		488.2M/sec
> dirty:		91.7G
> target:		152.3G
> proportional:	-1.5G
> integral:	10.9G
> change:		0.0k/sec
> next io:	0ms
>
> [snip]
>
> The backing cached device size is 7.2TB, cache device is 1.4TB, block
> size is 8kB only. I write 700G (50% of cache device size) dirty data
> onto the cache device, and start writeback by echo 1 to
> writeback_running file.
>
> In my test, writeback spent 89 minutes to decrease dirty number from
> 700G to 147G (dirty target number is 152G). At this moment writeback
> rate was still displayed as 488.2M/sec. And after 22 minutes writeback
> rate jumped to 4.0k/sec. During the 22 minutes, (147-15.8=) 131.2G dirty
> data written out.


I see it-- if we can write faster than 488MB/sec, we inappropriately 
clamp the write rate to 488MB/sec-- this is from the old code.  In turn, 
if we're keeping up at that speed, the integral term can wind up.  I 
will fix this and clean up a couple related things.



> Is it as expected ?


It is supposed to overshoot the target, but not by this much.

I think the implementation must have changed at some point in the past 
for bch_ratelimit, because the clamping doesn't match. bch_ratelimit 
really needs a rewrite for other reasons, but I'll send the minimal fix now.




Thanks.



Mike


Re: [PATCH v3 2/5] bcache: implement PI controller for writeback rate

2017-10-07 Thread Coly Li
On 2017/9/28 1:41 AM, Michael Lyle wrote:
> bcache uses a control system to attempt to keep the amount of dirty data
> in cache at a user-configured level, while not responding excessively to
> transients and variations in write rate.  Previously, the system was a
> PD controller; but the output from it was integrated, turning the
> Proportional term into an Integral term, and turning the Derivative term
> into a crude Proportional term.  Performance of the controller has been
> uneven in production, and it has tended to respond slowly, oscillate,
> and overshoot.
> 
> This patch set replaces the current control system with an explicit PI
> controller and tuning that should be correct for most hardware.  By
> default, it attempts to write at a rate that would retire 1/40th of the
> current excess blocks per second.  An integral term in turn works to
> remove steady state errors.
> 
> IMO, this yields benefits in simplicity (removing weighted average
> filtering, etc) and system performance.
> 
> Another small change is a tunable parameter is introduced to allow the
> user to specify a minimum rate at which dirty blocks are retired.
> 
> There is a slight difference from earlier versions of the patch in
> integral handling to prevent excessive negative integral windup.
> 
> Signed-off-by: Michael Lyle 
> Reviewed-by: Coly Li 


Hi Mike,

I have been testing all 5 patches these days for writeback performance.
I find that when the dirty number is much smaller than the dirty target,
the writeback rate still stays at the maximum of 488.2M/sec.

Here is part of the output:

rate:   488.2M/sec
dirty:  91.7G
target: 152.3G
proportional:   -1.5G
integral:   10.9G
change: 0.0k/sec
next io:0ms



rate:   488.2M/sec
dirty:  85.3G
target: 152.3G
proportional:   -1.6G
integral:   10.6G
change: 0.0k/sec
next io:-7ms



rate:   488.2M/sec
dirty:  79.3G
target: 152.3G
proportional:   -1.8G
integral:   10.1G
change: 0.0k/sec
next io:-26ms



rate:   488.2M/sec
dirty:  73.1G
target: 152.3G
proportional:   -1.9G
integral:   9.7G
change: 0.0k/sec
next io:-1ms



rate:   488.2M/sec
dirty:  66.9G
target: 152.3G
proportional:   -2.1G
integral:   9.2G
change: 0.0k/sec
next io:-66ms



rate:   488.2M/sec
dirty:  61.1G
target: 152.3G
proportional:   -2.2G
integral:   8.7G
change: 0.0k/sec
next io:-6ms



rate:   488.2M/sec
dirty:  55.6G
target: 152.3G
proportional:   -2.4G
integral:   8.1G
change: 0.0k/sec
next io:-5ms



rate:   488.2M/sec
dirty:  49.4G
target: 152.3G
proportional:   -2.5G
integral:   7.5G
change: 0.0k/sec
next io:0ms



rate:   488.2M/sec
dirty:  43.1G
target: 152.3G
proportional:   -2.7G
integral:   7.0G
change: 0.0k/sec
next io:-1ms



rate:   488.2M/sec
dirty:  37.3G
target: 152.3G
proportional:   -2.8G
integral:   6.3G
change: 0.0k/sec
next io:-2ms



rate:   488.2M/sec
dirty:  31.7G
target: 152.3G
proportional:   -3.0G
integral:   5.6G
change: 0.0k/sec
next io:-17ms

The backing cached device size is 7.2TB, cache device is 1.4TB, block
size is 8kB only. I write 700G (50% of cache device size) dirty data
onto the cache device, and start writeback by echo 1 to
writeback_running file.

In my test, writeback spent 89 minutes to decrease dirty number from
700G to 147G (dirty target number is 152G). At this moment writeback
rate was still displayed as 488.2M/sec. And after 22 minutes writeback
rate jumped to 4.0k/sec. During the 22 minutes, (147-15.8=) 131.2G dirty
data written out.

Is it as expected ?

Thanks.

-- 
Coly Li


Re: [PATCH] bcache: rewrite multiple partitions support

2017-10-07 Thread Coly Li
On 2017/10/8 5:42 AM, Michael Lyle wrote:
> Indeed, this fixes a bug where minors and indices are interchanged on
> cleanup-- good catch.  Factoring out the index conversion is nice.
> 

Hi Mike,

> Hopefully no one is adversely affected by the stride change (16->128)
> and different naming, but they should be using udev + labels/uuids
> anyways.
> 

Yes, I hope so, too.

> Reviewed-by: Michael Lyle 
> 

Thanks for the review. We have one more code reviewer, yeah!

Coly Li


Re: [PATCH] bcache: check ca->alloc_thread initialized before wake up it

2017-10-07 Thread Michael Lyle
On Tue, Sep 19, 2017 at 3:11 PM, Coly Li  wrote:
> In bcache code, sysfs entries are created before all resources get
> allocated, e.g. the allocation thread of a cache set.
>
> There is a possibility of a NULL pointer dereference if a resource is
> accessed before it is initialized. Indeed Jorg Bornschein caught one on
> the cache set allocation thread and got a kernel oops.
> Signed-off-by: Coly Li 
> Reported-by: Jorg Bornschein 
> Cc: Kent Overstreet 
> Cc: sta...@vger.kernel.org

Reviewed-by: Michael Lyle 


Re: [PATCH] bcache: rewrite multiple partitions support

2017-10-07 Thread Michael Lyle
Indeed, this fixes a bug where minors and indices are interchanged on
cleanup-- good catch.  Factoring out the index conversion is nice.

Hopefully no one is adversely affected by the stride change (16->128)
and different naming, but they should be using udev + labels/uuids
anyways.

Reviewed-by: Michael Lyle 


[PATCH] bfq: Fix bool initialization/comparison

2017-10-07 Thread Thomas Meyer
Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer 
---

diff -u -p a/block/bfq-iosched.c b/block/bfq-iosched.c
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4986,7 +4986,7 @@ static ssize_t bfq_low_latency_store(str
 
 	if (__data > 1)
 		__data = 1;
-	if (__data == 0 && bfqd->low_latency != 0)
+	if (__data == 0 && bfqd->low_latency)
 		bfq_end_wr(bfqd);
 	bfqd->low_latency = __data;
 


Re: [PATCH v4] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-10-07 Thread weiping zhang
On Sat, Sep 30, 2017 at 10:57:31PM +0800, weiping zhang wrote:
> On Fri, Sep 22, 2017 at 11:36:28PM +0800, weiping zhang wrote:
> > If blk-mq uses the "none" io scheduler, nr_requests gets a wrong value
> > when a number > tag_set->queue_depth is written to it.
> > blk_mq_tag_update_depth will take the smaller value, min(nr,
> > set->queue_depth), and q->nr_requests then keeps the wrong value.
> > 
> > Reproduce:
> > 
> > echo none > /sys/block/nvme0n1/queue/scheduler
> > echo 100 > /sys/block/nvme0n1/queue/nr_requests
> > cat /sys/block/nvme0n1/queue/nr_requests
> > 100
> > 
> > Signed-off-by: weiping zhang 
> > ---
> > 
> > Changes since v4:
> >  * fix typo in commit message(queue/ioscheduler => queue/scheduler)
> > 
> > Changes since v3:
> >  * remove compare nr with tags->qdepth, pass nr to blk_mq_tag_update_depth
> > directly
> > 
> >  * remove return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ
> > 
> > Changes since v2:
> >  * add return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ
> >  * remove pr_warn, and return EINVAL, if input number is too large
> > 
> >  block/blk-mq.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 98a1860..491e336 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2642,8 +2642,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> >  		 * queue depth. This is similar to what the old code would do.
> >  		 */
> >  		if (!hctx->sched_tags) {
> > -			ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> > -							min(nr, set->queue_depth),
> > +			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
> >  							false);
> >  		} else {
> >  			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> > -- 
> > 2.9.4
> > 
> 
> Hi Jens,
> 
> As you say before:
> > blk_mq_tag_update_depth() should already return
> > -EINVAL for the case where we can't grow the tags. Looks like this patch
> > should simply remove the min(nr, set->queue_depth) and just pass in 'nr'.
> I also find that hctx->tags->nr_tags is equal to tag_set->queue_depth
> whether hctx->sched_tags is used or not. So I think it's safe to validate
> 'nr' against hctx->tags->nr_tags.
> 
Hello Jens,

ping