Re: testing io.low limit for blk-throttle

2018-04-26 Thread Joseph Qi
Hi Paolo,

On 18/4/27 01:27, Paolo Valente wrote:
> 
> 
>> Il giorno 25 apr 2018, alle ore 14:13, Joseph Qi <jiangqi...@gmail.com> ha 
>> scritto:
>>
>> Hi Paolo,
>>
> 
> Hi Joseph
> 
>> ...
>> Could you run blktrace as well when testing your case? There are several
>> throtl traces to help analyze whether it is caused by frequently
>> upgrade/downgrade.
> 
> Certainly.  You can find a trace attached.  Unfortunately, I'm not
> familiar with the internals of blk-throttle and low limit, so, if you
> want me to analyze the trace, give me some hints on what I have to
> look for.  Otherwise, I'll be happy to learn from your analysis.
> 

I've taken a glance at the blktrace you attached. It upgrades only once at the
beginning and then downgrades frequently (just adjusting the limit, not dropping
to LIMIT_LOW).
But I don't know why it always considers the throttle group not idle.

For example:
fio-2336  [004] d...   428.458249:   8,16   m   N throtl avg_idle=90, 
idle_threshold=1000, bad_bio=10, total_bio=84, is_idle=0, scale=9
fio-2336  [004] d...   428.458251:   8,16   m   N throtl downgrade, scale 4

In throtl_tg_is_idle():
is_idle = ... ||
(tg->latency_target && tg->bio_cnt &&
 tg->bad_bio_cnt * 5 < tg->bio_cnt);

It should be treated as idle and allowed to run more bandwidth. But here the
result shows not idle (is_idle=0). I have to do more investigation to figure out
why.

You can also filter these logs using:
grep throtl trace | grep -E 'upgrade|downgrade|is_idle'
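
To make this concrete, here is a minimal userspace sketch (not kernel code; the
field names just mirror the trace line, and latency_target is assumed non-zero
since a latency target is configured) that plugs the logged values into the
quoted clause:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the fields printed in the throtl trace line above. */
struct tg_sample {
        long avg_idle;        /* avg_idle=90         */
        long idle_threshold;  /* idle_threshold=1000 */
        int  bad_bio_cnt;     /* bad_bio=10          */
        int  bio_cnt;         /* total_bio=84        */
        long latency_target;  /* assumed non-zero    */
};

/* The latency clause of throtl_tg_is_idle() quoted above. */
static bool latency_clause_idle(const struct tg_sample *tg)
{
        return tg->latency_target && tg->bio_cnt &&
               tg->bad_bio_cnt * 5 < tg->bio_cnt;
}

int main(void)
{
        struct tg_sample tg = {
                .avg_idle = 90, .idle_threshold = 1000,
                .bad_bio_cnt = 10, .bio_cnt = 84, .latency_target = 1,
        };

        /* 10 * 5 = 50 < 84, so this clause alone reports idle. */
        printf("latency clause -> is_idle=%d\n", latency_clause_idle(&tg));
        return 0;
}

With these numbers the clause alone evaluates to idle, which is exactly why the
is_idle=0 in the trace above looks wrong to me.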

Thanks,
Joseph


Re: testing io.low limit for blk-throttle

2018-04-26 Thread Joseph Qi
Hi Jianchao,

On 18/4/27 10:09, jianchao.wang wrote:
> Hi Tejun and Joseph
> 
> On 04/27/2018 02:32 AM, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Apr 24, 2018 at 02:12:51PM +0200, Paolo Valente wrote:
>>> +Tejun (I guess he might be interested in the results below)
>>
>> Our experiments didn't work out too well either.  At this point, it
>> isn't clear whether io.low will ever leave experimental state.  We're
>> trying to find a working solution.
> 
> Would you please take a look at the following two patches.
> 
> https://marc.info/?l=linux-block&m=152325456307423&w=2
> https://marc.info/?l=linux-block&m=152325457607425&w=2
> 
> In addition, when I tested blk-throtl io.low on an NVMe card, even if the
> iops has been lower than the io.low limit for a while, the downgrade always
> fails because of the idle check below:
> 
>    tg->latency_target && tg->bio_cnt &&
>       tg->bad_bio_cnt * 5 < tg->bio_cnt
> 

I'm afraid the latency check is a must for io.low, because from my tests the
idle time check only applies to simple scenarios.

Yes, in some cases last_low_overflow_time does have problems.
As for the issue of not downgrading properly, I've also posted two patches
before, which are still waiting for Shaohua's review. You can also give them a
try.

https://patchwork.kernel.org/patch/10177185/
https://patchwork.kernel.org/patch/10177187/

Thanks,
Joseph

> the latency always looks fine even when the sum of the two groups' iops has
> reached the top.
> So I disabled this check in my test, and with the 2 patches above, io.low
> basically works.
> 
> My NVMe card's max bps is ~600M, and max iops is ~160k.
> Here is my config
> io.low riops=5 wiops=5 rbps=209715200 wbps=209715200 idle=200 
> latency=10
> io.max riops=15
> There are two cgroups in my test, both of them have same config.
> 
> In addition, I say "basically works" because the iops of the two cgroups
> will jump up and down.
> For example, when I launched one fio test per cgroup, the iops waved as
> follows:
> 
> group0   30k  50k   70k   60k  40k
> group1   120k 100k  80k   90k  110k
> 
> However, if I launched two fio tests in only one cgroup, the iops of the two
> tests could stay at about 70k~80k.
> 
> Could you help to explain this scenario?
> 
> Thanks in advance
> Jianchao
> 


Re: testing io.low limit for blk-throttle

2018-04-25 Thread Joseph Qi
Hi Paolo,

On 18/4/24 20:12, Paolo Valente wrote:
> 
> 
>> Il giorno 23 apr 2018, alle ore 11:01, Joseph Qi <jiangqi...@gmail.com> ha 
>> scritto:
>>
>>
>>
>> On 18/4/23 15:35, Paolo Valente wrote:
>>>
>>>
>>>> Il giorno 23 apr 2018, alle ore 08:05, Joseph Qi <jiangqi...@gmail.com> ha 
>>>> scritto:
>>>>
>>>> Hi Paolo,
>>>
>>> Hi Joseph,
>>> thanks for chiming in.
>>>
>>>> What's your idle and latency config?
>>>
>>> I didn't set them at all, as the only (explicit) requirement in my
>>> basic test is that one of the groups is guaranteed a minimum bps.
>>>
>>>
>>>> IMO, io.low will allow others to run more bandwidth if a cgroup's average
>>>> idle time is high or its latency is low.
>>>
>>> What you say here makes me think that I simply misunderstood the
>>> purpose of io.low.  So, here is my problem/question: "I only need to
>>> guarantee at least a minimum bandwidth, in bps, to a group.  Is the
>>> io.low limit the way to go?"
>>>
>>> I know that I can use just io.max (unless I misunderstood the goal of
>>> io.max too :( ), but my extra purpose would be to not waste bandwidth
>>> when some group is idle.  Yet, as for now, io.low is not working even
>>> for the first, simpler goal, i.e., guaranteeing a minimum bandwidth to
>>> one group when all groups are active.
>>>
>>> Am I getting something wrong?
>>>
>>> Otherwise, if there are some special values for idle and latency
>>> parameters that would make throttle work for my test, I'll be of
>>> course happy to try them.
>>>
>> I think you can try an idle time of 1000us for all cgroups, and a latency
>> target of 100us for the cgroup with low limit 100MB/s and 2000us for the
>> cgroups with low limit 10MB/s. That means the cgroup with the lower latency
>> target will be preferred.
>> BTW, from my experience the parameters are not easy to set because
>> they are strongly correlated to the cgroup IO behavior.
>>
> 
> +Tejun (I guess he might be interested in the results below)
> 
> Hi Joseph,
> thanks for chiming in. Your suggestion did work!
> 
> At first, I thought I had also understood the use of latency from the
> outcome of your suggestion: "want low limit really guaranteed for a
> group?  set target latency to a low value for it." But then, as a
> crosscheck, I repeated the same exact test, but reversing target
> latencies: I gave 2000 to the interfered (the group with 100MB/s
> limit) and 100 to the interferers.  And the interfered still got more
> than 100MB/s!  So I exaggerated: 2 to the interfered.
> Same outcome :(
> 
> I tried really many other combinations, to try to figure this out, but
> results seemed more or less random w.r.t. latency values.  I
> didn't even start to test different values for idle.
> 
> So, the only sound lesson that I seem to have learned is: if I want
> low limits to be enforced, I have to set target latency and idle
> explicitly.  The actual values of latencies matter little, or not at
> all. At least this holds for my simple tests.
> 
> At any rate, thanks to your help, Joseph, I could move to the most
> interesting part for me: how effective is blk-throttle with low
> limits?  I could well be wrong again, but my results do not seem that
> good.  With the simplest type of non-toy example I considered, I
> recorded throughput losses, apparently caused mainly by blk-throttle,
> and ranging from 64% to 75%.
> 
> Here is a worst-case example.  For each step, I'm reporting below the
> command by which you can reproduce that step with the
> thr-lat-with-interference benchmark of the S suite [1].  I just split
> bandwidth equally among five groups, on my SSD.  The device showed a
> peak rate of ~515MB/s in this test, so I set rbps to 100MB/s for each
> group (and tried various values, and combinations of values, for the
> target latency, without any effect on the results).  To begin, I made
> every group do sequential reads.  Everything worked perfectly fine.
> 
> But then I made one group do random I/O [2], and troubles began.  Even
> if the group doing random I/O was given a target latency of 100usec
> (or lower), while the other had a target latency of 2000usec, the poor
> random-I/O group got only 4.7 MB/s!  (A single process doing 4k sync
> random I/O reaches 25MB/s on my SSD.)
> 
> I guess things broke because low limits did not comply any longer with
> the lower speed that device reached with the new, mixed workload: the
> device reached 376MB/s, w

Re: testing io.low limit for blk-throttle

2018-04-23 Thread Joseph Qi


On 18/4/23 15:35, Paolo Valente wrote:
> 
> 
>> Il giorno 23 apr 2018, alle ore 08:05, Joseph Qi <jiangqi...@gmail.com> ha 
>> scritto:
>>
>> Hi Paolo,
> 
> Hi Joseph,
> thanks for chiming in.
> 
>> What's your idle and latency config?
> 
> I didn't set them at all, as the only (explicit) requirement in my
> basic test is that one of the groups is guaranteed a minimum bps.
> 
> 
>> IMO, io.low will allow others to run more bandwidth if a cgroup's average
>> idle time is high or its latency is low.
> 
> What you say here makes me think that I simply misunderstood the
> purpose of io.low.  So, here is my problem/question: "I only need to
> guarantee at least a minimum bandwidth, in bps, to a group.  Is the
> io.low limit the way to go?"
> 
> I know that I can use just io.max (unless I misunderstood the goal of
> io.max too :( ), but my extra purpose would be to not waste bandwidth
> when some group is idle.  Yet, as for now, io.low is not working even
> for the first, simpler goal, i.e., guaranteeing a minimum bandwidth to
> one group when all groups are active.
> 
> Am I getting something wrong?
> 
> Otherwise, if there are some special values for idle and latency
> parameters that would make throttle work for my test, I'll be of
> course happy to try them.
> 
I think you can try an idle time of 1000us for all cgroups, and a latency
target of 100us for the cgroup with low limit 100MB/s and 2000us for the
cgroups with low limit 10MB/s. That means the cgroup with the lower latency
target will be preferred.
BTW, from my experience the parameters are not easy to set because
they are strongly correlated to the cgroup IO behavior.
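
If it helps to reproduce my setup, here is a rough sketch in C of how such a
configuration could be written into the cgroup2 control files. Everything here
is illustrative: the /sys/fs/cgroup mount point, the group names and the 8:16
device number are assumptions for my own machine, and the byte values just
encode the 100MB/s vs 10MB/s low limits discussed above:

#include <stdio.h>
#include <stdlib.h>

/* Write one value into a cgroup control file, aborting loudly on error. */
static void cg_write(const char *path, const char *value)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(value, f) == EOF) {
                perror(path);
                exit(1);
        }
        fclose(f);
}

int main(void)
{
        /* 100MB/s group: tight 100us latency target. */
        cg_write("/sys/fs/cgroup/interfered/io.low",
                 "8:16 rbps=104857600 idle=1000 latency=100");
        /* 10MB/s group(s): relaxed 2000us latency target. */
        cg_write("/sys/fs/cgroup/interferer0/io.low",
                 "8:16 rbps=10485760 idle=1000 latency=2000");
        return 0;
}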

Thanks,
Joseph


Re: testing io.low limit for blk-throttle

2018-04-23 Thread Joseph Qi
Hi Paolo,
What's your idle and latency config?
IMO, io.low will allow others to run more bandwidth if a cgroup's average
idle time is high or its latency is low. In such cases, the low limit won't be
guaranteed.

Thanks,
Joseph

On 18/4/22 17:23, Paolo Valente wrote:
> Hi Shaohua, all,
> at last, I started testing your io.low limit for blk-throttle.  One of
> the things I'm interested in is how good throttling is in achieving a
> high throughput in the presence of realistic, variable workloads.
> 
> However, I seem to have bumped into a totally different problem.  The
> io.low parameter doesn't seem to guarantee what I understand it is meant
> to guarantee: minimum per-group bandwidths.  For example, with
> - one group, the interfered, containing one process that does sequential
>   reads with fio
> - io.low set to 100MB/s for the interfered
> - six other groups, the interferers, with each interferer containing one
>   process doing sequential read with fio
> - io.low set to 10MB/s for each interferer
> - the workload executed on an SSD, with an overall throughput of about 500MB/s
> the interfered gets only 75MB/s.
> 
> In particular, the throughput of the interfered becomes lower and
> lower as the number of interferers is increased.  So you can make it
> become even much lower than the 75MB/s in the example above.  There
> seems to be no control on bandwidth.
> 
> Am I doing something wrong?  Or did I simply misunderstand the goal of
> io.low, and the only parameter for guaranteeing the desired bandwidth to
> a group is io.max (to be used indirectly, by limiting the bandwidth of
> the interferers)?
> 
> If useful for you, you can reproduce the above test very quickly, by
> using the S suite [1] and typing:
> 
> cd thr-lat-with-interference
> sudo ./thr-lat-with-interference.sh -b t -w 1 -W "1000 1000 
> 1000 1000 1000 1000" -n 6 -T "read read read read read read" 
> -R "0 0 0 0 0 0"
> 
> Looking forward to your feedback,
> Paolo
> 
> [1] 
> 


Re: [PATCH v3] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-12 Thread Joseph Qi


On 18/4/11 07:02, Bart Van Assche wrote:
> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> it is no longer safe to access cgroup information during or after the
> blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
> call with blk_queue_enter() / blk_queue_exit().
> 
> Reported-by: Ming Lei <ming@redhat.com>
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
> block cgroup controller")
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Ming Lei <ming@redhat.com>
> Cc: Joseph Qi <joseph...@linux.alibaba.com>

I've tested using the following steps:
1) start a fio job with buffered write;
2) then remove the scsi device that fio writes to:
echo "scsi remove-single-device ${dev}" > /proc/scsi/scsi

After applying this patch, the reported oops has gone.

Tested-by: Joseph Qi <joseph...@linux.alibaba.com>

> ---
> 
> Changes compared to v2: converted two ternary expressions into if-statements.
> 
> Changes compared to v1: guarded the blk_queue_exit() inside the loop with "if 
> (q)".
> 
>  block/blk-core.c | 35 +--
>  1 file changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 34e2f2227fd9..39308e874ffa 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2386,8 +2386,20 @@ blk_qc_t generic_make_request(struct bio *bio)
>* yet.
>*/
>   struct bio_list bio_list_on_stack[2];
> + blk_mq_req_flags_t flags = 0;
> + struct request_queue *q = bio->bi_disk->queue;
>   blk_qc_t ret = BLK_QC_T_NONE;
>  
> + if (bio->bi_opf & REQ_NOWAIT)
> + flags = BLK_MQ_REQ_NOWAIT;
> + if (blk_queue_enter(q, flags) < 0) {
> + if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> + bio_wouldblock_error(bio);
> + else
> + bio_io_error(bio);
> + return ret;
> + }
> +
>   if (!generic_make_request_checks(bio))
>   goto out;
>  
> @@ -2424,11 +2436,22 @@ blk_qc_t generic_make_request(struct bio *bio)
>   bio_list_init(&bio_list_on_stack[0]);
>   current->bio_list = bio_list_on_stack;
>   do {
> - struct request_queue *q = bio->bi_disk->queue;
> - blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
> - BLK_MQ_REQ_NOWAIT : 0;
> + bool enter_succeeded = true;
> +
> + if (unlikely(q != bio->bi_disk->queue)) {
> + if (q)
> + blk_queue_exit(q);
> + q = bio->bi_disk->queue;
> + flags = 0;
> + if (bio->bi_opf & REQ_NOWAIT)
> + flags = BLK_MQ_REQ_NOWAIT;
> + if (blk_queue_enter(q, flags) < 0) {
> + enter_succeeded = false;
> + q = NULL;
> + }
> + }
>  
> - if (likely(blk_queue_enter(q, flags) == 0)) {
> + if (enter_succeeded) {
>   struct bio_list lower, same;
>  
>   /* Create a fresh bio_list for all subordinate requests 
> */
> @@ -2436,8 +2459,6 @@ blk_qc_t generic_make_request(struct bio *bio)
>   bio_list_init(&bio_list_on_stack[0]);
>   ret = q->make_request_fn(q, bio);
>  
> - blk_queue_exit(q);
> -
>   /* sort new bios into those for a lower level
>* and those for the same level
>*/
> @@ -2464,6 +2485,8 @@ blk_qc_t generic_make_request(struct bio *bio)
>   current->bio_list = NULL; /* deactivate */
>  
>  out:
> + if (q)
> + blk_queue_exit(q);
>   return ret;
>  }
>  EXPORT_SYMBOL(generic_make_request);
> 


Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-09 Thread Joseph Qi
Hi Bart,

On 18/4/9 12:47, Bart Van Assche wrote:
> On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
>> The following kernel oops is triggered by 'removing scsi device' during
>> heavy IO.
> 
> Is the below patch sufficient to fix this?
> 
> Thanks,
> 
> Bart.
> 
> 
> Subject: blk-mq: Avoid that submitting a bio concurrently with device removal 
> triggers a crash
> 
> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> it is no longer safe to access cgroup information during or after the
> blk_cleanup_queue() call. Hence check earlier in generic_make_request()
> whether the queue has been marked as "dying".

The oops happens during generic_make_request_checks(), in
blk_throtl_bio() to be exact.
So if we want to bypass a dying queue, we have to check for it before
generic_make_request_checks(), I think.

Thanks,
Joseph

> ---
>  block/blk-core.c | 72 
> +---
>  1 file changed, 37 insertions(+), 35 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index aa8c99fae527..3ac9dd25e04e 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2385,10 +2385,21 @@ blk_qc_t generic_make_request(struct bio *bio)
>* yet.
>*/
>   struct bio_list bio_list_on_stack[2];
> + blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
> + BLK_MQ_REQ_NOWAIT : 0;
> + struct request_queue *q = bio->bi_disk->queue;
>   blk_qc_t ret = BLK_QC_T_NONE;
>  
>   if (!generic_make_request_checks(bio))
> - goto out;
> + return ret;
> +
> + if (blk_queue_enter(q, flags) < 0) {
> + if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT)))
> + bio_wouldblock_error(bio);
> + else
> + bio_io_error(bio);
> + return ret;
> + }
>  
>   /*
>* We only want one ->make_request_fn to be active at a time, else
> @@ -2423,46 +2434,37 @@ blk_qc_t generic_make_request(struct bio *bio)
> - bio_list_init(&bio_list_on_stack[0]);
>   current->bio_list = bio_list_on_stack;
>   do {
> - struct request_queue *q = bio->bi_disk->queue;
> - blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
> - BLK_MQ_REQ_NOWAIT : 0;
> -
> - if (likely(blk_queue_enter(q, flags) == 0)) {
> - struct bio_list lower, same;
> -
> - /* Create a fresh bio_list for all subordinate requests 
> */
> - bio_list_on_stack[1] = bio_list_on_stack[0];
> - bio_list_init(&bio_list_on_stack[0]);
> - ret = q->make_request_fn(q, bio);
> -
> - blk_queue_exit(q);
> -
> - /* sort new bios into those for a lower level
> -  * and those for the same level
> -  */
> - bio_list_init(&lower);
> - bio_list_init(&same);
> - while ((bio = bio_list_pop(&bio_list_on_stack[0])) != 
> NULL)
> - if (q == bio->bi_disk->queue)
> - bio_list_add(&same, bio);
> - else
> - bio_list_add(&lower, bio);
> - /* now assemble so we handle the lowest level first */
> - bio_list_merge(&bio_list_on_stack[0], &lower);
> - bio_list_merge(&bio_list_on_stack[0], &same);
> - bio_list_merge(&bio_list_on_stack[0], 
> &bio_list_on_stack[1]);
> - } else {
> - if (unlikely(!blk_queue_dying(q) &&
> - (bio->bi_opf & REQ_NOWAIT)))
> - bio_wouldblock_error(bio);
> + struct bio_list lower, same;
> +
> + WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT) &&
> +  (bio->bi_opf & REQ_NOWAIT));
> + WARN_ON_ONCE(q != bio->bi_disk->queue);
> + q = bio->bi_disk->queue;
> + /* Create a fresh bio_list for all subordinate requests */
> + bio_list_on_stack[1] = bio_list_on_stack[0];
> + bio_list_init(&bio_list_on_stack[0]);
> + ret = q->make_request_fn(q, bio);
> +
> + /* sort new bios into those for a lower level
> +  * and those for the same level
> +  */
> + bio_list_init(&lower);
> + bio_list_init(&same);
> + while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
> + if (q == bio->bi_disk->queue)
> + bio_list_add(&same, bio);
>   else
> - bio_io_error(bio);
> - }
> + bio_list_add(&lower, bio);
> + /* now assemble so we handle the lowest level first */
> + bio_list_merge(&bio_list_on_stack[0], &lower);
> + bio_list_merge(&bio_list_on_stack[0], 

Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Joseph Qi
Hi Bart,

On 18/4/8 22:50, Bart Van Assche wrote:
> On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
>> The following kernel oops is triggered by 'removing scsi device' during
>> heavy IO.
> 
> How did you trigger this oops?
> 

I can reproduce this oops by the following steps:
1) start a fio job with buffered write;
2) remove the scsi device that fio writes to:
echo "scsi remove-single-device ${dev}" > /proc/scsi/scsi


[PATCH] blk-throttle: fix typo in blkg_print_stat_ios() comments

2018-04-08 Thread Joseph Qi
Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
---
 block/blk-cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694..87367d4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -640,7 +640,7 @@ int blkg_print_stat_bytes(struct seq_file *sf, void *v)
 EXPORT_SYMBOL_GPL(blkg_print_stat_bytes);
 
 /**
- * blkg_print_stat_bytes - seq_show callback for blkg->stat_ios
+ * blkg_print_stat_ios - seq_show callback for blkg->stat_ios
  * @sf: seq_file to print to
  * @v: unused
  *
-- 
1.8.3.1



Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Joseph Qi
This is because scsi_remove_device() will call blk_cleanup_queue(), after
which all blkgs have been destroyed and root_blkg is NULL.
Thus tg is NULL and we trigger a NULL pointer dereference when getting td
from tg (tg->td).
It seems that we cannot simply move blkcg_exit_queue() up to
blk_cleanup_queue().
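
Just to illustrate the failure mode (a userspace sketch with made-up structure
layouts, not the real blk-throttle types): once the blkgs are gone the lookup
yields nothing, so the derived tg is NULL, and reading tg->td faults at the
small offset of the td member, which would be consistent with the faulting
address 0x28 in the oops quoted below.

#include <stddef.h>
#include <stdio.h>

/* Made-up layouts just to show the offset arithmetic, not the real ones. */
struct throtl_data;

struct throtl_grp {
        char pad[0x28];          /* whatever fields precede ->td */
        struct throtl_data *td;  /* hypothetical offset 0x28     */
};

/* Stand-in for the blkg/tg lookup after blk_cleanup_queue(): nothing left. */
static struct throtl_grp *lookup_tg(void)
{
        return NULL;             /* root_blkg is gone, so no tg either */
}

int main(void)
{
        struct throtl_grp *tg = lookup_tg();

        /* Dereferencing tg->td here would read address 0x0 + 0x28. */
        printf("tg is %p, tg->td would fault at offset 0x%zx\n",
               (void *)tg, offsetof(struct throtl_grp, td));
        return 0;
}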

Thanks,
Joseph

On 18/4/8 12:21, Ming Lei wrote:
> Hi,
> 
> The following kernel oops is triggered by 'removing scsi device' during
> heavy IO.
> 
> 'git bisect' shows that commit a063057d7c731cffa7d10740(block: Fix a race
> between request queue removal and the block cgroup controller)
> introduced this regression:
> 
> [   42.268257] BUG: unable to handle kernel NULL pointer dereference at 
> 0028
> [   42.269339] PGD 26bd9f067 P4D 26bd9f067 PUD 26bfec067 PMD 0 
> [   42.270077] Oops:  [#1] PREEMPT SMP NOPTI
> [   42.270681] Dumping ftrace buffer:
> [   42.271141](ftrace buffer empty)
> [   42.271641] Modules linked in: scsi_debug iTCO_wdt iTCO_vendor_support 
> crc32c_intel i2c_i801 i2c_core lpc_ich mfd_core usb_storage nvme shpchp 
> nvme_core virtio_scsi qemu_fw_cfg ip_tables
> [   42.273770] CPU: 5 PID: 1076 Comm: fio Not tainted 4.16.0+ #49
> [   42.274530] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.10.2-2.fc27 04/01/2014
> [   42.275634] RIP: 0010:blk_throtl_bio+0x41/0x904
> [   42.276225] RSP: 0018:c900033cfaa0 EFLAGS: 00010246
> [   42.276907] RAX: 8000 RBX: 8801bdcc5118 RCX: 
> 0001
> [   42.277818] RDX: 8801bdcc5118 RSI:  RDI: 
> 8802641f8870
> [   42.278733] RBP:  R08: 0001 R09: 
> c900033cfb94
> [   42.279651] R10: c900033cfc00 R11: 06ea R12: 
> 8802641f8870
> [   42.280567] R13: 88026f34f000 R14:  R15: 
> 8801bdcc5118
> [   42.281489] FS:  7fc123922d40() GS:880272f4() 
> knlGS:
> [   42.282525] CS:  0010 DS:  ES:  CR0: 80050033
> [   42.283270] CR2: 0028 CR3: 00026d7ac004 CR4: 
> 007606e0
> [   42.284194] DR0:  DR1:  DR2: 
> 
> [   42.285116] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   42.286036] PKRU: 5554
> [   42.286393] Call Trace:
> [   42.286725]  ? try_to_wake_up+0x3a3/0x3c9
> [   42.287255]  ? blk_mq_hctx_notify_dead+0x135/0x135
> [   42.287880]  ? gup_pud_range+0xb5/0x7e1
> [   42.288381]  generic_make_request_checks+0x3cf/0x539
> [   42.289027]  ? gup_pgd_range+0x8e/0xaa
> [   42.289515]  generic_make_request+0x38/0x25b
> [   42.290078]  ? submit_bio+0x103/0x11f
> [   42.290555]  submit_bio+0x103/0x11f
> [   42.291018]  ? bio_iov_iter_get_pages+0xe4/0x104
> [   42.291620]  blkdev_direct_IO+0x2a3/0x3af
> [   42.292151]  ? kiocb_free+0x34/0x34
> [   42.292607]  ? ___preempt_schedule+0x16/0x18
> [   42.293168]  ? preempt_schedule_common+0x4c/0x65
> [   42.293771]  ? generic_file_read_iter+0x96/0x110
> [   42.294377]  generic_file_read_iter+0x96/0x110
> [   42.294962]  aio_read+0xca/0x13b
> [   42.295388]  ? preempt_count_add+0x6d/0x8c
> [   42.295926]  ? aio_read_events+0x287/0x2d6
> [   42.296460]  ? do_io_submit+0x4d2/0x62c
> [   42.296964]  do_io_submit+0x4d2/0x62c
> [   42.297446]  ? do_syscall_64+0x9d/0x15e
> [   42.297950]  do_syscall_64+0x9d/0x15e
> [   42.298431]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [   42.299090] RIP: 0033:0x7fc12244e687
> [   42.299556] RSP: 002b:7ffe18388a68 EFLAGS: 0202 ORIG_RAX: 
> 00d1
> [   42.300528] RAX: ffda RBX: 7fc0fde08670 RCX: 
> 7fc12244e687
> [   42.301442] RDX: 01d1b388 RSI: 0001 RDI: 
> 7fc123782000
> [   42.302359] RBP: 22d8 R08: 0001 R09: 
> 01c461e0
> [   42.303275] R10:  R11: 0202 R12: 
> 7fc0fde08670
> [   42.304195] R13:  R14: 01d1d0c0 R15: 
> 01b872f0
> [   42.305117] Code: 48 85 f6 48 89 7c 24 10 75 0e 48 8b b7 b8 05 00 00 31 ed 
> 48 85 f6 74 0f 48 63 05 75 a4 e4 00 48 8b ac c6 28 02 00 00 f6 43 15 02 <48> 
> 8b 45 28 48 89 04 24 0f 85 28 08 00 00 8b 43 10 45 31 e4 83 
> [   42.307553] RIP: blk_throtl_bio+0x41/0x904 RSP: c900033cfaa0
> [   42.308328] CR2: 0028
> [   42.308920] ---[ end trace f53a144979f63b29 ]---
> [   42.309520] Kernel panic - not syncing: Fatal exception
> [   42.310635] Dumping ftrace buffer:
> [   42.311087](ftrace buffer empty)
> [   42.311583] Kernel Offset: disabled
> [   42.312163] ---[ end Kernel panic - not syncing: Fatal exception ]---
> 


[PATCH v6] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-16 Thread Joseph Qi
We've triggered a WARNING in blk_throtl_bio() when throttling writeback
io, which complains blkg->refcnt is already 0 when calling blkg_get(),
and then kernel crashes with invalid page request.
After investigating this issue, we've found it is caused by a race
between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
below:

writeback kworker   cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)
blk_throtl_bio
spin_lock_irq(q->queue_lock)
...
spin_unlock_irq(q->queue_lock)
  rcu_read_unlock

Since rcu can only prevent the blkg from being released while it is being
used, the blkg->refcnt can be decreased to 0 during blkg_destroy(), which
schedules the blkg release.
Then trying to blkg_get() in blk_throtl_bio() will trigger the WARNING.
And then the corresponding blkg_put() will schedule the blkg release again,
which results in a double free.
This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, it would look up
first and then try to lookup/create again with queue_lock held. Since
reviving this logic would be a bit drastic, fix it by only offlining the pd
during blkcg_css_offline(), and moving the rest of the destruction
(especially blkg_put()) into blkcg_css_free(), which should be the right
way as discussed.

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
blkcg_bio_issue_check()")
Reported-by: Jiufei Xue <jiufei@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
Acked-by: Tejun Heo <t...@kernel.org>
---
 block/blk-cgroup.c | 78 --
 include/linux/blk-cgroup.h |  1 +
 2 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c2033a2..1c16694 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -307,11 +307,28 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
}
 }
 
+static void blkg_pd_offline(struct blkcg_gq *blkg)
+{
+   int i;
+
+   lockdep_assert_held(blkg->q->queue_lock);
+   lockdep_assert_held(&blkg->blkcg->lock);
+
+   for (i = 0; i < BLKCG_MAX_POLS; i++) {
+   struct blkcg_policy *pol = blkcg_policy[i];
+
+   if (blkg->pd[i] && !blkg->pd[i]->offline &&
+   pol->pd_offline_fn) {
+   pol->pd_offline_fn(blkg->pd[i]);
+   blkg->pd[i]->offline = true;
+   }
+   }
+}
+
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
struct blkcg *blkcg = blkg->blkcg;
struct blkcg_gq *parent = blkg->parent;
-   int i;
 
lockdep_assert_held(blkg->q->queue_lock);
lockdep_assert_held(&blkcg->lock);
@@ -320,13 +337,6 @@ static void blkg_destroy(struct blkcg_gq *blkg)
WARN_ON_ONCE(list_empty(&blkg->q_node));
WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
 
-   for (i = 0; i < BLKCG_MAX_POLS; i++) {
-   struct blkcg_policy *pol = blkcg_policy[i];
-
-   if (blkg->pd[i] && pol->pd_offline_fn)
-   pol->pd_offline_fn(blkg->pd[i]);
-   }
-
if (parent) {
blkg_rwstat_add_aux(&parent->stat_bytes, &blkg->stat_bytes);
blkg_rwstat_add_aux(&parent->stat_ios, &blkg->stat_ios);
@@ -369,6 +379,7 @@ static void blkg_destroy_all(struct request_queue *q)
struct blkcg *blkcg = blkg->blkcg;
 
spin_lock(&blkcg->lock);
+   blkg_pd_offline(blkg);
blkg_destroy(blkg);
spin_unlock(&blkcg->lock);
}
@@ -995,25 +1006,25 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
  * @css: css of interest
  *
  * This function is called when @css is about to go away and responsible
- * for shooting down all blkgs associated with @css.  blkgs should be
- * removed while holding both q and blkcg locks.  As blkcg lock is nested
- * inside q lock, this function performs reverse double lock dancing.
+ * for offlining all blkgs pd and killing all wbs associated with @css.
+ * blkgs pd offline should be done while holding both q and blkcg locks.
+ * As blkcg lock is nested inside q lock, this function performs reverse
+ * double lock dancing.
  *
  * This is the blkcg counterpart of ioc_release_fn().
  */
 static void blkcg

Re: [PATCH v5] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-14 Thread Joseph Qi
Hello Tejun,

Thanks for your quick response.

On 18/3/14 22:09, Tejun Heo wrote:
> Hello,
> 
> On Wed, Mar 14, 2018 at 02:18:04PM +0800, Joseph Qi wrote:
>> Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
>> blkcg_bio_issue_check()")
>> Reported-by: Jiufei Xue <jiufei@linux.alibaba.com>
>> Cc: sta...@vger.kernel.org #4.3+
> 
> I'm a bit nervous about tagging it for -stable.  Given the low rate of
> this actually occurring, I'm not sure the benefits outweigh the risks.
> Let's at least cook it for a couple releases before sending it to
> -stable.
> 
>> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
>> index 69bea82..dccd102 100644
>> --- a/include/linux/blk-cgroup.h
>> +++ b/include/linux/blk-cgroup.h
>> @@ -88,6 +88,7 @@ struct blkg_policy_data {
>>  /* the blkg and policy id this per-policy data belongs to */
>>  struct blkcg_gq *blkg;
>>  int plid;
>> +bool offlined;
>>  };
> 
> This is pure bike-shedding but offlined reads kinda weird to me, maybe
> just offline would read better?  Other than that,
> 
Do I need to resend a new version for this?

Thanks,
Joseph

>  Acked-by: Tejun Heo <t...@kernel.org>
> 
> Thanks a lot for seeing this through.
> 


[PATCH v5] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-14 Thread Joseph Qi
We've triggered a WARNING in blk_throtl_bio() when throttling writeback
io, which complains blkg->refcnt is already 0 when calling blkg_get(),
and then kernel crashes with invalid page request.
After investigating this issue, we've found it is caused by a race
between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
below:

writeback kworker   cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)
blk_throtl_bio
spin_lock_irq(q->queue_lock)
...
spin_unlock_irq(q->queue_lock)
  rcu_read_unlock

Since rcu can only prevent the blkg from being released while it is being
used, the blkg->refcnt can be decreased to 0 during blkg_destroy(), which
schedules the blkg release.
Then trying to blkg_get() in blk_throtl_bio() will trigger the WARNING.
And then the corresponding blkg_put() will schedule the blkg release again,
which results in a double free.
This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, it would look up
first and then try to lookup/create again with queue_lock held. Since
reviving this logic would be a bit drastic, fix it by only offlining the pd
during blkcg_css_offline(), and moving the rest of the destruction
(especially blkg_put()) into blkcg_css_free(), which should be the right
way as discussed.

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
blkcg_bio_issue_check()")
Reported-by: Jiufei Xue <jiufei@linux.alibaba.com>
Cc: sta...@vger.kernel.org #4.3+
Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
---
 block/blk-cgroup.c | 78 --
 include/linux/blk-cgroup.h |  1 +
 2 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c2033a2..92112f4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -307,11 +307,28 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
}
 }
 
+static void blkg_pd_offline(struct blkcg_gq *blkg)
+{
+   int i;
+
+   lockdep_assert_held(blkg->q->queue_lock);
+   lockdep_assert_held(&blkg->blkcg->lock);
+
+   for (i = 0; i < BLKCG_MAX_POLS; i++) {
+   struct blkcg_policy *pol = blkcg_policy[i];
+
+   if (blkg->pd[i] && !blkg->pd[i]->offlined &&
+   pol->pd_offline_fn) {
+   pol->pd_offline_fn(blkg->pd[i]);
+   blkg->pd[i]->offlined = true;
+   }
+   }
+}
+
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
struct blkcg *blkcg = blkg->blkcg;
struct blkcg_gq *parent = blkg->parent;
-   int i;
 
lockdep_assert_held(blkg->q->queue_lock);
lockdep_assert_held(&blkcg->lock);
@@ -320,13 +337,6 @@ static void blkg_destroy(struct blkcg_gq *blkg)
WARN_ON_ONCE(list_empty(&blkg->q_node));
WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
 
-   for (i = 0; i < BLKCG_MAX_POLS; i++) {
-   struct blkcg_policy *pol = blkcg_policy[i];
-
-   if (blkg->pd[i] && pol->pd_offline_fn)
-   pol->pd_offline_fn(blkg->pd[i]);
-   }
-
if (parent) {
blkg_rwstat_add_aux(&parent->stat_bytes, &blkg->stat_bytes);
blkg_rwstat_add_aux(&parent->stat_ios, &blkg->stat_ios);
@@ -369,6 +379,7 @@ static void blkg_destroy_all(struct request_queue *q)
struct blkcg *blkcg = blkg->blkcg;
 
spin_lock(&blkcg->lock);
+   blkg_pd_offline(blkg);
blkg_destroy(blkg);
spin_unlock(&blkcg->lock);
}
@@ -995,25 +1006,25 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
  * @css: css of interest
  *
  * This function is called when @css is about to go away and responsible
- * for shooting down all blkgs associated with @css.  blkgs should be
- * removed while holding both q and blkcg locks.  As blkcg lock is nested
- * inside q lock, this function performs reverse double lock dancing.
+ * for offlining all blkgs pd and killing all wbs associated with @css.
+ * blkgs pd offline should be done while holding both q and blkcg locks.
+ * As blkcg lock is nested inside q lock, this function performs reverse
+ * double lock dancing.
  *
  * This is the blkcg counterpart of ioc_release_fn().
  */
 static void blkcg_css_offline(struct 

Re: [PATCH v3] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-13 Thread Joseph Qi
Fine, I will do the corresponding changes and post v5.

Thanks,
Joseph

On 18/3/14 04:19, Tejun Heo wrote:
> Hello, Joseph.
> 
> On Mon, Mar 05, 2018 at 11:03:16PM +0800, Joseph Qi wrote:
>> +static void blkg_pd_offline(struct blkcg_gq *blkg)
>> +{
>> +int i;
>> +
>> +lockdep_assert_held(blkg->q->queue_lock);
>> +lockdep_assert_held(&blkg->blkcg->lock);
>> +
>> +for (i = 0; i < BLKCG_MAX_POLS; i++) {
>> +struct blkcg_policy *pol = blkcg_policy[i];
>> +
>> +if (blkg->pd[i] && !blkg->pd_offline[i] && pol->pd_offline_fn) {
>> +pol->pd_offline_fn(blkg->pd[i]);
>> +blkg->pd_offline[i] = true;
> 
> Can we move this flag into blkg_policy_data?
> 
>> +while (!hlist_empty(&blkcg->blkg_list)) {
>> +struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
>> +struct blkcg_gq,
>> +blkcg_node);
>> +struct request_queue *q = blkg->q;
>> +
>> +if (spin_trylock(q->queue_lock)) {
>> +blkg_destroy(blkg);
>> +spin_unlock(q->queue_lock);
>> +} else {
>> +spin_unlock_irq(&blkcg->lock);
>> +cpu_relax();
>> +spin_lock_irq(&blkcg->lock);
>> +}
> 
> Can we factor out the above loop?  It's something subtle and painful
> and I think it'd be better to have it separated out and documented.
> 
> Other than that, looks great.
> 
> Thanks.
> 


Re: [PATCH v4] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-08 Thread Joseph Qi
Hello Tejun,
Could you please help review this version?

Thanks,
Joseph

On 18/3/6 11:53, Joseph Qi wrote:
> We've triggered a WARNING in blk_throtl_bio() when throttling writeback
> io, which complains blkg->refcnt is already 0 when calling blkg_get(),
> and then kernel crashes with invalid page request.
> After investigating this issue, we've found it is caused by a race
> between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
> below:
> 
> writeback kworker   cgroup_rmdir
>   cgroup_destroy_locked
> kill_css
>   css_killed_ref_fn
> css_killed_work_fn
>   offline_css
> blkcg_css_offline
>   blkcg_bio_issue_check
> rcu_read_lock
> blkg_lookup
>   spin_trylock(q->queue_lock)
>   blkg_destroy
>   spin_unlock(q->queue_lock)
> blk_throtl_bio
> spin_lock_irq(q->queue_lock)
> ...
> spin_unlock_irq(q->queue_lock)
>   rcu_read_unlock
> 
> Since rcu can only prevent the blkg from being released while it is being
> used, the blkg->refcnt can be decreased to 0 during blkg_destroy(), which
> schedules the blkg release.
> Then trying to blkg_get() in blk_throtl_bio() will trigger the WARNING.
> And then the corresponding blkg_put() will schedule the blkg release again,
> which results in a double free.
> This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg
> creation in blkcg_bio_issue_check()"). Before this commit, it would look up
> first and then try to lookup/create again with queue_lock held. Since
> reviving this logic would be a bit drastic, fix it by only offlining the pd
> during blkcg_css_offline(), and moving the rest of the destruction
> (especially blkg_put()) into blkcg_css_free(), which should be the right
> way as discussed.
> 
> Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
> blkcg_bio_issue_check()")
> Reported-by: Jiufei Xue <jiufei@linux.alibaba.com>
> Cc: sta...@vger.kernel.org #4.3+
> Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
> ---
>  block/blk-cgroup.c | 57 
> +++---
>  include/linux/blk-cgroup.h |  2 ++
>  2 files changed, 46 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index c2033a2..450cefd 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -307,11 +307,27 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>   }
>  }
>  
> +static void blkg_pd_offline(struct blkcg_gq *blkg)
> +{
> + int i;
> +
> + lockdep_assert_held(blkg->q->queue_lock);
> + lockdep_assert_held(&blkg->blkcg->lock);
> +
> + for (i = 0; i < BLKCG_MAX_POLS; i++) {
> + struct blkcg_policy *pol = blkcg_policy[i];
> +
> + if (blkg->pd[i] && !blkg->pd_offline[i] && pol->pd_offline_fn) {
> + pol->pd_offline_fn(blkg->pd[i]);
> + blkg->pd_offline[i] = true;
> + }
> + }
> +}
> +
>  static void blkg_destroy(struct blkcg_gq *blkg)
>  {
>   struct blkcg *blkcg = blkg->blkcg;
>   struct blkcg_gq *parent = blkg->parent;
> - int i;
>  
>   lockdep_assert_held(blkg->q->queue_lock);
>   lockdep_assert_held(&blkcg->lock);
> @@ -320,13 +336,6 @@ static void blkg_destroy(struct blkcg_gq *blkg)
>   WARN_ON_ONCE(list_empty(&blkg->q_node));
>   WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
>  
> - for (i = 0; i < BLKCG_MAX_POLS; i++) {
> - struct blkcg_policy *pol = blkcg_policy[i];
> -
> - if (blkg->pd[i] && pol->pd_offline_fn)
> - pol->pd_offline_fn(blkg->pd[i]);
> - }
> -
>   if (parent) {
>   blkg_rwstat_add_aux(&parent->stat_bytes, &blkg->stat_bytes);
>   blkg_rwstat_add_aux(&parent->stat_ios, &blkg->stat_ios);
> @@ -369,6 +378,7 @@ static void blkg_destroy_all(struct request_queue *q)
>   struct blkcg *blkcg = blkg->blkcg;
>  
>   spin_lock(&blkcg->lock);
> + blkg_pd_offline(blkg);
>   blkg_destroy(blkg);
>   spin_unlock(&blkcg->lock);
>   }
> @@ -1004,16 +1014,15 @@ static int blkcg_print_stat(struct seq_file *sf, void 
> *v)
>  static void blkcg_css_offline(struct cgroup_subsys_sta

[PATCH v3] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-05 Thread Joseph Qi
We've triggered a WARNING in blk_throtl_bio() when throttling writeback
io, which complains blkg->refcnt is already 0 when calling blkg_get(),
and then kernel crashes with invalid page request.
After investigating this issue, we've found it is caused by a race
between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
below:

writeback kworker   cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)
blk_throtl_bio
spin_lock_irq(q->queue_lock)
...
spin_unlock_irq(q->queue_lock)
  rcu_read_unlock

Since rcu can only prevent the blkg from being released while it is being
used, the blkg->refcnt can be decreased to 0 during blkg_destroy(), which
schedules the blkg release.
Then trying to blkg_get() in blk_throtl_bio() will trigger the WARNING.
And then the corresponding blkg_put() will schedule the blkg release again,
which results in a double free.
This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, it would look up
first and then try to lookup/create again with queue_lock held. Since
reviving this logic would be a bit drastic, fix it by only offlining the pd
during blkcg_css_offline(), and moving the rest of the destruction
(especially blkg_put()) into blkcg_css_free(), which should be the right
way as discussed.

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
blkcg_bio_issue_check()")
Reported-by: Jiufei Xue <jiufei@linux.alibaba.com>
Cc: sta...@vger.kernel.org #4.3+
Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
---
 block/blk-cgroup.c | 52 +-
 include/linux/blk-cgroup.h |  2 ++
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c2033a2..2e9f510 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -307,11 +307,27 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
}
 }
 
+static void blkg_pd_offline(struct blkcg_gq *blkg)
+{
+   int i;
+
+   lockdep_assert_held(blkg->q->queue_lock);
+   lockdep_assert_held(&blkg->blkcg->lock);
+
+   for (i = 0; i < BLKCG_MAX_POLS; i++) {
+   struct blkcg_policy *pol = blkcg_policy[i];
+
+   if (blkg->pd[i] && !blkg->pd_offline[i] && pol->pd_offline_fn) {
+   pol->pd_offline_fn(blkg->pd[i]);
+   blkg->pd_offline[i] = true;
+   }
+   }
+}
+
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
struct blkcg *blkcg = blkg->blkcg;
struct blkcg_gq *parent = blkg->parent;
-   int i;
 
lockdep_assert_held(blkg->q->queue_lock);
lockdep_assert_held(&blkcg->lock);
@@ -320,13 +336,6 @@ static void blkg_destroy(struct blkcg_gq *blkg)
WARN_ON_ONCE(list_empty(&blkg->q_node));
WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
 
-   for (i = 0; i < BLKCG_MAX_POLS; i++) {
-   struct blkcg_policy *pol = blkcg_policy[i];
-
-   if (blkg->pd[i] && pol->pd_offline_fn)
-   pol->pd_offline_fn(blkg->pd[i]);
-   }
-
if (parent) {
blkg_rwstat_add_aux(&parent->stat_bytes, &blkg->stat_bytes);
blkg_rwstat_add_aux(&parent->stat_ios, &blkg->stat_ios);
@@ -369,6 +378,7 @@ static void blkg_destroy_all(struct request_queue *q)
struct blkcg *blkcg = blkg->blkcg;
 
spin_lock(&blkcg->lock);
+   blkg_pd_offline(blkg);
blkg_destroy(blkg);
spin_unlock(&blkcg->lock);
}
@@ -1013,7 +1023,7 @@ static void blkcg_css_offline(struct cgroup_subsys_state 
*css)
struct request_queue *q = blkg->q;
 
if (spin_trylock(q->queue_lock)) {
-   blkg_destroy(blkg);
+   blkg_pd_offline(blkg);
spin_unlock(q->queue_lock);
} else {
spin_unlock_irq(&blkcg->lock);
@@ -1032,6 +1042,26 @@ static void blkcg_css_free(struct cgroup_subsys_state 
*css)
struct blkcg *blkcg = css_to_blkcg(css);
int i;
 
+   spin_lock_irq(&blkcg->lock);
+
+   while (!hlist_empty(&blkcg->blkg_list)) {
+   struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first

Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-03-04 Thread Joseph Qi


On 18/3/5 04:23, Tejun Heo wrote:
> Hello, Joseph.
> 
> Sorry about late reply.
> 
> On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
>> In current code, I'm afraid pd_offline_fn() as well as the rest
>> destruction have to be called together under the same blkcg->lock and
>> q->queue_lock.
>> For example, if we split the pd_offline_fn() and radix_tree_delete()
>> into 2 phases, it may introduce a race between blkcg_deactivate_policy()
>> when exit queue and blkcg_css_free(), which will result in
>> pd_offline_fn() to be called twice.
> 
> So, yeah, the sync scheme aroung blkg is pretty brittle and we'd need
> some restructuring to separate out blkg offlining and release, but it
> looks like that'd be the right thing to do, no?
> 
Agree. Except for the restriction above, I haven't found any others so far.
I'll try to fix it in the way you suggested and post v3.

Thanks,
Joseph



Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-27 Thread Joseph Qi
Hi Tejun,

On 18/2/28 02:33, Tejun Heo wrote:
> Hello, Joseph.
> 
> On Sat, Feb 24, 2018 at 09:45:49AM +0800, Joseph Qi wrote:
>>> IIRC, as long as the blkcg and the device are there, the blkgs aren't
>>> gonna be destroyed.  So, if you have a ref to the blkcg through
>>> tryget, the blkg shouldn't go away.
>>>
>>
>> Maybe we have misunderstanding here.
>>
>> In this case, blkg doesn't go away as we have rcu protect, but
>> blkg_destroy() can be called, in which blkg_put() will put the last
>> refcnt and then schedule __blkg_release_rcu().
>>
>> css refcnt can't prevent blkcg css from offlining, instead it is css
>> online_cnt.
>>
>> css_tryget() will only get a refcnt of blkcg css, but can't be
>> guaranteed to fail when css is confirmed to kill.
> 
> Ah, you're right.  I was thinking we only destroy blkgs from blkcg
> release path.  Given that we primarily use blkcg refcnting to pin
> them, I believe that's what we should do - ie. only call
> pd_offline_fn() from blkcg_css_offline() path and do the rest of
> destruction from blkcg_css_free().  What do you think?
> 
In the current code, I'm afraid pd_offline_fn() as well as the rest of the
destruction have to be called together under the same blkcg->lock and
q->queue_lock.
For example, if we split pd_offline_fn() and radix_tree_delete()
into 2 phases, it may introduce a race between blkcg_deactivate_policy()
at queue exit and blkcg_css_free(), which will result in
pd_offline_fn() being called twice.
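
One way such a split could be made safe is to make the offline step idempotent,
e.g. with a per-pd flag checked under the same locks, so that reaching it from
both the css offline path and queue cleanup is harmless. A minimal userspace
sketch of that pattern (hypothetical names, locking omitted):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-policy-data state carrying an "offlined" flag. */
struct toy_pd {
        bool offlined;
};

/* Stand-in for pol->pd_offline_fn(); must only ever run once per pd. */
static void toy_pd_offline_fn(struct toy_pd *pd)
{
        printf("pd_offline_fn(%p) called\n", (void *)pd);
}

/* In the real code the caller would hold blkcg->lock and q->queue_lock. */
static void pd_offline_once(struct toy_pd *pd)
{
        if (pd->offlined)
                return;
        toy_pd_offline_fn(pd);
        pd->offlined = true;
}

int main(void)
{
        struct toy_pd pd = { .offlined = false };

        pd_offline_once(&pd);   /* e.g. from the css offline path   */
        pd_offline_once(&pd);   /* e.g. from queue cleanup: a no-op */
        return 0;
}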

Thanks,
Joseph




Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-26 Thread Joseph Qi
Hi Tejun,

Could you please help take a look at this and give a confirmation?

Thanks,
Joseph

On 18/2/24 09:45, Joseph Qi wrote:
> Hi Tejun,
> 
> On 18/2/23 22:23, Tejun Heo wrote:
>> Hello,
>>
>> On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
>>>> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>>>>> I still don't get how css_tryget can work here.
>>>>>
>>>>> The race happens when:
>>>>> 1) writeback kworker has found the blkg with rcu;
>>>>> 2) blkcg is during offlining and blkg_destroy() has already been called.
>>>>> Then, writeback kworker will take queue lock and access the blkg with
>>>>> refcount 0.
>>>>
>>>> Yeah, then tryget would fail and it should go through the root.
>>>>
>>> In this race, the refcount of blkg becomes zero and is destroyed.
>>> However css may still have refcount, and css_tryget can return success
>>> before other callers put the refcount.
>>> So I don't get how css_tryget can fix this race? Or I wonder if we can
>>> add another function blkg_tryget?
>>
>> IIRC, as long as the blkcg and the device are there, the blkgs aren't
>> gonna be destroyed.  So, if you have a ref to the blkcg through
>> tryget, the blkg shouldn't go away.
>>
> 
> Maybe we have misunderstanding here.
> 
> In this case, blkg doesn't go away as we have rcu protection, but
> blkg_destroy() can be called, in which blkg_put() will put the last
> refcnt and then schedule __blkg_release_rcu().
> 
> css refcnt can't prevent blkcg css from offlining, instead it is css
> online_cnt.
> 
> css_tryget() will only get a refcnt of blkcg css, but can't be
> guaranteed to fail when css is confirmed to kill.
> 
> The race sequence:
> writeback kworker   cgroup_rmdir
>   cgroup_destroy_locked
> kill_css
>   css_killed_ref_fn
> css_killed_work_fn
>   offline_css
> blkcg_css_offline
>   blkcg_bio_issue_check
> rcu_read_lock
> blkg_lookup
>   spin_trylock(q->queue_lock)
>   blkg_destroy
>   spin_unlock(q->queue_lock)
> blk_throtl_bio
>   spin_lock_irq(q->queue_lock)
>   spin_unlock_irq(q->queue_lock)
> rcu_read_unlock
> 
> Thanks,
> Joseph
> 


Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-23 Thread Joseph Qi
Hi Tejun,

On 18/2/23 22:23, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
>>> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>>>> I still don't get how css_tryget can work here.
>>>>
>>>> The race happens when:
>>>> 1) writeback kworker has found the blkg with rcu;
>>>> 2) blkcg is during offlining and blkg_destroy() has already been called.
>>>> Then, writeback kworker will take queue lock and access the blkg with
>>>> refcount 0.
>>>
>>> Yeah, then tryget would fail and it should go through the root.
>>>
>> In this race, the refcount of blkg becomes zero and is destroyed.
>> However css may still have refcount, and css_tryget can return success
>> before other callers put the refcount.
>> So I don't get how css_tryget can fix this race? Or I wonder if we can
>> add another function blkg_tryget?
> 
> IIRC, as long as the blkcg and the device are there, the blkgs aren't
> gonna be destroyed.  So, if you have a ref to the blkcg through
> tryget, the blkg shouldn't go away.
> 

Maybe we have misunderstanding here.

In this case, blkg doesn't go away as we have rcu protection, but
blkg_destroy() can be called, in which blkg_put() will put the last
refcnt and then schedule __blkg_release_rcu().

css refcnt can't prevent blkcg css from offlining, instead it is css
online_cnt.

css_tryget() will only get a refcnt of blkcg css, but can't be
guaranteed to fail when css is confirmed to kill.

The race sequence:
writeback kworker   cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)
blk_throtl_bio
  spin_lock_irq(q->queue_lock)
  spin_unlock_irq(q->queue_lock)
rcu_read_unlock

Thanks,
Joseph


Re: [PATCH v4 6/6] block: Fix a race between request queue removal and the block cgroup controller

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Avoid that the following race can occur:
> 
> blk_cleanup_queue()   blkcg_print_blkgs()
>   spin_lock_irq(lock) (1)   spin_lock_irq(blkg->q->queue_lock) (2,5)
> q->queue_lock = >__queue_lock (3)
>   spin_unlock_irq(lock) (4)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) take driver lock;
> (2) busy loop for driver lock;
> (3) override driver lock with internal lock;
> (4) unlock driver lock;
> (5) can take driver lock now;
> (6) but unlock internal lock.
> 
> This change is safe because only the SCSI core and the NVME core keep
> a reference on a request queue after having called blk_cleanup_queue().
> Neither driver accesses any of the removed data structures between its
> blk_cleanup_queue() and blk_put_queue() calls.
> 
> Reported-by: Joseph Qi <joseph...@linux.alibaba.com>
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Jan Kara <j...@suse.com>

Looks good.
Reviewed-by: Joseph Qi <joseph...@linux.alibaba.com>


Re: [PATCH v4 5/6] block: Fix a race between the cgroup code and request queue initialization

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Initialize the request queue lock earlier such that the following
> race can no longer occur:
> 
> blk_init_queue_node() blkcg_print_blkgs()
>   blk_alloc_queue_node (1)
> q->queue_lock = >__queue_lock (2)
> blkcg_init_queue(q) (3)
> spin_lock_irq(blkg->q->queue_lock) (4)
>   q->queue_lock = lock (5)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) allocate an uninitialized queue;
> (2) initialize queue_lock to its default internal lock;
> (3) initialize blkcg part of request queue, which will create blkg and
> then insert it to blkg_list;
> (4) traverse blkg_list and find the created blkg, and then take its
> queue lock, here it is the default *internal lock*;
> (5) *race window*, now queue_lock is overridden with *driver specified
> lock*;
> (6) now unlock *driver specified lock*, not the locked *internal lock*,
> unlock balance breaks.
> 
> The changes in this patch are as follows:
> - Move the .queue_lock initialization from blk_init_queue_node() into
>   blk_alloc_queue_node().
> - Only override the .queue_lock pointer for legacy queues because it
>   is not useful for blk-mq queues to override this pointer.
> - For all all block drivers that initialize .queue_lock explicitly,
>   change the blk_alloc_queue() call in the driver into a
>   blk_alloc_queue_node() call and remove the explicit .queue_lock
>   initialization. Additionally, initialize the spin lock that will
>   be used as queue lock earlier if necessary.
> 
> Reported-by: Joseph Qi <joseph...@linux.alibaba.com>
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Joseph Qi <joseph...@linux.alibaba.com>
> Cc: Philipp Reisner <philipp.reis...@linbit.com>
> Cc: Ulf Hansson <ulf.hans...@linaro.org>
> Cc: Kees Cook <keesc...@chromium.org>

Looks good.
Reviewed-by: Joseph Qi <joseph...@linux.alibaba.com>


Re: [PATCH v4 3/6] zram: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Remove the disk, partition and bdi sysfs attributes before cleaning up
> the request queue associated with the disk.
> 
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Nitin Gupta <ngu...@vflare.org>
> Cc: Sergey Senozhatsky <sergey.senozhatsky.w...@gmail.com>

Looks good.
Reviewed-by: Joseph Qi <joseph...@linux.alibaba.com>


Re: [PATCH v4 2/6] md: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Remove the disk, partition and bdi sysfs attributes before cleaning up
> the request queue associated with the disk.
> 
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Shaohua Li <s...@kernel.org>

Looks good.
Reviewed-by: Joseph Qi <joseph...@linux.alibaba.com>


Re: [PATCH v4 1/6] block/loop: Delete gendisk before cleaning up the request queue

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> Remove the disk, partition and bdi sysfs attributes before cleaning up
> the request queue associated with the disk.
> 
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Josef Bacik <jba...@fb.com>
> Cc: Shaohua Li <s...@fb.com>
> Cc: Omar Sandoval <osan...@fb.com>
> Cc: Hannes Reinecke <h...@suse.com>
> Cc: Ming Lei <ming@redhat.com>

Looks good.
Reviewed-by: Joseph Qi <joseph...@linux.alibaba.com>


Re: [PATCH v4 4/6] block: Add a third argument to blk_alloc_queue_node()

2018-02-23 Thread Joseph Qi


On 18/2/23 09:08, Bart Van Assche wrote:
> This patch does not change any functionality.
> 
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Joseph Qi <joseph...@linux.alibaba.com>
> Cc: Philipp Reisner <philipp.reis...@linbit.com>
> Cc: Ulf Hansson <ulf.hans...@linaro.org>
> Cc: Kees Cook <keesc...@chromium.org>

Looks good.
Reviewed-by: Joseph Qi <joseph...@linux.alibaba.com>


Re: [PATCH] blk-throttle: avoid multiple counting for same bio

2018-02-22 Thread Joseph Qi


On 18/2/23 11:33, Chengguang Xu wrote:
> Hi Tejun,
> 
> Sorry for delayed reply, I was on vacation last week.
> 
> The problem still exists in current code of 4.16.0-rc2, 
> detail test information is below, if further info is needed please let me 
> know.
> 
> Thanks.
> 
That's true, the issue Shaohua has fixed is the double charge, but the double
stat issue still exists.

Jiufei has posted a fix, which has already been tested by Bo Liu:
[PATCH RESEND] blk-throttle: avoid double counted
https://www.mail-archive.com/linux-block@vger.kernel.org/msg18516.html

Thanks,
Joseph


Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-21 Thread Joseph Qi
Hi Tejun,

Sorry for the delayed reply.

On 18/2/13 01:11, Tejun Heo wrote:
> Hello, Joseph.
> 
> On Fri, Feb 09, 2018 at 10:15:19AM +0800, Joseph Qi wrote:
>> IIUC, we have to identify it is in blkcg_css_offline now which will
>> blkg_put. Since percpu_ref_kill_and_confirm in kill_css will set flag
>> __PERCPU_REF_DEAD, so we can use this to avoid the race. IOW, if
>> __PERCPU_REF_DEAD is set now, we know blkcg css is in offline and
>> continue access blkg may risk double free. Thus we choose to skip these
>> ios.
>> I don't get how css_tryget works since it doesn't care the flag
>> __PERCPU_REF_DEAD. Also css_tryget can't prevent blkcg_css from
>> offlining since the race happens blkcg_css_offline is in progress.
>> Am I missing something here?
> 
> Once marked dead, the ref is in atomic mode and css_tryget() would hit
> the atomic counter.  Here, we don't care about the offlining and
> draining.  A draining memcg can still have a lot of memory to be
> written back attached to it and we don't want punt all of them to the
> root cgroup.

I still don't get how css_tryget can work here.

The race happens when:
1) writeback kworker has found the blkg with rcu;
2) blkcg is during offlining and blkg_destroy() has already been called.
Then, writeback kworker will take queue lock and access the blkg with
refcount 0.

So I think we should either take the queue lock when looking up the blkg
if we want to keep throttling these ios during offline (the way this
patch tries), or use css_tryget_online() to skip any further use of the
risky blkg, which means these ios won't be throttled either.
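
For illustration, a minimal sketch of that second option (the function
below is hypothetical, not the actual blk_throtl_bio code; only the shape
of the check matters):

static bool blk_throtl_bio_sketch(struct request_queue *q,
				  struct blkcg *blkcg, struct bio *bio)
{
	bool throttled = false;

	spin_lock_irq(q->queue_lock);

	/* blkcg already going offline: blkg_destroy() may have run, skip */
	if (!css_tryget_online(&blkcg->css))
		goto out_unlock;

	/* ... blkg_lookup_create(), blkg_get(), queue the bio ... */

	css_put(&blkcg->css);
out_unlock:
	spin_unlock_irq(q->queue_lock);
	return throttled;
}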

Thanks,
Joseph


Re: [PATCH v3 3/3] block: Fix a race between request queue removal and the block cgroup controller

2018-02-21 Thread Joseph Qi
Hi Bart,

Sorry for the delayed response since I was on holiday.

On 18/2/10 02:44, Bart Van Assche wrote:
> Avoid that the following race can occur:
> 
> blk_cleanup_queue()   blkcg_print_blkgs()
>   spin_lock_irq(lock) (1)   spin_lock_irq(blkg->q->queue_lock) (2,5)
> q->queue_lock = &q->__queue_lock (3)
>   spin_unlock_irq(lock) (4)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) take driver lock;
> (2) busy loop for driver lock;
> (3) override driver lock with internal lock;
> (4) unlock driver lock;
> (5) can take driver lock now;
> (6) but unlock internal lock.
> 
> This change is safe because only the SCSI core and the NVME core keep
> a reference on a request queue after having called blk_cleanup_queue().
> Neither driver accesses any of the removed data structures between its
> blk_cleanup_queue() and blk_put_queue() calls.
> 
> Reported-by: Joseph Qi <joseph...@linux.alibaba.com>
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Jan Kara <j...@suse.com>
> ---
>  block/blk-core.c  | 31 +++
>  block/blk-sysfs.c |  7 ---
>  2 files changed, 31 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 41c74b37be85..6febc69a58aa 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -719,6 +719,37 @@ void blk_cleanup_queue(struct request_queue *q)
>   del_timer_sync(&q->backing_dev_info->laptop_mode_wb_timer);
>   blk_sync_queue(q);
>  
> + /*
> +  * I/O scheduler exit is only safe after the sysfs scheduler attribute
> +  * has been removed.
> +  */
> + WARN_ON_ONCE(q->kobj.state_in_sysfs);
> +
I notice that several devices such as loop and zram call
blk_cleanup_queue before del_gendisk, so they will hit this warning. Is
this normal?

Thanks,
Joseph

> + /*
> +  * Since the I/O scheduler exit code may access cgroup information,
> +  * perform I/O scheduler exit before disassociating from the block
> +  * cgroup controller.
> +  */
> + if (q->elevator) {
> + ioc_clear_queue(q);
> + elevator_exit(q, q->elevator);
> + q->elevator = NULL;
> + }
> +
> + /*
> +  * Remove all references to @q from the block cgroup controller before
> +  * restoring @q->queue_lock to avoid that restoring this pointer causes
> +  * e.g. blkcg_print_blkgs() to crash.
> +  */
> + blkcg_exit_queue(q);
> +
> + /*
> +  * Since the cgroup code may dereference the @q->backing_dev_info
> +  * pointer, only decrease its reference count after having removed the
> +  * association with the block cgroup controller.
> +  */
> + bdi_put(q->backing_dev_info);
> +
>   if (q->mq_ops)
>   blk_mq_free_queue(q);
>   percpu_ref_exit(&q->q_usage_counter);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index cbea895a5547..fd71a00c9462 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -798,13 +798,6 @@ static void __blk_release_queue(struct work_struct *work)
>   if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
>   blk_stat_remove_callback(q, q->poll_cb);
>   blk_stat_free_callback(q->poll_cb);
> - bdi_put(q->backing_dev_info);
> - blkcg_exit_queue(q);
> -
> - if (q->elevator) {
> - ioc_clear_queue(q);
> - elevator_exit(q, q->elevator);
> - }
>  
>   blk_free_queue_stats(q->stats);
>  
> 


Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-08 Thread Joseph Qi
Hi Tejun,

On 18/2/8 23:23, Tejun Heo wrote:
> Hello, Joseph.
> 
> On Thu, Feb 08, 2018 at 10:29:43AM +0800, Joseph Qi wrote:
>> So you mean checking css->refcnt to prevent the further use of
>> blkg_get? I think it makes sense.
> 
> Yes.
> 
>> IMO, we should use css_tryget_online instead, and rightly after taking
> 
> Not really.  An offline css still can have a vast amount of IO to
> drain and write out.
> 
IIUC, we have to identify that blkcg_css_offline, which does the
blkg_put, is already in progress. Since percpu_ref_kill_and_confirm in
kill_css sets the flag __PERCPU_REF_DEAD, we can use this to avoid the
race. IOW, if __PERCPU_REF_DEAD is already set, we know the blkcg css is
going offline and continuing to access the blkg may risk a double free.
Thus we choose to skip these ios.
I don't get how css_tryget works here, since it doesn't look at the flag
__PERCPU_REF_DEAD. Also css_tryget can't prevent the blkcg css from
offlining, since the race happens while blkcg_css_offline is in progress.
Am I missing something here?

Thanks,
Joseph

>> queue_lock. Because there may be more use of blkg_get in blk_throtl_bio
>> in the futher. Actually it already has two now. One is in
>> blk_throtl_assoc_bio, and the other is in throtl_qnode_add_bio.
>> What do you think of this?
> 
> As long as we avoid making the bypass paths expensive, whatever makes
> the code easier to read and maintain would be better.  css ref ops are
> extremely cheap anyway.
> 
> Thanks.
> 


Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-07 Thread Joseph Qi
Hi Tejun,
Thanks very much for reviewing this patch.

On 18/2/8 05:38, Tejun Heo wrote:
> Hello, Joseph.
> 
> On Wed, Feb 07, 2018 at 04:40:02PM +0800, Joseph Qi wrote:
>> writeback kworker
>>   blkcg_bio_issue_check
>> rcu_read_lock
>> blkg_lookup
>> <<< *race window*
>> blk_throtl_bio
>>   spin_lock_irq(q->queue_lock)
>>   spin_unlock_irq(q->queue_lock)
>> rcu_read_unlock
>>
>> cgroup_rmdir
>>   cgroup_destroy_locked
>> kill_css
>>   css_killed_ref_fn
>> css_killed_work_fn
>>   offline_css
>> blkcg_css_offline
>>   spin_trylock(q->queue_lock)
>>   blkg_destroy
>>   spin_unlock(q->queue_lock)
> 
> Ah, right.  Thanks for spotting the bug.
> 
>> Since rcu can only prevent blkg from releasing when it is being used,
>> the blkg->refcnt can be decreased to 0 during blkg_destroy and schedule
>> blkg release.
>> Then trying to blkg_get in blk_throtl_bio will complains the WARNING.
>> And then the corresponding blkg_put will schedule blkg release again,
>> which result in double free.
>> This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg
>> creation in blkcg_bio_issue_check()"). Before this commit, it will lookup
>> first and then try to lookup/create again with queue_lock. So revive
>> this logic to fix the race.
> 
> The change seems a bit drastic to me.  Can't we do something like the
> following instead?
> 
> blk_throtl_bio()
> {
>   ... non throttled cases ...
> 
>   /* out-of-limit, queue to @tg */
> 
>   /*
>* We can look up and retry but the race window is tiny here.
>* Just letting it through should be good enough.
>*/
>   if (!css_tryget(blkcg->css))
>   goto out;
> 
>   ... actual queueing ...
>   css_put(blkcg->css);
>   ...
> }
So you mean checking css->refcnt to prevent the further use of
blkg_get? I think it makes sense.
IMO, we should use css_tryget_online instead, and right after taking the
queue_lock, because there may be more uses of blkg_get in blk_throtl_bio
in the future. Actually it already has two now: one is in
blk_throtl_assoc_bio, and the other is in throtl_qnode_add_bio.
What do you think of this?

Thanks,
Joseph


[PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-07 Thread Joseph Qi
We've triggered a WARNING in blk_throtl_bio when throttling writeback
io, which complains that blkg->refcnt is already 0 when calling blkg_get,
and then the kernel crashes with an invalid page request.
After investigating this issue, we've found there is a race between
blkcg_bio_issue_check and cgroup_rmdir. The race is described below.

writeback kworker
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
<<< *race window*
blk_throtl_bio
  spin_lock_irq(q->queue_lock)
  spin_unlock_irq(q->queue_lock)
rcu_read_unlock

cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)

Since rcu can only prevent the blkg from being released while it is in
use, blkg->refcnt can be decreased to 0 during blkg_destroy, which
schedules the blkg release.
Then trying to blkg_get in blk_throtl_bio will trigger the WARNING, and
the corresponding blkg_put will schedule the blkg release again, which
results in a double free.
This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, it would look
up first and then try to lookup/create again with queue_lock held. So
revive this logic to fix the race.

v2: fix a potential NULL pointer dereference when updating stats

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
blkcg_bio_issue_check()")
Reported-by: Jiufei Xue <jiufei@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
Cc: sta...@vger.kernel.org
---
 block/blk-cgroup.c |  2 +-
 block/blk-throttle.c   | 30 ++
 block/cfq-iosched.c| 33 +++--
 include/linux/blk-cgroup.h | 27 +--
 4 files changed, 55 insertions(+), 37 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524..b1d22e5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -162,7 +162,6 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
 
return NULL;
 }
-EXPORT_SYMBOL_GPL(blkg_lookup_slowpath);
 
 /*
  * If @new_blkg is %NULL, this function tries to allocate a new one as
@@ -306,6 +305,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
return blkg;
}
 }
+EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c5a1316..decdd76 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2143,26 +2143,41 @@ static void blk_throtl_assoc_bio(struct throtl_grp *tg, 
struct bio *bio)
 #endif
 }
 
-bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
+bool blk_throtl_bio(struct request_queue *q, struct blkcg *blkcg,
struct bio *bio)
 {
+   struct throtl_data *td = q->td;
struct throtl_qnode *qn = NULL;
-   struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
+   struct throtl_grp *tg;
+   struct blkcg_gq *blkg;
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
-   struct throtl_data *td = tg->td;
 
WARN_ON_ONCE(!rcu_read_lock_held());
 
+   /*
+* If a group has no rules, just update the dispatch stats in lockless
+* manner and return.
+*/
+   blkg = blkg_lookup(blkcg, q);
+   tg = blkg_to_tg(blkg);
+   if (tg && !tg->has_rules[rw])
+   goto out;
+
/* see throtl_charge_bio() */
-   if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
+   if (bio_flagged(bio, BIO_THROTTLED))
goto out;
 
spin_lock_irq(q->queue_lock);
 
throtl_update_latency_buckets(td);
 
+   blkg = blkg_lookup_create(blkcg, q);
+   if (IS_ERR(blkg))
+   blkg = q->root_blkg;
+   tg = blkg_to_tg(blkg);
+
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;
 
@@ -2253,6 +2268,13 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
if (throttled || !td->track_bio_latency)
bio->bi_issue_stat.stat |= SKIP_LATENCY;
 #endif
+   if (!throttled) {
+   blkg = blkg ?: q->root_blkg;
+   blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
+   bio->bi_iter.bi_size);
+   blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
+   }
+
return throttled;
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f342ef..60f53b5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1674,15 +1674,28 @@ static void cfq_pd_reset_stats(struct blkg_policy_data 
*pd)
cfqg_stats_reset(&cfqg->stats);
 }
 
-static struct cfq_group *

[PATCH] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-05 Thread Joseph Qi
We've triggered a WARNING in blk_throtl_bio when throttling writeback
io, which complains that blkg->refcnt is already 0 when calling blkg_get,
and then the kernel crashes with an invalid page request.
After investigating this issue, we've found there is a race between
blkcg_bio_issue_check and cgroup_rmdir. The race is described below.

writeback kworker
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
<<< *race window*
blk_throtl_bio
  spin_lock_irq(q->queue_lock)
  spin_unlock_irq(q->queue_lock)
rcu_read_unlock

cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)

Since rcu can only prevent the blkg from being released while it is in
use, blkg->refcnt can be decreased to 0 during blkg_destroy, which
schedules the blkg release.
Then trying to blkg_get in blk_throtl_bio will trigger the WARNING, and
the corresponding blkg_put will schedule the blkg release again, which
results in a double free.
This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, it would look
up first and then try to lookup/create again with queue_lock held. So
revive this logic to fix the race.

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
blkcg_bio_issue_check()")
Reported-by: Jiufei Xue <jiufei@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
CC: sta...@vger.kernel.org
---
 block/blk-cgroup.c |  2 +-
 block/blk-throttle.c   | 29 +
 block/cfq-iosched.c| 33 +++--
 include/linux/blk-cgroup.h | 27 +--
 4 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524..b1d22e5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -162,7 +162,6 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
 
return NULL;
 }
-EXPORT_SYMBOL_GPL(blkg_lookup_slowpath);
 
 /*
  * If @new_blkg is %NULL, this function tries to allocate a new one as
@@ -306,6 +305,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
return blkg;
}
 }
+EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c5a1316..ec830be 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2143,26 +2143,41 @@ static void blk_throtl_assoc_bio(struct throtl_grp *tg, 
struct bio *bio)
 #endif
 }
 
-bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
+bool blk_throtl_bio(struct request_queue *q, struct blkcg *blkcg,
struct bio *bio)
 {
+   struct throtl_data *td = q->td;
struct throtl_qnode *qn = NULL;
-   struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
+   struct throtl_grp *tg;
+   struct blkcg_gq *blkg;
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
-   struct throtl_data *td = tg->td;
 
WARN_ON_ONCE(!rcu_read_lock_held());
 
+   /*
+* If a group has no rules, just update the dispatch stats in lockless
+* manner and return.
+*/
+   blkg = blkg_lookup(blkcg, q);
+   tg = blkg_to_tg(blkg);
+   if (tg && !tg->has_rules[rw])
+   goto out;
+
/* see throtl_charge_bio() */
-   if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
+   if (bio_flagged(bio, BIO_THROTTLED))
goto out;
 
spin_lock_irq(q->queue_lock);
 
throtl_update_latency_buckets(td);
 
+   blkg = blkg_lookup_create(blkcg, q);
+   if (IS_ERR(blkg))
+   blkg = q->root_blkg;
+   tg = blkg_to_tg(blkg);
+
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;
 
@@ -2253,6 +2268,12 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
if (throttled || !td->track_bio_latency)
bio->bi_issue_stat.stat |= SKIP_LATENCY;
 #endif
+   if (!throttled) {
+   blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
+   bio->bi_iter.bi_size);
+   blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
+   }
+
return throttled;
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f342ef..60f53b5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1674,15 +1674,28 @@ static void cfq_pd_reset_stats(struct blkg_policy_data 
*pd)
cfqg_stats_reset(&cfqg->stats);
 }
 
-static struct cfq_group *cfq_lookup_cfqg(struct cfq_data *cfqd,
-struct blkcg *blkcg)
+/

Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-02-02 Thread Joseph Qi
Hi Bart,

On 18/2/3 00:21, Bart Van Assche wrote:
> On Fri, 2018-02-02 at 09:02 +0800, Joseph Qi wrote:
>> We triggered this race when using single queue. I'm not sure if it
>> exists in multi-queue.
> 
> Regarding the races between modifying the queue_lock pointer and the code that
> uses that pointer, I think the following construct in blk_cleanup_queue() is
> sufficient to avoid races between the queue_lock pointer assignment and the 
> code
> that executes concurrently with blk_cleanup_queue():
> 
>   spin_lock_irq(lock);
>   if (q->queue_lock != &q->__queue_lock)
>   q->queue_lock = &q->__queue_lock;
>   spin_unlock_irq(lock);
> 
IMO, the race also exists.

blk_cleanup_queue   blkcg_print_blkgs
  spin_lock_irq(lock) (1)   spin_lock_irq(blkg->q->queue_lock) (2,5)
q->queue_lock = &q->__queue_lock (3)
  spin_unlock_irq(lock) (4)
spin_unlock_irq(blkg->q->queue_lock) (6)

(1) take driver lock;
(2) busy loop for driver lock;
(3) override driver lock with internal lock;
(4) unlock driver lock; 
(5) can take driver lock now;
(6) but unlock internal lock.

If we take blkg->q->queue_lock into a local variable first, like
blk_cleanup_queue does, it can indeed fix the mismatched lock/unlock.
But since blk_cleanup_queue has already overridden the queue lock with
the internal lock by then, I'm afraid we still couldn't keep using the
driver lock in blkcg_print_blkgs.
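
For reference, this is the kind of snapshot I mean (a sketch only, with a
hypothetical function name, not the current blkcg_print_blkgs code):

static void blkcg_print_one_blkg_sketch(struct blkcg_gq *blkg)
{
	/*
	 * Snapshot the pointer so that lock and unlock always use the
	 * same spinlock, even if blk_cleanup_queue() rewrites
	 * blkg->q->queue_lock to &q->__queue_lock in between.
	 */
	spinlock_t *lock = blkg->q->queue_lock;

	spin_lock_irq(lock);
	/* ... print this blkg's stats ... */
	spin_unlock_irq(lock);
}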

Thanks,
Joseph

> In other words, I think that this patch series should be sufficient to address
> all races between .queue_lock assignments and the code that uses that pointer.
> 
> Thanks,
> 
> Bart.
> 


Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-02-01 Thread Joseph Qi
Hi Bart,

On 18/2/2 00:16, Bart Van Assche wrote:
> On Thu, 2018-02-01 at 09:53 +0800, Joseph Qi wrote:
>> I'm afraid the risk may also exist in blk_cleanup_queue, which will
>> set queue_lock to to the default internal lock.
>>
>> spin_lock_irq(lock);
>> if (q->queue_lock != &q->__queue_lock)
>>  q->queue_lock = &q->__queue_lock;
>> spin_unlock_irq(lock);
>>
>> I'm thinking of getting blkg->q->queue_lock to local first, but this
>> will result in still using driver lock even the queue_lock has already
>> been set to the default internal lock.
> 
> Hello Joseph,
> 
> I think the race between the queue_lock assignment in blk_cleanup_queue()
> and the use of that pointer by cgroup attributes could be solved by
> removing the visibility of these attributes from blk_cleanup_queue() instead
> of __blk_release_queue(). However, last time I proposed to move code from
> __blk_release_queue() into blk_cleanup_queue() I received the feedback that
> from some kernel developers that they didn't like this.
> 
> Is the block driver that triggered the race on the q->queue_lock assignment
> using legacy (single queue) or multiqueue (blk-mq) mode? If that driver is
> using legacy mode, are you aware that there are plans to remove legacy mode
> from the upstream kernel? And if your driver is using multiqueue mode, how
> about the following change instead of the two patches in this patch series:
> 
We triggered this race when using single queue. I'm not sure if it
exists in multi-queue.
Do you mean upstream won't fix bugs any more in single queue?

Thanks,
Joseph

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1093,7 +1093,7 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t 
> *lock, int node_id)
>   return NULL;
>  
>   q->request_fn = rfn;
> - if (lock)
> + if (!q->mq_ops && lock)
>   q->queue_lock = lock;
>   if (blk_init_allocated_queue(q) < 0) {
>   blk_cleanup_queue(q);
> 
> Thanks,
> 
> Bart.
> 


Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-01-31 Thread Joseph Qi
Hi Bart,

On 18/2/1 07:53, Bart Van Assche wrote:
> Initialize the request queue lock earlier such that the following
> race can no longer occur:
> 
> blk_init_queue_node blkcg_print_blkgs
>   blk_alloc_queue_node (1)
> q->queue_lock = &q->__queue_lock (2)
> blkcg_init_queue(q) (3)
> spin_lock_irq(blkg->q->queue_lock) (4)
>   q->queue_lock = lock (5)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) allocate an uninitialized queue;
> (2) initialize queue_lock to its default internal lock;
> (3) initialize blkcg part of request queue, which will create blkg and
> then insert it to blkg_list;
> (4) traverse blkg_list and find the created blkg, and then take its
> queue lock, here it is the default *internal lock*;
> (5) *race window*, now queue_lock is overridden with *driver specified
> lock*;
> (6) now unlock *driver specified lock*, not the locked *internal lock*,
> unlock balance breaks.
> 
> The changes in this patch are as follows:
> - Move the .queue_lock initialization from blk_init_queue_node() into
>   blk_alloc_queue_node().
> - For all all block drivers that initialize .queue_lock explicitly,
>   change the blk_alloc_queue() call in the driver into a
>   blk_alloc_queue_node() call and remove the explicit .queue_lock
>   initialization. Additionally, initialize the spin lock that will
>   be used as queue lock earlier if necessary.
> 
I'm afraid the risk may also exist in blk_cleanup_queue, which will
set queue_lock back to the default internal lock.

spin_lock_irq(lock);
if (q->queue_lock != &q->__queue_lock)
q->queue_lock = &q->__queue_lock;
spin_unlock_irq(lock);

I'm thinking of taking blkg->q->queue_lock into a local variable first,
but this will result in still using the driver lock even after
queue_lock has already been set back to the default internal lock.

Thanks,
Joseph


Re: [RFC] hard LOCKUP caused by race between blk_init_queue_node and blkcg_print_blkgs

2018-01-30 Thread Joseph Qi
Hi Bart,
Thanks very much for the quick response.

On 18/1/31 05:19, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 19:21 +0800, Joseph Qi wrote:
>> Hi Jens and Folks,
>>
>> Recently we've gotten a hard LOCKUP issue. After investigating the issue
>> we've found a race between blk_init_queue_node and blkcg_print_blkgs.
>> The race is described below.
>>
>> blk_init_queue_node blkcg_print_blkgs
>>   blk_alloc_queue_node (1)
>> q->queue_lock = &q->__queue_lock (2)
>> blkcg_init_queue(q) (3)
>> spin_lock_irq(blkg->q->queue_lock) (4)
>>   q->queue_lock = lock (5)
>> spin_unlock_irq(blkg->q->queue_lock) (6)
>>
>> (1) allocate an uninitialized queue;
>> (2) initialize queue_lock to its default internal lock;
>> (3) initialize blkcg part of request queue, which will create blkg and
>> then insert it to blkg_list;
>> (4) traverse blkg_list and find the created blkg, and then take its
>> queue lock, here it is the default *internal lock*;
>> (5) *race window*, now queue_lock is overridden with *driver specified
>> lock*;
>> (6) now unlock *driver specified lock*, not the locked *internal lock*,
>> unlock balance breaks.
>>
>> For the issue above, I think blkcg_init_queue is a bit earlier. We
>> can't allow a further use before request queue is fully initialized.
>> Since blk_init_queue_node is a really common path and it allows driver
>> to override the default internal lock, I'm afraid several other places
>> may also have the same issue.
>> Am I missing something here?
> 
> What code triggered the blkcg_print_blkgs() call? Was it perhaps a function
> related to throttling? If so, I think there are two possible ways to fix this
> race:
Yes, it is from block throttle.

> 1. Moving the .queue_lock initialization from blk_init_queue_node() into
>blk_alloc_queue_node() such that it happens before the blkcg_init_queue()
>call. This however will require a tree-wide change of the blk_alloc_queue()
>and all blk_alloc_queue_node() calls in all block drivers.
> 2. Splitting the blk_throtl_init() function such that the 
> blkcg_activate_policy()
>call can occur after the .queue_lock pointer has been initialized, e.g. 
> from
>inside blk_init_allocated_queue() or from inside blk_register_queue() 
> instead
>of from inside blkcg_init_queue().
> 
At first glance, I'm thinking of moving blkcg_init_queue to after the
queue_lock is overridden. But I don't have an idea yet how to avoid the
risk during cleanup.
Since the initialization of the request queue is so fundamental, I'm not
sure whether the same risk exists in several other places.

Thanks,
Joseph

> I'm not sure what the best approach is. Maybe there is even a third approach
> that is better than the two approaches mentioned above.
> 
> Bart.
> 


[RFC] hard LOCKUP caused by race between blk_init_queue_node and blkcg_print_blkgs

2018-01-30 Thread Joseph Qi
Hi Jens and Folks,

Recently we've gotten a hard LOCKUP issue. After investigating the issue
we've found a race between blk_init_queue_node and blkcg_print_blkgs.
The race is described below.

blk_init_queue_node blkcg_print_blkgs
  blk_alloc_queue_node (1)
q->queue_lock = &q->__queue_lock (2)
blkcg_init_queue(q) (3)
spin_lock_irq(blkg->q->queue_lock) (4)
  q->queue_lock = lock (5)
spin_unlock_irq(blkg->q->queue_lock) (6)

(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize blkcg part of request queue, which will create blkg and
then insert it to blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its
queue lock, here it is the default *internal lock*;
(5) *race window*, now queue_lock is overridden with *driver specified
lock*;
(6) now unlock *driver specified lock*, not the locked *internal lock*,
unlock balance breaks.

For the issue above, I think blkcg_init_queue is called a bit too early.
We can't allow any further use of the queue before the request queue is
fully initialized.
Since blk_init_queue_node is a really common path and it allows the
driver to override the default internal lock, I'm afraid several other
places may also have the same issue.
Am I missing something here?

Thanks,
Joseph


[PATCH RESEND 3/3] blk-throttle: do downgrade/upgrade check when issuing io to lower layer

2018-01-21 Thread Joseph Qi
Currently the downgrade/upgrade check is done when an io first comes to
the block throttle layer. In the writeback case, a large number of ios
will first be throttled in the throttle queue and then dispatched when
the timer is kicked, and they won't be checked again because
REQ_THROTTLED is set. This leads to the low limit not being guaranteed
most of the time.
Fix this case by moving the check logic down to the point where we are
ready to issue the io to the lower layer.

Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei@linux.alibaba.com>
---
 block/blk-throttle.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9c0b5ff..6207554 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1065,8 +1065,6 @@ static void throtl_charge_bio(struct throtl_grp *tg, 
struct bio *bio)
/* Charge the bio to the group */
tg->bytes_disp[rw] += bio_size;
tg->io_disp[rw]++;
-   tg->last_bytes_disp[rw] += bio_size;
-   tg->last_io_disp[rw]++;
 
/*
 * BIO_THROTTLED is used to prevent the same bio to be throttled
@@ -2166,7 +2164,8 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
struct bio *bio)
 {
struct throtl_qnode *qn = NULL;
-   struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
+   struct throtl_grp *orig_tg = blkg_to_tg(blkg ?: q->root_blkg);
+   struct throtl_grp *tg = orig_tg;
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
@@ -2174,11 +2173,11 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
 
WARN_ON_ONCE(!rcu_read_lock_held());
 
+   spin_lock_irq(q->queue_lock);
+
/* see throtl_charge_bio() */
if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
-   goto out;
-
-   spin_lock_irq(q->queue_lock);
+   goto out_unlock;
 
throtl_update_latency_buckets(td);
 
@@ -2194,15 +2193,12 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
while (true) {
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
-   throtl_downgrade_check(tg);
-   throtl_upgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
 
/* if above limits, break to queue */
if (!tg_may_dispatch(tg, bio, NULL)) {
-   tg->last_low_overflow_time[rw] = jiffies;
if (throtl_can_upgrade(td, tg)) {
throtl_upgrade_state(td);
goto again;
@@ -2246,8 +2242,6 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
   tg->io_disp[rw], tg_iops_limit(tg, rw),
   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
-   tg->last_low_overflow_time[rw] = jiffies;
-
td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
throttled = true;
@@ -2264,8 +2258,13 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
}
 
 out_unlock:
+   throtl_downgrade_check(orig_tg);
+   throtl_upgrade_check(orig_tg);
+   if (!throttled) {
+   orig_tg->last_bytes_disp[rw] += throtl_bio_data_size(bio);
+   orig_tg->last_io_disp[rw]++;
+   }
spin_unlock_irq(q->queue_lock);
-out:
bio_set_flag(bio, BIO_THROTTLED);
 
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
-- 
1.9.4


[PATCH RESEND 1/3] blk-throttle: fix NULL pointer dereference risk in throtl_select_dispatch

2018-01-21 Thread Joseph Qi
In throtl_select_dispatch, tg is used first and checked afterwards. Since
tg may be NULL, this is a potential NULL pointer dereference risk, so fix
it by doing the check before the use.

Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei@linux.alibaba.com>
---
 block/blk-throttle.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9052f4e..a0bae2c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1228,7 +1228,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
 
while (1) {
struct throtl_grp *tg = throtl_rb_first(parent_sq);
-   struct throtl_service_queue *sq = &tg->service_queue;
+   struct throtl_service_queue *sq;
 
if (!tg)
break;
@@ -1240,6 +1240,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
 
nr_disp += throtl_dispatch_tg(tg);
 
+   sq = &tg->service_queue;
if (sq->nr_queued[0] || sq->nr_queued[1])
tg_update_disptime(tg);
 
-- 
1.9.4



[PATCH RESEND 2/3] blk-throttle: support only configure leaf nodes

2018-01-21 Thread Joseph Qi
Currently, if we only configure leaf nodes, the downgrade check won't
function properly, because the parent's last low overflow time is
meaningless and always returns now; the parent is thus always treated as
running above its low limit and never has to downgrade.
In this case, since the parent doesn't have a low limit, ignore checking
it and just depend on the child's decision.
Note that we do not support configuring only intermediate nodes.

Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei@linux.alibaba.com>
---
 block/blk-throttle.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a0bae2c..9c0b5ff 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1959,6 +1959,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp 
*tg)
unsigned long now = jiffies;
 
/*
+* If parent doesn't have low limit, only consider child's
+* decision.
+*/
+   if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children) &&
+   !tg->bps[READ][LIMIT_LOW] && !tg->bps[WRITE][LIMIT_LOW] &&
+   !tg->iops[READ][LIMIT_LOW] && !tg->iops[WRITE][LIMIT_LOW])
+   return true;
+
+   /*
 * If cgroup is below low limit, consider downgrade and throttle other
 * cgroups
 */
-- 
1.9.4



Re: [PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device

2018-01-18 Thread Joseph Qi
Hi Mike,

On 18/1/19 11:29, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 10:09pm -0500,
> Joseph Qi <joseph...@linux.alibaba.com> wrote:
> 
>> From: Joseph Qi <joseph...@linux.alibaba.com>
>>
>> DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
>> to mean, the previous initialization in blk_throtl_register_queue is
>> wrong in this case.
>> Fix it by checking and then updating the info during root tg
>> initialization as we don't have a better choice.
>>
>> Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
>> Reviewed-by: Shaohua Li <s...@kernel.org>
>> ---
>>  block/blk-throttle.c | 19 +++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index bf52035..7150f14 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
>>  if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
>>  sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
>>  tg->td = td;
>> +
>> +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>> +/*
>> + * DM device sets QUEUE_FLAG_NONROT after the queue is registered,
>> + * so the previous initialization is wrong in this case. Check and
>> + * update it here.
>> + */
>> +if (blk_queue_nonrot(blkg->q) &&
>> +td->filtered_latency != LATENCY_FILTERED_SSD) {
>> +int i;
>> +
>> +td->throtl_slice = DFL_THROTL_SLICE_SSD;
>> +td->filtered_latency = LATENCY_FILTERED_SSD;
>> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> +td->avg_buckets[READ][i].latency = 0;
>> +td->avg_buckets[WRITE][i].latency = 0;
>> +}
>> +}
>> +#endif
>>  }
>>  
>>  /*
>> -- 
>> 1.9.4
> 
> This should be fixed for 4.16, please see these block tree commits:
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block=fa70d2e2c4a0a54ced98260c6a176cc94c876d27
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block=c100ec49fdd836ff8a17c7bfcc7611d2ee2b
> 
> The last commit's patch header even references the previous submission
> you had for this patch with:
> 
> "These changes also stave off the need to introduce new DM-specific
> workarounds in block core, e.g. this proposal:
> https://patchwork.kernel.org/patch/10067961/;
> 
Yes, now that dm_table_set_restrictions is called before
blk_register_queue, we can make sure the initialization is correct by
checking whether the QUEUE_FLAG_NONROT flag is set. So my patch is no
longer needed.
Jens, please ignore this patch, thanks.

Thanks,
Joseph


[PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device

2018-01-18 Thread Joseph Qi
From: Joseph Qi <joseph...@linux.alibaba.com>

DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
to say, the previous initialization in blk_throtl_register_queue is
wrong in this case.
Fix it by checking and then updating the info during root tg
initialization, as we don't have a better choice.

Signed-off-by: Joseph Qi <joseph...@linux.alibaba.com>
Reviewed-by: Shaohua Li <s...@kernel.org>
---
 block/blk-throttle.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index bf52035..7150f14 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
tg->td = td;
+
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+   /*
+* DM device sets QUEUE_FLAG_NONROT after the queue is registered,
+* so the previous initialization is wrong in this case. Check and
+* update it here.
+*/
+   if (blk_queue_nonrot(blkg->q) &&
+   td->filtered_latency != LATENCY_FILTERED_SSD) {
+   int i;
+
+   td->throtl_slice = DFL_THROTL_SLICE_SSD;
+   td->filtered_latency = LATENCY_FILTERED_SSD;
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   td->avg_buckets[READ][i].latency = 0;
+   td->avg_buckets[WRITE][i].latency = 0;
+   }
+   }
+#endif
 }
 
 /*
-- 
1.9.4


Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2018-01-18 Thread Joseph Qi
Sure, I will resend it.
Thanks.

Joseph

On 18/1/19 10:52, Jens Axboe wrote:
> On 1/18/18 7:11 PM, Joseph Qi wrote:
>> Hi Jens,
>> Could you please pick the two pending patches for 4.16?
>> They all have been reviewed by Shaohua.
> 
> Sorry, I guess I forgot about this series. I picked up 1/2,
> for some reason I don't have 2/2 and I can't find it on
> linux-block either. The thread only shows the first patch.
> 
> Can you resend 2/2?
> 


Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2018-01-18 Thread Joseph Qi
Hi Jens,
Could you please pick the two pending patches for 4.16?
They all have been reviewed by Shaohua.

Thanks,
Joseph

On 18/1/8 20:05, Joseph Qi wrote:
> A polite ping for the two pending patches...
> 
> Thanks,
> Joseph
> 
> On 17/11/24 13:13, Jens Axboe wrote:
>> On 11/23/2017 06:31 PM, Joseph Qi wrote:
>>> Hi Jens,
>>> Could you please give your advice for the two patches or pick them up if
>>> you think they are good?
>>
>> It looks OK to me, but my preference would be to push this until
>> 4.16.
>>


Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2018-01-08 Thread Joseph Qi
A polite ping for the two pending patches...

Thanks,
Joseph

On 17/11/24 13:13, Jens Axboe wrote:
> On 11/23/2017 06:31 PM, Joseph Qi wrote:
>> Hi Jens,
>> Could you please give your advice for the two patches or pick them up if
>> you think they are good?
> 
> It looks OK to me, but my preference would be to push this until
> 4.16.
> 


[PATCH 2/3] blk-throttle: support only configure leaf nodes

2018-01-04 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

Currently, if we only configure leaf nodes, the downgrade check won't
function properly, because the parent's last low overflow time is
meaningless and always returns now; the parent is thus always treated as
running above its low limit and never has to downgrade.
In this case, since the parent doesn't have a low limit, ignore checking
it and just depend on the child's decision.
Note that we do not support configuring only intermediate nodes.

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a0bae2c..9c0b5ff 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1959,6 +1959,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp 
*tg)
unsigned long now = jiffies;
 
/*
+* If parent doesn't have low limit, only consider child's
+* decision.
+*/
+   if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children) &&
+   !tg->bps[READ][LIMIT_LOW] && !tg->bps[WRITE][LIMIT_LOW] &&
+   !tg->iops[READ][LIMIT_LOW] && !tg->iops[WRITE][LIMIT_LOW])
+   return true;
+
+   /*
 * If cgroup is below low limit, consider downgrade and throttle other
 * cgroups
 */
-- 
1.9.4


[PATCH 1/3] blk-throttle: fix NULL pointer dereference risk in throtl_select_dispatch

2018-01-04 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

In throtl_select_dispatch, tg is used first and checked afterwards. Since
tg may be NULL, this is a potential NULL pointer dereference risk, so fix
it by doing the check before the use.

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9052f4e..a0bae2c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1228,7 +1228,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
 
while (1) {
struct throtl_grp *tg = throtl_rb_first(parent_sq);
-   struct throtl_service_queue *sq = &tg->service_queue;
+   struct throtl_service_queue *sq;
 
if (!tg)
break;
@@ -1240,6 +1240,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
 
nr_disp += throtl_dispatch_tg(tg);
 
+   sq = &tg->service_queue;
if (sq->nr_queued[0] || sq->nr_queued[1])
tg_update_disptime(tg);
 
-- 
1.9.4


Re: [RFC] distinguish foreground and background IOs in block throttle

2017-12-27 Thread Joseph Qi
Hi Paolo,

On 17/12/27 20:36, Paolo Valente wrote:
> 
> 
>> On 25 Dec 2017, at 03:44, xuejiufei wrote:
>>
>> Hi all,
>>
>> Cgroup writeback is supported since v4.2. I found there exists a
>> problem in the following case.
>>
>> A cgroup may send both buffer and direct/sync IOs. The foreground
>> thread will be stalled when periodic writeback IOs is flushed because
>> the service queue already has a plenty of writeback IOs, then
>> foreground IOs should be enqueued with its FIFO policy.
>>
>> I wonder if we can distinguish foreground and background IOs in block
>> throttle to fix the above problem.
>>
>> Any suggestion are always appreciated.
>>
> 
> Hi,
> to address similar issues, I have just sent a patch [1] for the BFQ
> I/O scheduler.  If you want to give it a try, it might solve, or at
> least mitigate, your problem (the patch does not involve groups,
> though, at least for the moment).
> 
> There are still pending patches related to the low_latency mode of
> BFQ, so I suggest you to try with low_latency disabled, i.e., as root:
> echo bfq > /sys/block/<dev>/queue/scheduler
> echo 0 > /sys/block/<dev>/queue/iosched/low_latency
> 
> For your possible convenience, I have attached the patch, gzipped, to
> this email too.
> Thanks,
> Paolo
> 
> [1] https://www.spinics.net/lists/kernel/msg2684463.html
> 
I don't get why the issue Jiufei described is related to the scheduler.

IMO, the core reason is that currently we only have read/write queues in
block throttle. That means all sync and async writes go into the same
queue. Once writeback IOs are flushed out and throttling happens, sync
writes also have to queue up. Since there are many async writes ahead of
the sync writes in the queue, and the current policy dispatches 6 reads
and 2 writes during each round, sync writes get significantly delayed,
which we don't expect.

So, like the read/write queue design, we may add another queue in block
throttle to distinguish sync from async writes, and then dispatch more
sync writes than async writes.
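
For illustration only, something along these lines (the queue names and
the per-round quanta below are assumptions for the sketch, not existing
blk-throttle code):

enum {
	THROTL_READ,
	THROTL_SYNC_WRITE,
	THROTL_ASYNC_WRITE,
	THROTL_NR_QUEUES,
};

/* bios allowed to dispatch from each queue per round */
static const unsigned int throtl_quantum[THROTL_NR_QUEUES] = {
	[THROTL_READ]		= 6,
	[THROTL_SYNC_WRITE]	= 4,
	[THROTL_ASYNC_WRITE]	= 2,
};

static inline int throtl_queue_index(struct bio *bio)
{
	if (bio_data_dir(bio) == READ)
		return THROTL_READ;
	return op_is_sync(bio->bi_opf) ? THROTL_SYNC_WRITE :
					 THROTL_ASYNC_WRITE;
}

That way a throttled sync write only has to wait behind other sync
writes, instead of behind the whole writeback backlog.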

Thanks,
Joseph


Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2017-11-23 Thread Joseph Qi
Hi Jens,
Could you please give your advice for the two patches or pick them up if
you think they are good?

Thanks,
Joseph

On 17/11/21 09:38, Joseph Qi wrote:
> From: Joseph Qi <qijiang...@alibaba-inc.com>
> 
> In mixed read/write workload on SSD, write latency is much lower than
> read. But now we only track and record read latency and then use it as
> threshold base for both read and write io latency accounting. As a
> result, write io latency will always be considered as good and
> bad_bio_cnt is much smaller than 20% of bio_cnt. That is to mean, the
> tg to be checked will be treated as idle most of the time and still let
> others dispatch more ios, even it is truly running under low limit and
> wants its low limit to be guaranteed, which is not we expected in fact.
> So track read and write request individually, which can bring more
> precise latency control for low limit idle detection.
> 
> Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
> Reviewed-by: Shaohua Li <s...@fb.com>
> ---
>  block/blk-throttle.c | 134 
> ++-
>  1 file changed, 79 insertions(+), 55 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 96ad326..bf52035 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -216,9 +216,9 @@ struct throtl_data
>  
>   unsigned int scale;
>  
> - struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
> - struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
> - struct latency_bucket __percpu *latency_buckets;
> + struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
> + struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
> + struct latency_bucket __percpu *latency_buckets[2];
>   unsigned long last_calculate_time;
>   unsigned long filtered_latency;
>  
> @@ -2041,10 +2041,10 @@ static void blk_throtl_update_idletime(struct 
> throtl_grp *tg)
>  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>  static void throtl_update_latency_buckets(struct throtl_data *td)
>  {
> - struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
> - int i, cpu;
> - unsigned long last_latency = 0;
> - unsigned long latency;
> + struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
> + int i, cpu, rw;
> + unsigned long last_latency[2] = { 0 };
> + unsigned long latency[2];
>  
>   if (!blk_queue_nonrot(td->queue))
>   return;
> @@ -2053,56 +2053,67 @@ static void throtl_update_latency_buckets(struct 
> throtl_data *td)
>   td->last_calculate_time = jiffies;
>  
>   memset(avg_latency, 0, sizeof(avg_latency));
> - for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> - struct latency_bucket *tmp = &td->tmp_buckets[i];
> -
> - for_each_possible_cpu(cpu) {
> - struct latency_bucket *bucket;
> -
> - /* this isn't race free, but ok in practice */
> - bucket = per_cpu_ptr(td->latency_buckets, cpu);
> - tmp->total_latency += bucket[i].total_latency;
> - tmp->samples += bucket[i].samples;
> - bucket[i].total_latency = 0;
> - bucket[i].samples = 0;
> - }
> + for (rw = READ; rw <= WRITE; rw++) {
> + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> + struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
> +
> + for_each_possible_cpu(cpu) {
> + struct latency_bucket *bucket;
> +
> + /* this isn't race free, but ok in practice */
> + bucket = per_cpu_ptr(td->latency_buckets[rw],
> + cpu);
> + tmp->total_latency += bucket[i].total_latency;
> + tmp->samples += bucket[i].samples;
> + bucket[i].total_latency = 0;
> + bucket[i].samples = 0;
> + }
>  
> - if (tmp->samples >= 32) {
> - int samples = tmp->samples;
> + if (tmp->samples >= 32) {
> + int samples = tmp->samples;
>  
> - latency = tmp->total_latency;
> + latency[rw] = tmp->total_latency;
>  
> - tmp->total_latency = 0;
> - tmp->samples = 0;
> - latency /= samples;
> - if (latency == 0)
> - 

Re: [PATCH v2 2/2] blk-throttle: fix wrong initialization in case of dm device

2017-11-20 Thread Joseph Qi


On 17/11/21 01:15, Shaohua Li wrote:
> On Mon, Nov 20, 2017 at 10:00:27AM +0800, Joseph Qi wrote:
>> From: Joseph Qi <qijiang...@alibaba-inc.com>
>>
>> dm device set QUEUE_FLAG_NONROT in resume, which is after register
>> queue. That is to mean, the previous initialization in
>> blk_throtl_register_queue is wrong in this case.
>> Fix it by checking and then updating the info during root tg
>> initialization as we don't have a better choice.
>>
>> Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
>> ---
>>  block/blk-throttle.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index bf52035..6d6b220 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -541,6 +541,23 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
>>  if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
>>  sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
>>  tg->td = td;
>> +
>> +/*
>> + * dm device set QUEUE_FLAG_NONROT in resume, which is after resister
>> + * queue, so the previous initialization is wrong in this case. Check
>> + * and update it here.
>> + */
>> +if (blk_queue_nonrot(blkg->q) &&
>> +td->filtered_latency != LATENCY_FILTERED_SSD) {
>> +int i;
>> +
>> +td->throtl_slice = DFL_THROTL_SLICE_SSD;
> 
> if CONFIG_BLK_DEV_THROTTLING_LOW isn't not set, we use old slice, can you do
> the same thing here? Otherwise,
> Reviewed-by: Shaohua Li <s...@kernel.org>
> 
Sure, I will update and resend it. Thanks for pointing this out.

Thanks,
Joseph

>> +td->filtered_latency = LATENCY_FILTERED_SSD;
>> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> +td->avg_buckets[READ][i].latency = 0;
>> +td->avg_buckets[WRITE][i].latency = 0;
>> +}
>> +}
>>  }
>>  
>>  /*
>> -- 
>> 1.9.4


Re: [PATCH v2 2/2] blk-throttle: fix wrong initialization in case of dm device

2017-11-20 Thread Joseph Qi
Hi Mike,

On 17/11/21 05:02, Mike Snitzer wrote:
> On Sun, Nov 19, 2017 at 9:00 PM, Joseph Qi <jiangqi...@gmail.com> wrote:
>> From: Joseph Qi <qijiang...@alibaba-inc.com>
>>
>> dm device set QUEUE_FLAG_NONROT in resume, which is after register
>> queue. That is to mean, the previous initialization in
>> blk_throtl_register_queue is wrong in this case.
>> Fix it by checking and then updating the info during root tg
>> initialization as we don't have a better choice.
> 
> Given DM motivated this change, curious why you didn't send this to dm-devel?
> 
Since this is io.low specifically, I don't want the change in device
mapper layer. But I do agree with you that I have to cc dm-devel as
well. Sorry for missing that.

> In any case, not sure why you need to reference "resume".  Very few
> people will appreciate that detail.
> 
> Better to just say: DM device sets QUEUE_FLAG_NONROT after the queue
> is registered.
> 
Thanks for your advice. I will update and resend it.

Thanks,
Joseph

>> Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
>> ---
>>  block/blk-throttle.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index bf52035..6d6b220 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -541,6 +541,23 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
>> if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
>> sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
>> tg->td = td;
>> +
>> +   /*
>> +* dm device set QUEUE_FLAG_NONROT in resume, which is after resister
>> +* queue, so the previous initialization is wrong in this case. Check
>> +* and update it here.
>> +*/
> 
> typo: s/resister/register/
> But again, best to say:
> DM device sets QUEUE_FLAG_NONROT after the queue is registered, ...
> 
> Also, an alternative to your patch would be to have DM's
> dm_table_set_restrictions() call a blk-throttle function to setup
> blk-throttle after it is done setting queue flags?
> Saves all other drivers from having to run this 2 stage initialization
> code.. just a thought.
> 
> Mike
> 


[PATCH v2 2/2] blk-throttle: fix wrong initialization in case of dm device

2017-11-19 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

dm device sets QUEUE_FLAG_NONROT in resume, which is after registering
the queue. That is to say, the previous initialization in
blk_throtl_register_queue is wrong in this case.
Fix it by checking and then updating the info during root tg
initialization, as we don't have a better choice.

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index bf52035..6d6b220 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -541,6 +541,23 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
tg->td = td;
+
+   /*
+* dm device set QUEUE_FLAG_NONROT in resume, which is after resister
+* queue, so the previous initialization is wrong in this case. Check
+* and update it here.
+*/
+   if (blk_queue_nonrot(blkg->q) &&
+   td->filtered_latency != LATENCY_FILTERED_SSD) {
+   int i;
+
+   td->throtl_slice = DFL_THROTL_SLICE_SSD;
+   td->filtered_latency = LATENCY_FILTERED_SSD;
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   td->avg_buckets[READ][i].latency = 0;
+   td->avg_buckets[WRITE][i].latency = 0;
+   }
+   }
 }
 
 /*
-- 
1.9.4


[PATCH v2 RESEND 1/2] blk-throttle: track read and write request individually

2017-11-19 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

In a mixed read/write workload on SSD, write latency is much lower than
read latency. But currently we only track and record read latency and
then use it as the threshold base for both read and write io latency
accounting. As a result, write io latency will almost always be
considered good and bad_bio_cnt is much smaller than 20% of bio_cnt.
That is to say, the tg being checked will be treated as idle most of the
time and will keep letting others dispatch more ios, even when it is
truly running under its low limit and wants that low limit to be
guaranteed, which is not what we expect.
So track read and write requests individually, which brings more precise
latency control for low limit idle detection.

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
Reviewed-by: Shaohua Li <s...@fb.com>
---
 block/blk-throttle.c | 134 ++-
 1 file changed, 79 insertions(+), 55 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 96ad326..bf52035 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -216,9 +216,9 @@ struct throtl_data
 
unsigned int scale;
 
-   struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
-   struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
-   struct latency_bucket __percpu *latency_buckets;
+   struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
+   struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
+   struct latency_bucket __percpu *latency_buckets[2];
unsigned long last_calculate_time;
unsigned long filtered_latency;
 
@@ -2041,10 +2041,10 @@ static void blk_throtl_update_idletime(struct 
throtl_grp *tg)
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 static void throtl_update_latency_buckets(struct throtl_data *td)
 {
-   struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
-   int i, cpu;
-   unsigned long last_latency = 0;
-   unsigned long latency;
+   struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
+   int i, cpu, rw;
+   unsigned long last_latency[2] = { 0 };
+   unsigned long latency[2];
 
if (!blk_queue_nonrot(td->queue))
return;
@@ -2053,56 +2053,67 @@ static void throtl_update_latency_buckets(struct 
throtl_data *td)
td->last_calculate_time = jiffies;
 
memset(avg_latency, 0, sizeof(avg_latency));
-   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
-   struct latency_bucket *tmp = &td->tmp_buckets[i];
-
-   for_each_possible_cpu(cpu) {
-   struct latency_bucket *bucket;
-
-   /* this isn't race free, but ok in practice */
-   bucket = per_cpu_ptr(td->latency_buckets, cpu);
-   tmp->total_latency += bucket[i].total_latency;
-   tmp->samples += bucket[i].samples;
-   bucket[i].total_latency = 0;
-   bucket[i].samples = 0;
-   }
+   for (rw = READ; rw <= WRITE; rw++) {
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
+
+   for_each_possible_cpu(cpu) {
+   struct latency_bucket *bucket;
+
+   /* this isn't race free, but ok in practice */
+   bucket = per_cpu_ptr(td->latency_buckets[rw],
+   cpu);
+   tmp->total_latency += bucket[i].total_latency;
+   tmp->samples += bucket[i].samples;
+   bucket[i].total_latency = 0;
+   bucket[i].samples = 0;
+   }
 
-   if (tmp->samples >= 32) {
-   int samples = tmp->samples;
+   if (tmp->samples >= 32) {
+   int samples = tmp->samples;
 
-   latency = tmp->total_latency;
+   latency[rw] = tmp->total_latency;
 
-   tmp->total_latency = 0;
-   tmp->samples = 0;
-   latency /= samples;
-   if (latency == 0)
-   continue;
-   avg_latency[i].latency = latency;
+   tmp->total_latency = 0;
+   tmp->samples = 0;
+   latency[rw] /= samples;
+   if (latency[rw] == 0)
+   continue;
+   avg_latency[rw][i].latency = latency[rw];
+   }
}
}
 
-   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
-   if (

[PATCH] blk-throttle: fix wrong initialization in case of dm device

2017-11-05 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

dm device sets QUEUE_FLAG_NONROT in resume, which is after registering
the queue. That is to say, the previous initialization in
blk_throtl_register_queue is wrong in this case.
Fix it by checking and then updating the info during root tg
initialization, as we don't have a better choice.

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 17816a0..6aeb79e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -539,6 +539,21 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
sq->parent_sq = &td->service_queue;
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
+
+   /*
+* dm device set QUEUE_FLAG_NONROT in resume, which is after resister
+* queue, so the previous initialization is wrong in this case. Check
+* and update it here.
+*/
+   if (blk_queue_nonrot(blkg->q) &&
+   td->filtered_latency != LATENCY_FILTERED_SSD) {
+   int i;
+
+   td->throtl_slice = DFL_THROTL_SLICE_SSD;
+   td->filtered_latency = LATENCY_FILTERED_SSD;
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++)
+   td->avg_buckets[i].latency = 0;
+   }
tg->td = td;
 }
 
-- 
1.9.4


Re: [PATCH v2] blk-throttle: track read and write request individually

2017-10-22 Thread Joseph Qi
Hi Jens,
Could you please pick this patch?

Thanks,
Joseph

On 17/10/13 01:13, Shaohua Li wrote:
> On Thu, Oct 12, 2017 at 05:15:46PM +0800, Joseph Qi wrote:
>> From: Joseph Qi <qijiang...@alibaba-inc.com>
>>
>> In mixed read/write workload on SSD, write latency is much lower than
>> read. But now we only track and record read latency and then use it as
>> threshold base for both read and write io latency accounting. As a
>> result, write io latency will always be considered as good and
>> bad_bio_cnt is much smaller than 20% of bio_cnt. That is to mean, the
>> tg to be checked will be treated as idle most of the time and still let
>> others dispatch more ios, even it is truly running under low limit and
>> wants its low limit to be guaranteed, which is not we expected in fact.
>> So track read and write request individually, which can bring more
>> precise latency control for low limit idle detection.
>>
>> Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
> 
> Reviewed-by: Shaohua Li <s...@fb.com>
>> ---
>>  block/blk-throttle.c | 134 
>> ++-
>>  1 file changed, 79 insertions(+), 55 deletions(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index 17816a0..0897378 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -215,9 +215,9 @@ struct throtl_data
>>  
>>  unsigned int scale;
>>  
>> -struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
>> -struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
>> -struct latency_bucket __percpu *latency_buckets;
>> +struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
>> +struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
>> +struct latency_bucket __percpu *latency_buckets[2];
>>  unsigned long last_calculate_time;
>>  unsigned long filtered_latency;
>>  
>> @@ -2040,10 +2040,10 @@ static void blk_throtl_update_idletime(struct 
>> throtl_grp *tg)
>>  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>>  static void throtl_update_latency_buckets(struct throtl_data *td)
>>  {
>> -struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
>> -int i, cpu;
>> -unsigned long last_latency = 0;
>> -unsigned long latency;
>> +struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
>> +int i, cpu, rw;
>> +unsigned long last_latency[2] = { 0 };
>> +unsigned long latency[2];
>>  
>>  if (!blk_queue_nonrot(td->queue))
>>  return;
>> @@ -2052,56 +2052,67 @@ static void throtl_update_latency_buckets(struct 
>> throtl_data *td)
>>  td->last_calculate_time = jiffies;
>>  
>>  memset(avg_latency, 0, sizeof(avg_latency));
>> -for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> -struct latency_bucket *tmp = &td->tmp_buckets[i];
>> -
>> -for_each_possible_cpu(cpu) {
>> -struct latency_bucket *bucket;
>> -
>> -/* this isn't race free, but ok in practice */
>> -bucket = per_cpu_ptr(td->latency_buckets, cpu);
>> -tmp->total_latency += bucket[i].total_latency;
>> -tmp->samples += bucket[i].samples;
>> -bucket[i].total_latency = 0;
>> -bucket[i].samples = 0;
>> -}
>> +for (rw = READ; rw <= WRITE; rw++) {
>> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> +struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
>> +
>> +for_each_possible_cpu(cpu) {
>> +struct latency_bucket *bucket;
>> +
>> +/* this isn't race free, but ok in practice */
>> +bucket = per_cpu_ptr(td->latency_buckets[rw],
>> +cpu);
>> +tmp->total_latency += bucket[i].total_latency;
>> +tmp->samples += bucket[i].samples;
>> +bucket[i].total_latency = 0;
>> +bucket[i].samples = 0;
>> +}
>>  
>> -if (tmp->samples >= 32) {
>> -int samples = tmp->samples;
>> +if (tmp->samples >= 32) {
>> +int samples = tmp->samples;
>>  
>> -laten

[PATCH v2] blk-throttle: track read and write request individually

2017-10-12 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

In a mixed read/write workload on SSD, write latency is much lower than
read latency. But currently we only track and record read latency and
then use it as the threshold base for both read and write io latency
accounting. As a result, write io latency will always be considered
good and bad_bio_cnt stays far below 20% of bio_cnt. That means the tg
being checked is treated as idle most of the time and keeps letting
other groups dispatch more ios, even when it is actually running under
its low limit and wants that limit to be guaranteed, which is not what
we expect. So track read and write requests individually, which brings
more precise latency control for low limit idle detection.
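
As a rough illustration, the following standalone userspace model (the
workload mix and the one-in-three slow reads are invented, this is not
kernel code) applies the same bad_bio_cnt * 5 < bio_cnt test that
throtl_tg_is_idle() uses. With combined accounting the writes dilute
the bad reads and the group looks idle; counting reads on their own,
the same workload is clearly not idle.

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	unsigned long bio_cnt = 0, bad_bio_cnt = 0;	/* combined, as today */
	unsigned long rd_cnt = 0, rd_bad = 0;		/* read-only accounting */
	int i;

	/*
	 * 1000 bios, 50/50 read/write.  Every third read misses its latency
	 * target; writes never do, because they are far faster on this SSD.
	 */
	for (i = 0; i < 1000; i++) {
		bool is_read = (i & 1);
		bool is_bad  = is_read && ((i / 2) % 3 == 0);

		bio_cnt++;
		bad_bio_cnt += is_bad;
		if (is_read) {
			rd_cnt++;
			rd_bad += is_bad;
		}
	}

	printf("combined:  bad*5=%lu vs cnt=%lu -> %s\n",
	       bad_bio_cnt * 5, bio_cnt,
	       bad_bio_cnt * 5 < bio_cnt ? "idle" : "not idle");
	printf("read-only: bad*5=%lu vs cnt=%lu -> %s\n",
	       rd_bad * 5, rd_cnt,
	       rd_bad * 5 < rd_cnt ? "idle" : "not idle");
	return 0;
}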

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 134 ++-
 1 file changed, 79 insertions(+), 55 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 17816a0..0897378 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -215,9 +215,9 @@ struct throtl_data
 
unsigned int scale;
 
-   struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
-   struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
-   struct latency_bucket __percpu *latency_buckets;
+   struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
+   struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
+   struct latency_bucket __percpu *latency_buckets[2];
unsigned long last_calculate_time;
unsigned long filtered_latency;
 
@@ -2040,10 +2040,10 @@ static void blk_throtl_update_idletime(struct 
throtl_grp *tg)
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 static void throtl_update_latency_buckets(struct throtl_data *td)
 {
-   struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
-   int i, cpu;
-   unsigned long last_latency = 0;
-   unsigned long latency;
+   struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
+   int i, cpu, rw;
+   unsigned long last_latency[2] = { 0 };
+   unsigned long latency[2];
 
if (!blk_queue_nonrot(td->queue))
return;
@@ -2052,56 +2052,67 @@ static void throtl_update_latency_buckets(struct 
throtl_data *td)
td->last_calculate_time = jiffies;
 
memset(avg_latency, 0, sizeof(avg_latency));
-   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
-   struct latency_bucket *tmp = &td->tmp_buckets[i];
-
-   for_each_possible_cpu(cpu) {
-   struct latency_bucket *bucket;
-
-   /* this isn't race free, but ok in practice */
-   bucket = per_cpu_ptr(td->latency_buckets, cpu);
-   tmp->total_latency += bucket[i].total_latency;
-   tmp->samples += bucket[i].samples;
-   bucket[i].total_latency = 0;
-   bucket[i].samples = 0;
-   }
+   for (rw = READ; rw <= WRITE; rw++) {
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
+
+   for_each_possible_cpu(cpu) {
+   struct latency_bucket *bucket;
+
+   /* this isn't race free, but ok in practice */
+   bucket = per_cpu_ptr(td->latency_buckets[rw],
+   cpu);
+   tmp->total_latency += bucket[i].total_latency;
+   tmp->samples += bucket[i].samples;
+   bucket[i].total_latency = 0;
+   bucket[i].samples = 0;
+   }
 
-   if (tmp->samples >= 32) {
-   int samples = tmp->samples;
+   if (tmp->samples >= 32) {
+   int samples = tmp->samples;
 
-   latency = tmp->total_latency;
+   latency[rw] = tmp->total_latency;
 
-   tmp->total_latency = 0;
-   tmp->samples = 0;
-   latency /= samples;
-   if (latency == 0)
-   continue;
-   avg_latency[i].latency = latency;
+   tmp->total_latency = 0;
+   tmp->samples = 0;
+   latency[rw] /= samples;
+   if (latency[rw] == 0)
+   continue;
+   avg_latency[rw][i].latency = latency[rw];
+   }
}
}
 
-   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
-   if (!avg_latency[i].latency) {
-  

Re: [PATCH] blk-throttle: track read and write request individually

2017-10-11 Thread Joseph Qi


On 17/10/12 03:21, Shaohua Li wrote:
> On Wed, Oct 11, 2017 at 05:53:34PM +0800, Joseph Qi wrote:
>> From: Joseph Qi <qijiang...@alibaba-inc.com>
>>
>> In mixed read/write workload on SSD, write latency is much lower than
>> read. But now we only track and record read latency and then use it as
>> threshold base for both read and write io latency accounting. As a
>> result, write io latency will always be considered as good and
>> bad_bio_cnt is much smaller than 20% of bio_cnt. That is to mean, the
>> tg to be checked will be treated as idle most of the time and still let
>> others dispatch more ios, even it is truly running under low limit and
>> wants its low limit to be guaranteed, which is not we expected in fact.
>> So track read and write request individually, which can bring more
>> precise latency control for low limit idle detection.
> 
> Looks good. Originally I thought we don't need very precise latency control, 
> so
> the read/write base latency difference doesn't matter too much, but tracking
> write base latency shouldn't harm.
>  
>> Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
>> ---
>>  block/blk-throttle.c | 134 
>> ++-
>>  1 file changed, 79 insertions(+), 55 deletions(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index 17816a0..0897378 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -215,9 +215,9 @@ struct throtl_data
>>  
>>  unsigned int scale;
>>  
>> -struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
>> -struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
>> -struct latency_bucket __percpu *latency_buckets;
>> +struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
>> +struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
>> +struct latency_bucket __percpu *latency_buckets[2];
>>  unsigned long last_calculate_time;
>>  unsigned long filtered_latency;
>>  
>> @@ -2040,10 +2040,10 @@ static void blk_throtl_update_idletime(struct 
>> throtl_grp *tg)
>>  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>>  static void throtl_update_latency_buckets(struct throtl_data *td)
>>  {
>> -struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
>> -int i, cpu;
>> -unsigned long last_latency = 0;
>> -unsigned long latency;
>> +struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
>> +int i, cpu, rw;
>> +unsigned long last_latency[2] = { 0 };
>> +unsigned long latency[2];
>>  
>>  if (!blk_queue_nonrot(td->queue))
>>  return;
>> @@ -2052,56 +2052,67 @@ static void throtl_update_latency_buckets(struct 
>> throtl_data *td)
>>  td->last_calculate_time = jiffies;
>>  
>>  memset(avg_latency, 0, sizeof(avg_latency));
>> -for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> -struct latency_bucket *tmp = &td->tmp_buckets[i];
>> -
>> -for_each_possible_cpu(cpu) {
>> -struct latency_bucket *bucket;
>> -
>> -/* this isn't race free, but ok in practice */
>> -bucket = per_cpu_ptr(td->latency_buckets, cpu);
>> -tmp->total_latency += bucket[i].total_latency;
>> -tmp->samples += bucket[i].samples;
>> -bucket[i].total_latency = 0;
>> -bucket[i].samples = 0;
>> -}
>> +for (rw = READ; rw <= WRITE; rw++) {
>> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> +struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
>> +
>> +for_each_possible_cpu(cpu) {
>> +struct latency_bucket *bucket;
>> +
>> +/* this isn't race free, but ok in practice */
>> +bucket = per_cpu_ptr(td->latency_buckets[rw],
>> +cpu);
>> +tmp->total_latency += bucket[i].total_latency;
>> +tmp->samples += bucket[i].samples;
>> +bucket[i].total_latency = 0;
>> +bucket[i].samples = 0;
>> +}
>>  
>> -if (tmp->samples >= 32) {
>> -int samples = tmp->samples;
>> +if (tmp->samples >= 32) {
>> + 

[PATCH] blk-throttle: track read and write request individually

2017-10-11 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

In a mixed read/write workload on SSD, write latency is much lower than
read latency. But currently we only track and record read latency and
then use it as the threshold base for both read and write io latency
accounting. As a result, write io latency will always be considered
good and bad_bio_cnt stays far below 20% of bio_cnt. That means the tg
being checked is treated as idle most of the time and keeps letting
other groups dispatch more ios, even when it is actually running under
its low limit and wants that limit to be guaranteed, which is not what
we expect. So track read and write requests individually, which brings
more precise latency control for low limit idle detection.

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 134 ++-
 1 file changed, 79 insertions(+), 55 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 17816a0..0897378 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -215,9 +215,9 @@ struct throtl_data
 
unsigned int scale;
 
-   struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
-   struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
-   struct latency_bucket __percpu *latency_buckets;
+   struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
+   struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
+   struct latency_bucket __percpu *latency_buckets[2];
unsigned long last_calculate_time;
unsigned long filtered_latency;
 
@@ -2040,10 +2040,10 @@ static void blk_throtl_update_idletime(struct 
throtl_grp *tg)
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 static void throtl_update_latency_buckets(struct throtl_data *td)
 {
-   struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
-   int i, cpu;
-   unsigned long last_latency = 0;
-   unsigned long latency;
+   struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
+   int i, cpu, rw;
+   unsigned long last_latency[2] = { 0 };
+   unsigned long latency[2];
 
if (!blk_queue_nonrot(td->queue))
return;
@@ -2052,56 +2052,67 @@ static void throtl_update_latency_buckets(struct 
throtl_data *td)
td->last_calculate_time = jiffies;
 
memset(avg_latency, 0, sizeof(avg_latency));
-   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
-   struct latency_bucket *tmp = &td->tmp_buckets[i];
-
-   for_each_possible_cpu(cpu) {
-   struct latency_bucket *bucket;
-
-   /* this isn't race free, but ok in practice */
-   bucket = per_cpu_ptr(td->latency_buckets, cpu);
-   tmp->total_latency += bucket[i].total_latency;
-   tmp->samples += bucket[i].samples;
-   bucket[i].total_latency = 0;
-   bucket[i].samples = 0;
-   }
+   for (rw = READ; rw <= WRITE; rw++) {
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
+
+   for_each_possible_cpu(cpu) {
+   struct latency_bucket *bucket;
+
+   /* this isn't race free, but ok in practice */
+   bucket = per_cpu_ptr(td->latency_buckets[rw],
+   cpu);
+   tmp->total_latency += bucket[i].total_latency;
+   tmp->samples += bucket[i].samples;
+   bucket[i].total_latency = 0;
+   bucket[i].samples = 0;
+   }
 
-   if (tmp->samples >= 32) {
-   int samples = tmp->samples;
+   if (tmp->samples >= 32) {
+   int samples = tmp->samples;
 
-   latency = tmp->total_latency;
+   latency[rw] = tmp->total_latency;
 
-   tmp->total_latency = 0;
-   tmp->samples = 0;
-   latency /= samples;
-   if (latency == 0)
-   continue;
-   avg_latency[i].latency = latency;
+   tmp->total_latency = 0;
+   tmp->samples = 0;
+   latency[rw] /= samples;
+   if (latency[rw] == 0)
+   continue;
+   avg_latency[rw][i].latency = latency[rw];
+   }
}
}
 
-   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
-   if (!avg_latency[i].latency) {
-  

[PATCH v2] blk-throttle: fix possible io stall when upgrade to max

2017-09-30 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

There is a case which will lead to io stall. The case is described as
follows. 
/test1
  |-subtest1
/test2
  |-subtest2
And subtest1 and subtest2 each has 32 queued bios already.

Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
bios as follows:
1) tg=subtest1, do nothing;
2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
left, no need to schedule next dispatch;
3) tg=subtest2, do nothing;
4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
left, no need to schedule next dispatch;
5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
to /; note that test1 and test2 each still has 16 queued bios left;
6) tg=/, try to schedule the next dispatch, but since disptime is now
(updated in tg_update_disptime with wait=0), the pending timer is not
actually scheduled;
7) so throtl_upgrade_state dispatches 32 queued bios in total, with 32
left over: test1 and test2 each still has 16 queued bios;
8) throtl_pending_timer_fn sees the leftover bios but can do nothing,
because throtl_select_dispatch returns 0 and test1/test2 have no
pending tg.

The blktrace shows the following:
8,32   00 2.539007641 0  m   N throtl upgrade to max
8,32   00 2.539072267 0  m   N throtl /test2 dispatch 
nr_queued=16 read=0 write=16
8,32   70 2.539077142 0  m   N throtl /test1 dispatch 
nr_queued=16 read=0 write=16

So force the dispatch to be scheduled if there are pending children.
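
For reference, below is a minimal userspace sketch of the scheduling
decision this change affects (paraphrased and heavily simplified, not
the upstream function): without force, a parent whose earliest pending
disptime is not in the future arms no timer, which is step 6) above;
with force, the pending timer is armed regardless, so the bios left
behind in step 7) are picked up by throtl_pending_timer_fn later.

#include <stdbool.h>
#include <stdio.h>

struct service_queue {
	unsigned int nr_pending;		/* pending child groups */
	unsigned long first_pending_disptime;	/* earliest child disptime */
};

static bool schedule_next_dispatch(struct service_queue *sq,
				   unsigned long now, bool force)
{
	if (!sq->nr_pending)
		return true;		/* nothing queued below this sq */

	if (force || sq->first_pending_disptime > now) {
		printf("arm pending timer for %lu\n", sq->first_pending_disptime);
		return true;
	}
	/* disptime already passed: the caller keeps dispatching itself */
	return false;
}

int main(void)
{
	struct service_queue sq = { .nr_pending = 1, .first_pending_disptime = 100 };
	unsigned long now = 100;	/* disptime == jiffies, as in step 6) */

	printf("force=false -> %s\n",
	       schedule_next_dispatch(&sq, now, false) ? "timer armed" : "no timer");
	printf("force=true  -> %s\n",
	       schedule_next_dispatch(&sq, now, true) ? "timer armed" : "no timer");
	return 0;
}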

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0fea76a..17816a0 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1911,11 +1911,11 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
tg->disptime = jiffies - 1;
throtl_select_dispatch(sq);
-   throtl_schedule_next_dispatch(sq, false);
+   throtl_schedule_next_dispatch(sq, true);
}
rcu_read_unlock();
throtl_select_dispatch(&td->service_queue);
-   throtl_schedule_next_dispatch(&td->service_queue, false);
+   throtl_schedule_next_dispatch(&td->service_queue, true);
queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
-- 
1.9.4


Re: [PATCH] blk-throttle: fix possible io stall when doing upgrade

2017-09-28 Thread Joseph Qi


On 17/9/28 11:48, Joseph Qi wrote:
> Hi Shahua,
> 
> On 17/9/28 05:38, Shaohua Li wrote:
>> On Tue, Sep 26, 2017 at 11:16:05AM +0800, Joseph Qi wrote:
>>>
>>>
>>> On 17/9/26 10:48, Shaohua Li wrote:
>>>> On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
>>>>> Hi Shaohua,
>>>>>
>>>>> On 17/9/26 01:22, Shaohua Li wrote:
>>>>>> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
>>>>>>> From: Joseph Qi <qijiang...@alibaba-inc.com>
>>>>>>>
>>>>>>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
>>>>>>> lead to io stall in the following case.
>>>>>>> Say the hierarchy is like:
>>>>>>> /-test1
>>>>>>>   |-subtest1
>>>>>>> and subtest1 has 32 queued bios now.
>>>>>>>
>>>>>>> throtl_pending_timer_fn                throtl_upgrade_state
>>>>>>> 
>>>>>>>upgrade to max
>>>>>>>throtl_select_dispatch
>>>>>>>throtl_schedule_next_dispatch
>>>>>>> throtl_select_dispatch
>>>>>>> throtl_schedule_next_dispatch
>>>>>>>
>>>>>>> Since throtl_select_dispatch will move queued bios from subtest1 to
>>>>>>> test1 in throtl_upgrade_state, it will then just do nothing in
>>>>>>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
>>>>>>> any more if no proper timer scheduled.
>>>>>>
>>>>>> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
>>>>>> throtl_upgrade_state already moves bios to parent), there is no pending
>>>>>> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
>>>>>> anything? could you please describe the failure in details?
>>>>>>
>>>>>> Thanks,
>>>>>> Shaohua
>>>>> In normal case, throtl_pending_timer_fn tries to move bios from
>>>>> subtest1 to test1, and finally do the real issueing work when reach
>>>>> the top-level.
>>>>> But int the case above, throtl_select_dispatch in
>>>>> throtl_pending_timer_fn returns 0, because the work is done by
>>>>> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
>>>>> nothing to do, but the queued bios are still in service queue of
>>>>> test1.
>>>>
>>>> Still didn't get, sorry. If there are pending bios in test1, why
>>>> throtl_schedule_next_dispatch in throtl_pending_timer_fn doesn't setup the
>>>> timer?
>>>>
>>>
>>> throtl_schedule_next_dispatch doesn't setup timer because there is no
>>> pending children left, all the queued bios are moved to parent test1
>>> now. IMO, this is used in case that it cannot dispatch all queued bios
>>> in one round.
>>> And if the select dispatch is done by timer, it will then do propagate
>>> dispatch in parent till reach the top-level.
>>> But in the case above, it breaks this logic.
>>> Please point out if I am understanding wrong.
>>
>> I read your reply again. So if the bios are move to test1, why don't we
>> dispatch bios of test1? throtl_upgrade_state does a post-order traversal, so 
>> it
>> handles subtest1 and then test1. Anything I missed? Please describe in 
>> details,
>> thanks! Did you see a real stall or is this based on code analysis?
>>
>> Thanks,
>> Shaohua
>>
> 
> Sorry for the unclear description and the misunderstanding brought in.
> I backported your patches to my kernel 3.10 and did the test. I tested
> with libaio and iodepth 32. Most time it worked well, but occasionally
> it would stall io, and the blktrace showed the following:
> 
> 252,0   26019.884802028 0  m   N throtl upgrade to max
> 252,0   13019.884820336 0  m   N throtl /test1 dispatch 
> nr_queued=32 read=0 write=32
> 
> From my analysis, it was because upgrade had moved the queued bios from
> subtest1 to test1, but not continued to move them to parent and did the
> real issuing. Then timer fn saw there were still 32 queued bios, but
> since select dispatch returned 0, it wouldn't try more.

Re: [PATCH] blk-throttle: fix possible io stall when doing upgrade

2017-09-27 Thread Joseph Qi
Hi Shaohua,

On 17/9/28 05:38, Shaohua Li wrote:
> On Tue, Sep 26, 2017 at 11:16:05AM +0800, Joseph Qi wrote:
>>
>>
>> On 17/9/26 10:48, Shaohua Li wrote:
>>> On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
>>>> Hi Shaohua,
>>>>
>>>> On 17/9/26 01:22, Shaohua Li wrote:
>>>>> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
>>>>>> From: Joseph Qi <qijiang...@alibaba-inc.com>
>>>>>>
>>>>>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
>>>>>> lead to io stall in the following case.
>>>>>> Say the hierarchy is like:
>>>>>> /-test1
>>>>>>   |-subtest1
>>>>>> and subtest1 has 32 queued bios now.
>>>>>>
>>>>>> throtl_pending_timer_fn                throtl_upgrade_state
>>>>>> 
>>>>>>upgrade to max
>>>>>>throtl_select_dispatch
>>>>>>throtl_schedule_next_dispatch
>>>>>> throtl_select_dispatch
>>>>>> throtl_schedule_next_dispatch
>>>>>>
>>>>>> Since throtl_select_dispatch will move queued bios from subtest1 to
>>>>>> test1 in throtl_upgrade_state, it will then just do nothing in
>>>>>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
>>>>>> any more if no proper timer scheduled.
>>>>>
>>>>> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
>>>>> throtl_upgrade_state already moves bios to parent), there is no pending
>>>>> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
>>>>> anything? could you please describe the failure in details?
>>>>>
>>>>> Thanks,
>>>>> Shaohua
>>>> In normal case, throtl_pending_timer_fn tries to move bios from
>>>> subtest1 to test1, and finally do the real issueing work when reach
>>>> the top-level.
>>>> But int the case above, throtl_select_dispatch in
>>>> throtl_pending_timer_fn returns 0, because the work is done by
>>>> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
>>>> nothing to do, but the queued bios are still in service queue of
>>>> test1.
>>>
>>> Still didn't get, sorry. If there are pending bios in test1, why
>>> throtl_schedule_next_dispatch in throtl_pending_timer_fn doesn't setup the
>>> timer?
>>>
>>
>> throtl_schedule_next_dispatch doesn't setup timer because there is no
>> pending children left, all the queued bios are moved to parent test1
>> now. IMO, this is used in case that it cannot dispatch all queued bios
>> in one round.
>> And if the select dispatch is done by timer, it will then do propagate
>> dispatch in parent till reach the top-level.
>> But in the case above, it breaks this logic.
>> Please point out if I am understanding wrong.
> 
> I read your reply again. So if the bios are move to test1, why don't we
> dispatch bios of test1? throtl_upgrade_state does a post-order traversal, so 
> it
> handles subtest1 and then test1. Anything I missed? Please describe in 
> details,
> thanks! Did you see a real stall or is this based on code analysis?
> 
> Thanks,
> Shaohua
> 

Sorry for the unclear description and the misunderstanding brought in.
I backported your patches to my 3.10 kernel and tested with libaio and
iodepth 32. Most of the time it worked well, but occasionally
it would stall io, and the blktrace showed the following:

252,0   26019.884802028 0  m   N throtl upgrade to max
252,0   13019.884820336 0  m   N throtl /test1 dispatch 
nr_queued=32 read=0 write=32

From my analysis, it was because the upgrade had moved the queued bios
from subtest1 to test1, but had not continued to move them to the
parent and do the real issuing. The timer fn then saw there were still
32 queued bios, but since select dispatch returned 0, it wouldn't try
more. As a result, the corresponding fio stalled.
I've looked at the code again and found that the behavior of
blkg_for_each_descendant_post changed between 3.10 and 4.12: in 3.10 it
doesn't include the root, while in 4.12 it does. That's why the above
case happens.
So upstream doesn't have this problem, sorry again for the noise.

Thanks,
Joseph


Re: [PATCH] blk-throttle: fix possible io stall when doing upgrade

2017-09-25 Thread Joseph Qi


On 17/9/26 10:48, Shaohua Li wrote:
> On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
>> Hi Shaohua,
>>
>> On 17/9/26 01:22, Shaohua Li wrote:
>>> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
>>>> From: Joseph Qi <qijiang...@alibaba-inc.com>
>>>>
>>>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
>>>> lead to io stall in the following case.
>>>> Say the hierarchy is like:
>>>> /-test1
>>>>   |-subtest1
>>>> and subtest1 has 32 queued bios now.
>>>>
>>>> throtl_pending_timer_fn                throtl_upgrade_state
>>>> 
>>>>upgrade to max
>>>>throtl_select_dispatch
>>>>throtl_schedule_next_dispatch
>>>> throtl_select_dispatch
>>>> throtl_schedule_next_dispatch
>>>>
>>>> Since throtl_select_dispatch will move queued bios from subtest1 to
>>>> test1 in throtl_upgrade_state, it will then just do nothing in
>>>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
>>>> any more if no proper timer scheduled.
>>>
>>> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
>>> throtl_upgrade_state already moves bios to parent), there is no pending
>>> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
>>> anything? could you please describe the failure in details?
>>>
>>> Thanks,
>>> Shaohua
>> In normal case, throtl_pending_timer_fn tries to move bios from
>> subtest1 to test1, and finally do the real issueing work when reach
>> the top-level.
>> But int the case above, throtl_select_dispatch in
>> throtl_pending_timer_fn returns 0, because the work is done by
>> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
>> nothing to do, but the queued bios are still in service queue of
>> test1.
> 
> Still didn't get, sorry. If there are pending bios in test1, why
> throtl_schedule_next_dispatch in throtl_pending_timer_fn doesn't setup the
> timer?
> 

throtl_schedule_next_dispatch doesn't set up the timer because there
are no pending children left; all the queued bios have been moved to
the parent test1 now. IMO, this is meant for the case where not all
queued bios can be dispatched in one round.
And if the select dispatch is done by the timer, it will then propagate
the dispatch to the parent until it reaches the top level.
But the case above breaks this logic.
Please point out if my understanding is wrong.

Thanks,
Joseph


Re: [PATCH] blk-throttle: fix possible io stall when doing upgrade

2017-09-25 Thread Joseph Qi
Hi Shaohua,

On 17/9/26 01:22, Shaohua Li wrote:
> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
>> From: Joseph Qi <qijiang...@alibaba-inc.com>
>>
>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
>> lead to io stall in the following case.
>> Say the hierarchy is like:
>> /-test1
>>   |-subtest1
>> and subtest1 has 32 queued bios now.
>>
>> throtl_pending_timer_fn                throtl_upgrade_state
>> 
>>upgrade to max
>>throtl_select_dispatch
>>throtl_schedule_next_dispatch
>> throtl_select_dispatch
>> throtl_schedule_next_dispatch
>>
>> Since throtl_select_dispatch will move queued bios from subtest1 to
>> test1 in throtl_upgrade_state, it will then just do nothing in
>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
>> any more if no proper timer scheduled.
> 
> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
> throtl_upgrade_state already moves bios to parent), there is no pending
> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
> anything? could you please describe the failure in details?
> 
> Thanks,
> Shaohua
In the normal case, throtl_pending_timer_fn moves bios from subtest1
to test1, and finally does the real issuing work when it reaches the
top level.
But in the case above, throtl_select_dispatch in
throtl_pending_timer_fn returns 0, because the work has already been
done by throtl_upgrade_state. throtl_pending_timer_fn then *thinks*
there is nothing to do, while the queued bios are still sitting in the
service queue of test1.
Since neither throtl_pending_timer_fn nor throtl_upgrade_state handles
the queued bios, io stall happens.

Thanks,
Joseph


[PATCH] blk-throttle: fix possible io stall when doing upgrade

2017-09-25 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

Currently it will try to dispatch bio in throtl_upgrade_state. This may
lead to io stall in the following case.
Say the hierarchy is like:
/-test1
  |-subtest1
and subtest1 has 32 queued bios now.

throtl_pending_timer_fn                throtl_upgrade_state

                                       upgrade to max
                                       throtl_select_dispatch
                                       throtl_schedule_next_dispatch
throtl_select_dispatch
throtl_schedule_next_dispatch

Since throtl_select_dispatch will move queued bios from subtest1 to
test1 in throtl_upgrade_state, it will then just do nothing in
throtl_pending_timer_fn. As a result, queued bios won't be dispatched
any more if no proper timer is scheduled.
Fix this issue by just scheduling the dispatch now and letting the
dispatch be done only in the timer function.

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0fea76a..29d282f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1909,13 +1909,11 @@ static void throtl_upgrade_state(struct throtl_data *td)
struct throtl_grp *tg = blkg_to_tg(blkg);
struct throtl_service_queue *sq = &tg->service_queue;
 
-   tg->disptime = jiffies - 1;
-   throtl_select_dispatch(sq);
-   throtl_schedule_next_dispatch(sq, false);
+   tg->disptime = jiffies;
+   throtl_schedule_next_dispatch(sq, true);
}
rcu_read_unlock();
-   throtl_select_dispatch(&td->service_queue);
-   throtl_schedule_next_dispatch(&td->service_queue, false);
+   throtl_schedule_next_dispatch(&td->service_queue, true);
queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
-- 
1.9.4


Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-09-05 Thread Joseph Qi
Hi Shaohua,

On 17/9/6 05:02, Shaohua Li wrote:
> On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
>>
>>> On 15 Jan 2017, at 04:42, Shaohua Li wrote:
>>>
>>> Hi,
>>>
>>> cgroup still lacks a good iocontroller. CFQ works well for hard disk, but 
>>> not
>>> much for SSD. This patch set try to add a conservative limit for 
>>> blk-throttle.
>>> It isn't a proportional scheduling, but can help prioritize cgroups. There 
>>> are
>>> several advantages we choose blk-throttle:
>>> - blk-throttle resides early in the block stack. It works for both bio and
>>>  request based queues.
>>> - blk-throttle is light weight in general. It still takes queue lock, but 
>>> it's
>>>  not hard to implement a per-cpu cache and remove the lock contention.
>>> - blk-throttle doesn't use 'idle disk' mechanism, which is used by CFQ/BFQ. 
>>> The
>>>  mechanism is proved to harm performance for fast SSD.
>>>
>>> The patch set add a new io.low limit for blk-throttle. It's only for 
>>> cgroup2.
>>> The existing io.max is a hard limit throttling. cgroup with a max limit 
>>> never
>>> dispatch more IO than its max limit. While io.low is a best effort 
>>> throttling.
>>> cgroups with 'low' limit can run above their 'low' limit at appropriate 
>>> time.
>>> Specifically, if all cgroups reach their 'low' limit, all cgroups can run 
>>> above
>>> their 'low' limit. If any cgroup runs under its 'low' limit, all other 
>>> cgroups
>>> will run according to their 'low' limit. So the 'low' limit could act as two
>>> roles, it allows cgroups using free bandwidth and it protects cgroups from
>>> their 'low' limit.
>>>
>>> An example usage is we have a high prio cgroup with high 'low' limit and a 
>>> low
>>> prio cgroup with low 'low' limit. If the high prio cgroup isn't running, 
>>> the low
>>> prio can run above its 'low' limit, so we don't waste the bandwidth. When 
>>> the
>>> high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
>>> under its 'low' limit. This will protect high prio cgroup to get more
>>> resources.
>>>
>>
>> Hi Shaohua,
> 
> Hi,
> 
> Sorry for the late response.
>> I would like to ask you some questions, to make sure I fully
>> understand how the 'low' limit and the idle-group detection work in
>> your above scenario.  Suppose that: the drive has a random-I/O peak
>> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
>> the low prio group has a 'low' limit of 10 MB/s.  If
>> - the high prio process happens to do, say, only 5 MB/s for a given
>>   long time
>> - the low prio process constantly does greedy I/O
>> - the idle-group detection is not being used
>> then the low prio process is limited to 10 MB/s during all this time
>> interval.  And only 10% of the device bandwidth is utilized.
>>
>> To recover lost bandwidth through idle-group detection, we need to set
>> a target IO latency for the high-prio group.  The high prio group
>> should happen to be below the threshold, and thus to be detected as
>> idle, leaving the low prio group free too use all the bandwidth.
>>
>> Here are my questions:
>> 1) Is all I wrote above correct?
> 
> Yes
>> 2) In particular, maybe there are other better mechanism to saturate
>> the bandwidth in the above scenario?
> 
> Assume it's the 4) below.
>> If what I wrote above is correct:
>> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
>> full bandwidth, the latency threshold of the high prio group may be
>> overcome, causing the high prio group to not be considered idle any
>> longer, and thus the low prio group to be limited again; this in turn
>> will cause the threshold to not be overcome any longer, and so on.
> 
> That's true. We try to mitigate the fluctuation by increasing the low prio
> cgroup bandwidth gradually though.
> 
>> 4) Is there a way to compute an appropriate target latency of the high
>> prio group, if it is a generic group, for which the latency
>> requirements of the processes it contains are only partially known or
>> completely unknown?  By appropriate target latency, I mean a target
>> latency that enables the framework to fully utilize the device
>> bandwidth while the high prio group is doing less I/O than its limit.
> 
> Not sure how we can do this. The device max bandwidth varies based on request
> size and read/write ratio. We don't know when the max bandwidth is reached.
> Also I think we must consider a case that the workloads never use the full
> bandwidth of a disk, which is pretty common for SSD (at least in our
> environment).
> 
I have a question on the base latency tracking.
From my test on SSD, write latency is much lower than read latency
when doing mixed read/write, but currently we only track read requests
and then use their average as the base latency. In other words, we
don't distinguish read and write now. As a result, a write request's
latency will always be considered good. So I think we have to track
read and write latency separately. Or 

[PATCH RESEND] blk-throttle: fix potential NULL pointer dereference in throtl_select_dispatch

2017-08-14 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

Since throtl_rb_first may return NULL, check tg first and only then
use it.
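
As a side note, here is a tiny userspace illustration (hypothetical,
stripped-down types) of why the ordering matters even though the
address is never dereferenced before the check: taking the address of
a member through a possibly NULL pointer is formally undefined
behavior in C, so the check should come first, as the hunk below does.

#include <stdio.h>

struct service_queue { int nr_queued[2]; };
struct group { struct service_queue service_queue; };

static void select_dispatch(struct group *tg)
{
	struct service_queue *sq;

	if (!tg)			/* check first ... */
		return;

	sq = &tg->service_queue;	/* ... then take the member address */
	printf("nr_queued=%d/%d\n", sq->nr_queued[0], sq->nr_queued[1]);
}

int main(void)
{
	struct group tg = { .service_queue = { .nr_queued = { 1, 2 } } };

	select_dispatch(&tg);
	select_dispatch(NULL);		/* safely ignored after the reorder */
	return 0;
}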

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a7285bf..c1d2c5b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1201,7 +1201,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
 
while (1) {
struct throtl_grp *tg = throtl_rb_first(parent_sq);
-   struct throtl_service_queue *sq = &tg->service_queue;
+   struct throtl_service_queue *sq;
 
if (!tg)
break;
@@ -1213,6 +1213,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
 
nr_disp += throtl_dispatch_tg(tg);
 
+   sq = &tg->service_queue;
if (sq->nr_queued[0] || sq->nr_queued[1])
tg_update_disptime(tg);
 
-- 
1.9.4


blk-throttle: fix potential NULL pointer dereference in throtl_select_dispatch

2017-08-09 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

Since throtl_rb_first may return NULL, check tg first and only then
use it.

Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a7285bf..c1d2c5b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1201,7 +1201,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
 
while (1) {
struct throtl_grp *tg = throtl_rb_first(parent_sq);
-   struct throtl_service_queue *sq = &tg->service_queue;
+   struct throtl_service_queue *sq;
 
if (!tg)
break;
@@ -1213,6 +1213,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
 
nr_disp += throtl_dispatch_tg(tg);
 
+   sq = &tg->service_queue;
if (sq->nr_queued[0] || sq->nr_queued[1])
tg_update_disptime(tg);
 
-- 
1.9.4


[PATCH] blk-throttle: fix NULL pointer dereference in throtl_schedule_pending_timer

2017-06-06 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

I have encountered a NULL pointer dereference in
throtl_schedule_pending_timer:
  [  413.735396] BUG: unable to handle kernel NULL pointer dereference at 
0038
  [  413.735535] IP: [] 
throtl_schedule_pending_timer+0x3f/0x210
  [  413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0
  [  413.735713] Oops:  [#1] SMP
  ..

This is caused by the following case:
  blk_throtl_bio
    throtl_schedule_next_dispatch  <= sq is top level one without parent
      throtl_schedule_pending_timer
        sq_to_tg(sq)->td->throtl_slice  <= sq_to_tg(sq) returns NULL

Fix it by using sq_to_td instead of sq_to_tg(sq)->td, which will always
return a valid td.
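
For readers unfamiliar with the sq_to_*() helpers, here is a standalone
userspace model (hypothetical, stripped-down struct layouts patterned
on those helpers) of why the top-level service queue must go through
sq_to_td(): it is embedded in throtl_data rather than in a throtl_grp,
so sq_to_tg() returns NULL for it and sq_to_tg(sq)->td is exactly the
NULL dereference in the oops above.

#include <stdio.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct service_queue { struct service_queue *parent_sq; };

struct throtl_data {				/* owns the top-level sq */
	struct service_queue service_queue;
	unsigned long throtl_slice;
};

struct throtl_grp {				/* owns per-group sqs */
	struct service_queue service_queue;
	struct throtl_data *td;
};

static struct throtl_grp *sq_to_tg(struct service_queue *sq)
{
	/* only group-embedded queues have a parent */
	if (sq && sq->parent_sq)
		return container_of(sq, struct throtl_grp, service_queue);
	return NULL;
}

static struct throtl_data *sq_to_td(struct service_queue *sq)
{
	struct throtl_grp *tg = sq_to_tg(sq);

	return tg ? tg->td
		  : container_of(sq, struct throtl_data, service_queue);
}

int main(void)
{
	struct throtl_data td = { .throtl_slice = 100 };
	struct throtl_grp tg = { .td = &td };

	tg.service_queue.parent_sq = &td.service_queue;

	/* sq_to_td() is safe for both the group sq and the top-level sq */
	printf("%lu %lu\n", sq_to_td(&tg.service_queue)->throtl_slice,
	       sq_to_td(&td.service_queue)->throtl_slice);

	/*
	 * sq_to_tg(&td.service_queue) is NULL here, so the old
	 * sq_to_tg(sq)->td->throtl_slice would be the reported crash.
	 */
	return 0;
}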

Fixes: 297e3d854784 ("blk-throttle: make throtl_slice tunable")
Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-throttle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fc13dd0..3b751f7 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -698,7 +698,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
  unsigned long expires)
 {
-   unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;
+   unsigned long max_expire = jiffies + 8 * sq_to_td(sq)->throtl_slice;
 
/*
 * Since we are adjusting the throttle limit dynamically, the sleep
-- 
1.9.4


blk-mq: free callback if error occurs in blk_mq_init_allocated_queue

2017-05-31 Thread Joseph Qi
From: Joseph Qi <qijiang...@alibaba-inc.com>

If error occurs, we have to free the allocated callback in
blk_mq_init_allocated_queue to avoid memory leaking.
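
The fix follows the usual goto-unwind pattern; a generic userspace
sketch is below (the names are made up, only the shape matches): every
resource allocated before a possible failure point needs a matching
label on the error path, and the patch adds exactly such a label for
q->poll_cb.

#include <stdio.h>
#include <stdlib.h>

static int setup(void)
{
	char *cb = NULL, *ctx = NULL;

	cb = malloc(32);		/* stands in for blk_stat_alloc_callback() */
	if (!cb)
		goto err_exit;

	ctx = malloc(64);		/* stands in for alloc_percpu() */
	if (!ctx)
		goto err_free_cb;	/* must not jump past freeing cb */

	/* success: real code keeps both; freed here only to keep the demo leak-free */
	free(ctx);
	free(cb);
	return 0;

err_free_cb:
	free(cb);			/* the label the original error path was missing */
err_exit:
	return -1;
}

int main(void)
{
	return setup() ? EXIT_FAILURE : EXIT_SUCCESS;
}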

Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Joseph Qi <qijiang...@alibaba-inc.com>
---
 block/blk-mq.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2224ffd..50b5d7b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2273,7 +2273,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
 
q->queue_ctx = alloc_percpu(struct blk_mq_ctx);
if (!q->queue_ctx)
-   goto err_exit;
+   goto err_poll_cb;
 
/* init q->mq_kobj and sw queues' kobjects */
blk_mq_sysfs_init(q);
@@ -2346,6 +2346,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
kfree(q->queue_hw_ctx);
 err_percpu:
free_percpu(q->queue_ctx);
+err_poll_cb:
+   blk_stat_free_callback(q->poll_cb);
 err_exit:
q->mq_ops = NULL;
return ERR_PTR(-ENOMEM);
-- 
1.8.3.1