Re: [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup

2021-03-21 Thread Paolo Valente



> On 12 Mar 2021, at 12:08, brookxu wrote:
> 
> From: Chunguang Xu 
> 

Hi Chunguang,

> Tasks in the production environment can be roughly divided into
> three categories: emergency tasks, ordinary tasks and offline
> tasks. Emergency tasks need to be scheduled in real time, such
> as system agents. Offline tasks do not need to guarantee QoS,
> but can improve system resource utilization during system idle
> periods, such as background tasks. Meeting the above requirements
> requires IO preemption. At present, we can use weights to
> approximate IO preemption, but since weight is a sharing concept,
> the approximation is poor. For example, suitable weights for
> emergency tasks versus ordinary tasks are hard to determine,
> offline tasks (with the same weight) actually occupy different
> amounts of resources on disks with different performance, and the
> tail latency caused by offline tasks cannot be well controlled.
> Using ioprio's concept of preemption, we can solve the above
> problems well. Since ioprio is eventually converted to a weight,
> using ioprio alone also achieves weight isolation within the
> same class. And we can still use bfq.weight to control resources,
> achieving finer-grained IO QoS control.
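
For reference, the "ioprio is eventually converted to a weight" point can be
sketched as follows (illustrative only; the constants mirror the kernels around
this series, but see bfq_ioprio_to_weight() in block/bfq-wf2q.c for the actual
in-tree conversion):

#define BFQ_WEIGHT_CONVERSION_COEFF	10
#define IOPRIO_BE_NR			8	/* ioprio levels per class */

/* Sketch: a lower ioprio value yields a higher weight within a class */
static unsigned short ioprio_to_weight(int ioprio)
{
	return (IOPRIO_BE_NR - ioprio) * BFQ_WEIGHT_CONVERSION_COEFF;
}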
> 
> However, currently the class of a bfq_group is always BE, and
> the ioprio class of a task can only be reflected within a single
> cgroup. We cannot guarantee that real-time tasks in a cgroup are
> scheduled in time. Therefore, we introduce bfq.ioprio, which
> allows us to configure the ioprio class for a cgroup. In this way, we
> can ensure that the real-time tasks of a cgroup are scheduled
> in time. Similarly, the handling of offline task groups also
> becomes simpler.
> 

I find this contribution very interesting.  Anyway, given the
relevance of such a contribution, I'd like to hear from relevant
people (Jens, Tejun, ...?), before revising individual patches.

Yet I already have a general question.  How does this mechanism interact
with per-process ioprios and ioprio classes?  For example, what
happens if a process belongs to a BE-class group according to your
mechanism, but to the RT class according to its ioprio?  Does the
per-group class dominate the per-process class?  Is it all clean and
predictable?
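
For readers less familiar with the per-process side referred to here, a minimal
userspace sketch of a task setting its own ioprio (illustrative only; the macros
mirror the kernel's ioprio encoding, and the RT class requires CAP_SYS_ADMIN):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_RT		1
#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
	/* who == 0 means the calling process */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4)) < 0)
		perror("ioprio_set");
	return 0;
}

Such a per-process setting is what may conflict with a per-group class
configured via bfq.ioprio.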

> The bfq.ioprio interface is now available for cgroup v1 and cgroup
> v2. Users can configure the ioprio for a cgroup through this interface,
> as shown below:
> 
> echo "1 2" > blkio.bfq.ioprio

Wouldn't it be nicer to have acronyms for classes (RT, BE, IDLE),
instead of numbers?

Thank you very much for this improvement proposal,
Paolo

> 
> The above two values respectively represent the values of ioprio
> class and ioprio for cgroup. The ioprio of tasks within the cgroup
> is uniformly equal to the ioprio of the cgroup. If the ioprio of
> the cgroup is disabled, the ioprio of the task remains the same,
> usually from io_context.
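
For illustration, a sketch of how a two-value file like blkio.bfq.ioprio is
typically parsed (hypothetical, not the code of this patch set): two unsigned
integers, an ioprio class and an ioprio level.

#include <errno.h>
#include <stdio.h>

/* Hypothetical parser for a "class prio" pair such as "1 2" */
static int parse_bfq_ioprio(const char *buf, unsigned int *class,
			    unsigned int *prio)
{
	if (sscanf(buf, "%u %u", class, prio) != 2)
		return -EINVAL;
	if (*class > 3 || *prio > 7)	/* classes: none/RT/BE/IDLE; levels 0..7 */
		return -EINVAL;
	return 0;
}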
> 
> When testing with fio and fio_generate_plots, we can clearly see
> that the IO delay of the tasks satisfies RT > BE > IDLE. When RT is
> running, BE and IDLE are guaranteed a minimum bandwidth. When used
> with bfq.weight, we can also isolate resources within the same
> class.
> 
> The test process is as follows:
> # prepare data disk
> mount /dev/sdb /data1
> 
> # create cgroup v1 hierarchy
> cd /sys/fs/cgroup/blkio
> mkdir rt be idle
> echo "1 0" > rt/blkio.bfq.ioprio
> echo "2 0" > be/blkio.bfq.ioprio
> echo "3 0" > idle/blkio.bfq.ioprio
> 
> # run fio test
> fio fio.ini
> 
> # generate svg graph
> fio_generate_plots res
> 
> The contents of fio.ini are as follows:
> [global]
> ioengine=libaio
> group_reporting=1
> log_avg_msec=500
> direct=1
> time_based=1
> iodepth=16
> size=100M
> rw=write
> bs=1M
> [rt]
> name=rt
> write_bw_log=rt
> write_lat_log=rt
> write_iops_log=rt
> filename=/data1/rt.bin
> cgroup=rt
> runtime=30s
> nice=-10
> [be]
> name=be
> new_group
> write_bw_log=be
> write_lat_log=be
> write_iops_log=be
> filename=/data1/be.bin
> cgroup=be
> runtime=60s
> [idle]
> name=idle
> new_group
> write_bw_log=idle
> write_lat_log=idle
> write_iops_log=idle
> filename=/data1/idle.bin
> cgroup=idle
> runtime=90s
> 
> V2:
> 1. Optimise bfq_select_next_class().
> 2. Introduce bfq_group[] to track the number of groups for each CLASS.
> 3. Optimise IO injection, EMQ and the idle mechanism for CLASS_RT.
> 
> Chunguang Xu (11):
>  bfq: introduce bfq_entity_to_bfqg helper method
>  bfq: limit the IO depth of idle_class to 1
>  bfq: keep the minimum bandwidth for be_class
>  bfq: expire other class if CLASS_RT is waiting
>  bfq: optimise IO injection for CLASS_RT
>  bfq: disallow idle if CLASS_RT waiting for service
>  bfq: disallow merge CLASS_RT with other class
>  bfq: introduce bfq.ioprio for cgroup
>  bfq: convert the type of bfq_group.bfqd to bfq_data*
>  bfq: remove unnecessary initialization logic
>  bfq: optimize the calculation of bfq_weight_to_ioprio()
> 
> block/bfq-cgroup.c  |  99 +++
> 

Re: [PATCH BUGFIX/IMPROVEMENT V2 0/6] revised version of third and last batch of patches

2021-03-20 Thread Paolo Valente



> On 4 Mar 2021, at 18:46, Paolo Valente wrote:
> 
> Hi,
> this is the V2 for the third and last batch of patches that I
> proposed recently [1].
> 
> I've tried to address all issues raised in [1].
> 
> In more detail, the main changes from V1 are:
> 1. I've improved code as requested in "block, bfq: merge bursts of
> newly-created queues"
> 2. I've improved comments as requested in "block, bfq: put reqs of
> waker and woken in dispatch list"
> 

Hi Jens,
any news on this patch series?

Thanks,
Paolo

> Thanks,
> Paolo
> 
> [1] https://www.spinics.net/lists/linux-block/msg64333.html
> 
> Paolo Valente (6):
>  block, bfq: always inject I/O of queues blocked by wakers
>  block, bfq: put reqs of waker and woken in dispatch list
>  block, bfq: make shared queues inherit wakers
>  block, bfq: fix weight-raising resume with !low_latency
>  block, bfq: keep shared queues out of the waker mechanism
>  block, bfq: merge bursts of newly-created queues
> 
> block/bfq-cgroup.c  |   2 +
> block/bfq-iosched.c | 399 +---
> block/bfq-iosched.h |  15 ++
> block/bfq-wf2q.c|   8 +
> 4 files changed, 402 insertions(+), 22 deletions(-)
> 
> --
> 2.20.1



Re: [bugreport 5.9-rc8] general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI

2021-03-05 Thread Paolo Valente
I'm thinking of a way to debug this too.  The symptom may hint at a
use-after-free.  Could you enable KASAN in your tests?  (On the flip
side, I know this might change timings, thereby making the fault
disappear).

Thanks,
Paolo

> On 5 Mar 2021, at 10:27, Ming Lei wrote:
> 
> Hello Hillf,
> 
> Thanks for the debug patch.
> 
> On Fri, Mar 5, 2021 at 5:00 PM Hillf Danton  wrote:
>> 
>> On Thu, 4 Mar 2021 16:42:30 +0800  Ming Lei wrote:
>>> On Sat, Oct 10, 2020 at 1:40 PM Mikhail Gavrilov
>>>  wrote:
 
 Paolo, Jens, I am sorry for the noise.
 But today I hit a kernel panic, and git blame said that you have
 created the file in which the panic happened (I saw this from the trace)
 
 $ /usr/src/kernels/`uname -r`/scripts/faddr2line
 /lib/debug/lib/modules/`uname -r`/vmlinux
 __bfq_deactivate_entity+0x15a
 __bfq_deactivate_entity+0x15a/0x240:
 bfq_gt at block/bfq-wf2q.c:20
 (inlined by) bfq_insert at block/bfq-wf2q.c:381
 (inlined by) bfq_idle_insert at block/bfq-wf2q.c:621
 (inlined by) __bfq_deactivate_entity at block/bfq-wf2q.c:1203
 
 https://github.com/torvalds/linux/blame/master/block/bfq-wf2q.c#L1203
 
 $ head /sys/block/*/queue/scheduler
 ==> /sys/block/nvme0n1/queue/scheduler <==
 [none] mq-deadline kyber bfq
 
 ==> /sys/block/sda/queue/scheduler <==
 mq-deadline kyber [bfq] none
 
 ==> /sys/block/zram0/queue/scheduler <==
 none
 
 Trace:
 general protection fault, probably for non-canonical address
 0x46b1b0f0d8856e4a:  [#1] SMP NOPTI
 CPU: 27 PID: 1018 Comm: kworker/27:1H Tainted: GW
 - ---  5.9.0-0.rc8.28.fc34.x86_64 #1
 Hardware name: System manufacturer System Product Name/ROG STRIX
 X570-I GAMING, BIOS 2606 08/13/2020
 Workqueue: kblockd blk_mq_run_work_fn
 RIP: 0010:__bfq_deactivate_entity+0x15a/0x240
 Code: 48 2b 41 28 48 85 c0 7e 05 49 89 5c 24 18 49 8b 44 24 08 4d 8d
 74 24 08 48 85 c0 0f 84 d6 00 00 00 48 8b 7b 28 eb 03 48 89 c8 <48> 8b
 48 28 48 8d 70 10 48 8d 50 08 48 29 f9 48 85 c9 48 0f 4f d6
 RSP: 0018:adf6c0c6fc00 EFLAGS: 00010002
 RAX: 46b1b0f0d8856e4a RBX: 8dc2773b5c88 RCX: 46b1b0f0d8856e4a
 RDX: 8dc7d02ed0a0 RSI: 8dc7d02ed0a8 RDI: 584e64e96beb
 RBP: 8dc2773b5c00 R08: 8dc9054cb938 R09: 
 R10: 0018 R11: 0018 R12: 8dc904927150
 R13: 0001 R14: 8dc904927158 R15: 8dc2773b5c88
 FS:  () GS:8dc90e0c() 
 knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 003e8ebe4000 CR3: 0007c2546000 CR4: 00350ee0
 Call Trace:
 bfq_deactivate_entity+0x4f/0xc0
>>> 
>>> Hello,
>>> 
>>> The same stack trace was observed in an RH internal test too, and the
>>> kernel is 5.11.0-0.rc6,
>>> but there isn't a reproducer yet.
>>> 
>>> 
>>> --
>>> Ming Lei
>> 
>> Add some debug info.
>> 
>> --- x/block/bfq-wf2q.c
>> +++ y/block/bfq-wf2q.c
>> @@ -647,8 +647,10 @@ static void bfq_forget_entity(struct bfq
>> 
>>entity->on_st_or_in_serv = false;
>>st->wsum -= entity->weight;
>> -   if (bfqq && !is_in_service)
>> +   if (bfqq && !is_in_service) {
>> +   WARN_ON(entity->tree != NULL);
>>bfq_put_queue(bfqq);
>> +   }
>> }
>> 
>> /**
>> @@ -1631,6 +1633,7 @@ bool __bfq_bfqd_reset_in_service(struct
>> * bfqq gets freed here.
>> */
>>int ref = in_serv_bfqq->ref;
>> +   WARN_ON(in_serv_entity->tree != NULL);
>>bfq_put_queue(in_serv_bfqq);
>>if (ref == 1)
>>return true;
> 
> This kernel oops isn't easy to reproduce, and we have got another crash
> report [1] too, still on __bfq_deactivate_entity(), and not easy to
> trigger.  Can your
> debug patch cover report [1]? If not, feel free to add more debug messages,
> then I will try to reproduce the two.
> 
> [1] another kernel oops log on __bfq_deactivate_entity
> 
> [  899.790606] systemd-sysv-generator[25205]: SysV service
> '/etc/rc.d/init.d/anamon' lacks a native systemd unit file.
> Automatically generating a unit file for compatibility. Please update
> package to include a native systemd unit file, in order to make it
> more safe and robust.
> [  901.937047] BUG: kernel NULL pointer dereference, address: 
> [  901.944005] #PF: supervisor read access in kernel mode
> [  901.949143] #PF: error_code(0x) - not-present page
> [  901.954285] PGD 0 P4D 0
> [  901.956824] Oops:  [#1] SMP NOPTI
> [  901.960490] CPU: 13 PID: 22966 Comm: kworker/13:0 Tainted: G
>  IX - ---  5.11.0-1.el9.x86_64 #1
> [  901.970829] Hardware name: Dell Inc. PowerEdge R740xd/0WXD1Y, BIOS
> 2.5.4 01/13/2020
> [  901.978480] Workqueue: cgwb_release cgwb_release_workfn
> [  901.983705] RIP: 

[PATCH BUGFIX/IMPROVEMENT V2 6/6] block, bfq: merge bursts of newly-created queues

2021-03-04 Thread Paolo Valente
Many throughput-sensitive workloads are made of several parallel I/O
flows, with all flows generated by the same application, or more
generically by the same task (e.g., system boot). The most
counterproductive action with these workloads is plugging I/O dispatch
when one of the bfq_queues associated with these flows remains
temporarily empty.

To avoid this plugging, BFQ has been using a burst-handling mechanism
for years now. This mechanism has proven effective for throughput, and
not detrimental for service guarantees. This commit pushes this
mechanism a little bit further, based on the following two facts.

First, all the I/O flows of the same application or task contribute
to the execution/completion of that common application or task. So the
performance figures that matter are total throughput of the flows and
task-wide I/O latency.  In particular, these flows do not need to be
protected from each other, in terms of individual bandwidth or
latency.

Second, the above fact holds regardless of the number of flows.

Putting these two facts together, this commit stably merges the
bfq_queues associated with these I/O flows, i.e., with the processes
that generate these I/O flows, regardless of how many processes are
involved.

To decide whether a set of bfq_queues is actually associated with the
I/O flows of a common application or task, and to merge these queues
stably, this commit operates as follows: given a bfq_queue, say Q2,
currently being created, and the last bfq_queue, say Q1, created
before Q2, Q2 is merged stably with Q1 if
- very little time has elapsed since when Q1 was created
- Q2 has the same ioprio as Q1
- Q2 belongs to the same group as Q1
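
The three criteria above can be condensed as in the following sketch (the helper
name and the 20 ms threshold are illustrative, not the patch's actual code):

/* Hypothetical condensation of the stable-merge criteria listed above */
static bool bfqq_stable_merge_candidate(struct bfq_queue *q1, /* last created */
					struct bfq_queue *q2) /* being created */
{
	return time_is_after_jiffies(q1->creation_time +
				     msecs_to_jiffies(20)) &&	/* Q1 just created */
	       q2->ioprio == q1->ioprio &&			/* same ioprio */
	       q2->entity.parent == q1->entity.parent;		/* same group */
}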

Merging bfq_queues also reduces scheduling overhead. A fio test with
ten random readers on /dev/nullb shows a throughput boost of 40% on a
quad-core system. Since BFQ's execution time amounts to ~50% of the
total per-request processing time, the above throughput boost implies
that BFQ's overhead is reduced by more than 50% (a 40% boost means the
per-request time drops to ~71% of its original value; with the non-BFQ
half unchanged, BFQ's share shrinks from ~50% to ~21% of the original
time, i.e., by roughly 57%).

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-cgroup.c  |   2 +
 block/bfq-iosched.c | 259 ++--
 block/bfq-iosched.h |  15 +++
 3 files changed, 266 insertions(+), 10 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index b791e2041e49..e2f14508f2d6 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -547,6 +547,8 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
 
entity->orig_weight = entity->weight = entity->new_weight = d->weight;
 entity->my_sched_data = &bfqg->sched_data;
+   entity->last_bfqq_created = NULL;
+
bfqg->my_entity = entity; /*
   * the root_group's will be set to NULL
   * in bfq_init_queue()
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c62dbbe9cc33..4ba89c55a856 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1073,7 +1073,7 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
 static int bfqq_process_refs(struct bfq_queue *bfqq)
 {
return bfqq->ref - bfqq->allocated - bfqq->entity.on_st_or_in_serv -
-   (bfqq->weight_counter != NULL);
+   (bfqq->weight_counter != NULL) - bfqq->stable_ref;
 }
 
 /* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */
@@ -2625,6 +2625,11 @@ static bool bfq_may_be_close_cooperator(struct bfq_queue 
*bfqq,
return true;
 }
 
+static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
+struct bfq_queue *bfqq);
+
+static void bfq_put_stable_ref(struct bfq_queue *bfqq);
+
 /*
  * Attempt to schedule a merge of bfqq with the currently in-service
  * queue or with a close queue among the scheduled queues.  Return
@@ -2647,10 +2652,49 @@ static bool bfq_may_be_close_cooperator(struct 
bfq_queue *bfqq,
  */
 static struct bfq_queue *
 bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-void *io_struct, bool request)
+void *io_struct, bool request, struct bfq_io_cq *bic)
 {
struct bfq_queue *in_service_bfqq, *new_bfqq;
 
+   /*
+* Check delayed stable merge for rotational or non-queueing
+* devs. For this branch to be executed, bfqq must not be
+* currently merged with some other queue (i.e., bfqq->bic
+* must be non null). If we considered also merged queues,
+* then we should also check whether bfqq has already been
+* merged with bic->stable_merge_bfqq. But this would be
+* costly and complicated.
+*/
+   if (unlikely(!bfqd->nonrot_with_queueing)) {
+   if (bic->stable_merge_bfqq &&
+   !bfq_bfqq_just_created(bfqq) &&
+   time_is_after_jiffies(bfqq->split_time +
+  

[PATCH BUGFIX/IMPROVEMENT V2 3/6] block, bfq: make shared queues inherit wakers

2021-03-04 Thread Paolo Valente
Consider a bfq_queue bfqq that is about to be merged with another
bfq_queue new_bfqq. The processes associated with bfqq are cooperators
of the processes associated with new_bfqq. So, if bfqq has a waker,
then it is reasonable (and beneficial for throughput) to assume that
all these processes will be happy to let bfqq's waker freely inject
I/O when they have no I/O. So this commit makes new_bfqq inherit
bfqq's waker.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a9c1a14b64f4..4b3d4849f3f5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2819,6 +2819,29 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq 
*bic,
bfq_mark_bfqq_IO_bound(new_bfqq);
bfq_clear_bfqq_IO_bound(bfqq);
 
+   /*
+* The processes associated with bfqq are cooperators of the
+* processes associated with new_bfqq. So, if bfqq has a
+* waker, then assume that all these processes will be happy
+* to let bfqq's waker freely inject I/O when they have no
+* I/O.
+*/
+   if (bfqq->waker_bfqq && !new_bfqq->waker_bfqq &&
+   bfqq->waker_bfqq != new_bfqq) {
+   new_bfqq->waker_bfqq = bfqq->waker_bfqq;
+   new_bfqq->tentative_waker_bfqq = NULL;
+
+   /*
+* If the waker queue disappears, then
+* new_bfqq->waker_bfqq must be reset. So insert
+* new_bfqq into the woken_list of the waker. See
+* bfq_check_waker for details.
+*/
+   hlist_add_head(&new_bfqq->woken_list_node,
+  &new_bfqq->waker_bfqq->woken_list);
+
+   }
+
/*
 * If bfqq is weight-raised, then let new_bfqq inherit
 * weight-raising. To reduce false positives, neglect the case
@@ -6303,7 +6326,7 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
if (likely(!new_queue)) {
/* If the queue was seeky for too long, break it apart. */
if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
-   bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+   struct bfq_queue *old_bfqq = bfqq;
 
/* Update bic before losing reference to bfqq */
if (bfq_bfqq_in_large_burst(bfqq))
@@ -6312,11 +6335,24 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
bfqq = bfq_split_bfqq(bic, bfqq);
split = true;
 
-   if (!bfqq)
+   if (!bfqq) {
bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
 true, is_sync,
 NULL);
-   else
+   bfqq->waker_bfqq = old_bfqq->waker_bfqq;
+   bfqq->tentative_waker_bfqq = NULL;
+
+   /*
+* If the waker queue disappears, then
+* new_bfqq->waker_bfqq must be
+* reset. So insert new_bfqq into the
+* woken_list of the waker. See
+* bfq_check_waker for details.
+*/
+   if (bfqq->waker_bfqq)
+   hlist_add_head(&bfqq->woken_list_node,
+  &bfqq->waker_bfqq->woken_list);
+   } else
bfqq_already_existing = true;
}
}
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT V2 4/6] block, bfq: fix weight-raising resume with !low_latency

2021-03-04 Thread Paolo Valente
When the io_latency heuristic is off, bfq_queues must not start to be
weight-raised. Unfortunately, by mistake, this may happen when the
state of a previously weight-raised bfq_queue is resumed after a queue
split. This commit fixes this error.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 4b3d4849f3f5..8497d0803d74 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1010,7 +1010,7 @@ static void
 bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
  struct bfq_io_cq *bic, bool bfq_already_existing)
 {
-   unsigned int old_wr_coeff = bfqq->wr_coeff;
+   unsigned int old_wr_coeff = 1;
bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
 
if (bic->saved_has_short_ttime)
@@ -1031,7 +1031,13 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
bfqq->ttime = bic->saved_ttime;
bfqq->io_start_time = bic->saved_io_start_time;
bfqq->tot_idle_time = bic->saved_tot_idle_time;
-   bfqq->wr_coeff = bic->saved_wr_coeff;
+   /*
+* Restore weight coefficient only if low_latency is on
+*/
+   if (bfqd->low_latency) {
+   old_wr_coeff = bfqq->wr_coeff;
+   bfqq->wr_coeff = bic->saved_wr_coeff;
+   }
bfqq->service_from_wr = bic->saved_service_from_wr;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
bfqq->last_wr_start_finish = bic->saved_last_wr_start_finish;
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT V2 2/6] block, bfq: put reqs of waker and woken in dispatch list

2021-03-04 Thread Paolo Valente
Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
this happens, the only active bfq_queues are bfqq and either its waker
bfq_queue or one of its woken bfq_queues, then there is no point in
queueing this new I/O request in bfqq for service. In fact, the
in-service queue and bfqq agree on serving this new I/O request as
soon as possible. So this commit puts this new I/O request directly
into the dispatch list.

Tested-by: Jan Kara 
Acked-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 44 +++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a83149407336..a9c1a14b64f4 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5640,7 +5640,49 @@ static void bfq_insert_request(struct blk_mq_hw_ctx 
*hctx, struct request *rq,
 
spin_lock_irq(&bfqd->lock);
bfqq = bfq_init_rq(rq);
-   if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
+
+   /*
+* Reqs with at_head or passthrough flags set are to be put
+* directly into dispatch list. Additional case for putting rq
+* directly into the dispatch queue: the only active
+* bfq_queues are bfqq and either its waker bfq_queue or one
+* of its woken bfq_queues. The rationale behind this
+* additional condition is as follows:
+* - consider a bfq_queue, say Q1, detected as a waker of
+*   another bfq_queue, say Q2
+* - by definition of a waker, Q1 blocks the I/O of Q2, i.e.,
+*   some I/O of Q1 needs to be completed for new I/O of Q2
+*   to arrive.  A notable example of waker is journald
+* - so, Q1 and Q2 are in any respect the queues of two
+*   cooperating processes (or of two cooperating sets of
+*   processes): the goal of Q1's I/O is doing what needs to
+*   be done so that new Q2's I/O can finally be
+*   issued. Therefore, if the service of Q1's I/O is delayed,
+*   then Q2's I/O is delayed too.  Conversely, if Q2's I/O is
+*   delayed, the goal of Q1's I/O is hindered.
+* - as a consequence, if some I/O of Q1/Q2 arrives while
+*   Q2/Q1 is the only queue in service, there is absolutely
+*   no point in delaying the service of such an I/O. The
+*   only possible result is a throughput loss
+* - so, when the above condition holds, the best option is to
+*   have the new I/O dispatched as soon as possible
+* - the most effective and efficient way to attain the above
+*   goal is to put the new I/O directly in the dispatch
+*   list
+* - as an additional restriction, Q1 and Q2 must be the only
+*   busy queues for this commit to put the I/O of Q2/Q1 in
+*   the dispatch list.  This is necessary, because, if also
+*   other queues are waiting for service, then putting new
+*   I/O directly in the dispatch list may evidently cause a
+*   violation of service guarantees for the other queues
+*/
+   if (!bfqq ||
+   (bfqq != bfqd->in_service_queue &&
+bfqd->in_service_queue != NULL &&
+bfq_tot_busy_queues(bfqd) == 1 + bfq_bfqq_busy(bfqq) &&
+(bfqq->waker_bfqq == bfqd->in_service_queue ||
+ bfqd->in_service_queue->waker_bfqq == bfqq)) ||
+   at_head || blk_rq_is_passthrough(rq)) {
if (at_head)
list_add(&rq->queuelist, &bfqd->dispatch);
else
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT V2 5/6] block, bfq: keep shared queues out of the waker mechanism

2021-03-04 Thread Paolo Valente
Shared queues are likely to receive I/O at a high rate. This may
deceptively let them be considered as wakers of other queues. But a
false waker will unjustly steal bandwidth from its supposedly woken
queue. So considering also shared queues in the waking mechanism may
cause more control troubles than throughput benefits. This commit
keeps shared queues out of the waker-detection mechanism.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 8497d0803d74..c62dbbe9cc33 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5852,7 +5852,17 @@ static void bfq_completed_request(struct bfq_queue 
*bfqq, struct bfq_data *bfqd)
1UL<<(BFQ_RATE_SHIFT - 10))
bfq_update_rate_reset(bfqd, NULL);
bfqd->last_completion = now_ns;
-   bfqd->last_completed_rq_bfqq = bfqq;
+   /*
+* Shared queues are likely to receive I/O at a high
+* rate. This may deceptively let them be considered as wakers
+* of other queues. But a false waker will unjustly steal
+* bandwidth to its supposedly woken queue. So considering
+* also shared queues in the waking mechanism may cause more
+* control troubles than throughput benefits. Then do not set
+* last_completed_rq_bfqq to bfqq if bfqq is a shared queue.
+*/
+   if (!bfq_bfqq_coop(bfqq))
+   bfqd->last_completed_rq_bfqq = bfqq;
 
/*
 * If we are waiting to discover whether the request pattern
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT V2 0/6] revised version of third and last batch of patches

2021-03-04 Thread Paolo Valente
Hi,
this is the V2 for the third and last batch of patches that I
proposed recently [1].

I've tried to address all issues raised in [1].

In more detail, the main changes from V1 are:
1. I've improved code as requested in "block, bfq: merge bursts of
newly-created queues"
2. I've improved comments as requested in "block, bfq: put reqs of
waker and woken in dispatch list"

Thanks,
Paolo

[1] https://www.spinics.net/lists/linux-block/msg64333.html

Paolo Valente (6):
  block, bfq: always inject I/O of queues blocked by wakers
  block, bfq: put reqs of waker and woken in dispatch list
  block, bfq: make shared queues inherit wakers
  block, bfq: fix weight-raising resume with !low_latency
  block, bfq: keep shared queues out of the waker mechanism
  block, bfq: merge bursts of newly-created queues

 block/bfq-cgroup.c  |   2 +
 block/bfq-iosched.c | 399 +---
 block/bfq-iosched.h |  15 ++
 block/bfq-wf2q.c|   8 +
 4 files changed, 402 insertions(+), 22 deletions(-)

--
2.20.1


[PATCH BUGFIX/IMPROVEMENT V2 1/6] block, bfq: always inject I/O of queues blocked by wakers

2021-03-04 Thread Paolo Valente
Suppose that I/O dispatch is plugged, to wait for new I/O for the
in-service bfq-queue, say bfqq.  Suppose then that there is a further
bfq_queue woken by bfqq, and that this woken queue has pending I/O. A
woken queue does not steal bandwidth from bfqq, because it remains
soon without I/O if bfqq is not served. So there is virtually no risk
of loss of bandwidth for bfqq if this woken queue has I/O dispatched
while bfqq is waiting for new I/O. In contrast, this extra I/O
injection boosts throughput. This commit performs this extra
injection.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 32 +++-
 block/bfq-wf2q.c|  8 
 2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 445cef9c0bb9..a83149407336 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4487,9 +4487,15 @@ static struct bfq_queue *bfq_select_queue(struct 
bfq_data *bfqd)
bfq_bfqq_busy(bfqq->bic->bfqq[0]) &&
bfqq->bic->bfqq[0]->next_rq ?
bfqq->bic->bfqq[0] : NULL;
+   struct bfq_queue *blocked_bfqq =
!hlist_empty(&bfqq->woken_list) ?
+   container_of(bfqq->woken_list.first,
+struct bfq_queue,
+woken_list_node)
+   : NULL;
 
/*
-* The next three mutually-exclusive ifs decide
+* The next four mutually-exclusive ifs decide
 * whether to try injection, and choose the queue to
 * pick an I/O request from.
 *
@@ -4522,7 +4528,15 @@ static struct bfq_queue *bfq_select_queue(struct 
bfq_data *bfqd)
 * next bfqq's I/O is brought forward dramatically,
 * for it is not blocked for milliseconds.
 *
-* The third if checks whether bfqq is a queue for
+* The third if checks whether there is a queue woken
+* by bfqq, and currently with pending I/O. Such a
+* woken queue does not steal bandwidth from bfqq,
+* because it remains soon without I/O if bfqq is not
+* served. So there is virtually no risk of loss of
+* bandwidth for bfqq if this woken queue has I/O
+* dispatched while bfqq is waiting for new I/O.
+*
+* The fourth if checks whether bfqq is a queue for
 * which it is better to avoid injection. It is so if
 * bfqq delivers more throughput when served without
 * any further I/O from other queues in the middle, or
@@ -4542,11 +4556,11 @@ static struct bfq_queue *bfq_select_queue(struct 
bfq_data *bfqd)
 * bfq_update_has_short_ttime(), it is rather likely
 * that, if I/O is being plugged for bfqq and the
 * waker queue has pending I/O requests that are
-* blocking bfqq's I/O, then the third alternative
+* blocking bfqq's I/O, then the fourth alternative
 * above lets the waker queue get served before the
 * I/O-plugging timeout fires. So one may deem the
 * second alternative superfluous. It is not, because
-* the third alternative may be way less effective in
+* the fourth alternative may be way less effective in
 * case of a synchronization. For two main
 * reasons. First, throughput may be low because the
 * inject limit may be too low to guarantee the same
@@ -4555,7 +4569,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data 
*bfqd)
 * guarantees (the second alternative unconditionally
 * injects a pending I/O request of the waker queue
 * for each bfq_dispatch_request()). Second, with the
-* third alternative, the duration of the plugging,
+* fourth alternative, the duration of the plugging,
 * i.e., the time before bfqq finally receives new I/O,
 * may not be minimized, because the waker queue may
 * happen to be served only after other queues.
@@ -4573,6 +4587,14 @@ static struct bfq_queue *bfq_select_queue(struct 
bfq_data *bfqd)
   bfq_bfqq_budget_left(bfqq->waker_bfqq)
)
bfqq = bfqq->waker_bfqq;
+   else if (blocked_bfqq &&
+  bfq_bfqq_busy(blocked_bfqq) &&
+  blocked_bfqq->next_rq &&
+  bfq_serv_to_charge(blocked_bfqq->next_rq,
+ blocke

Re: [PATCH BUGFIX/IMPROVEMENT 6/6] block, bfq: merge bursts of newly-created queues

2021-02-25 Thread Paolo Valente



> On 26 Jan 2021, at 17:15, Jens Axboe wrote:
> 
> On 1/26/21 3:51 AM, Paolo Valente wrote:
>> @@ -2809,6 +2853,12 @@ void bfq_release_process_ref(struct bfq_data *bfqd, 
>> struct bfq_queue *bfqq)
>>  bfqq != bfqd->in_service_queue)
>>  bfq_del_bfqq_busy(bfqd, bfqq, false);
>> 
>> +if (bfqq->entity.parent &&
>> +bfqq->entity.parent->last_bfqq_created == bfqq)
>> +bfqq->entity.parent->last_bfqq_created = NULL;
>> +else if (bfqq->bfqd && bfqq->bfqd->last_bfqq_created == bfqq)
>> +bfqq->bfqd->last_bfqq_created = NULL;
>> +
>>  bfq_put_queue(bfqq);
>> }
>> 
>> @@ -2905,6 +2955,13 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct 
>> bfq_io_cq *bic,
>>   */
>>  new_bfqq->pid = -1;
>>  bfqq->bic = NULL;
>> +
>> +if (bfqq->entity.parent &&
>> +bfqq->entity.parent->last_bfqq_created == bfqq)
>> +bfqq->entity.parent->last_bfqq_created = new_bfqq;
>> +else if (bfqq->bfqd && bfqq->bfqd->last_bfqq_created == bfqq)
>> +bfqq->bfqd->last_bfqq_created = new_bfqq;
>> +
>>  bfq_release_process_ref(bfqd, bfqq);
>> }
> 
> Almost identical code constructs make it seem like this should have a
> helper instead.
> 

Right, sorry. Improved in V2.
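
For example, a helper along these lines covers both constructs quoted above
(hypothetical sketch; the actual V2 helper may differ in name and details):

static void bfq_reassign_last_bfqq(struct bfq_queue *cur_bfqq,
				   struct bfq_queue *new_bfqq)
{
	/* Redirect whichever last_bfqq_created pointer references cur_bfqq */
	if (cur_bfqq->entity.parent &&
	    cur_bfqq->entity.parent->last_bfqq_created == cur_bfqq)
		cur_bfqq->entity.parent->last_bfqq_created = new_bfqq;
	else if (cur_bfqq->bfqd && cur_bfqq->bfqd->last_bfqq_created == cur_bfqq)
		cur_bfqq->bfqd->last_bfqq_created = new_bfqq;
}

Passing NULL as new_bfqq would cover the bfq_release_process_ref() case as well.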

>> @@ -5033,6 +5090,12 @@ void bfq_put_queue(struct bfq_queue *bfqq)
>>  bfqg_and_blkg_put(bfqg);
>> }
>> 
>> +static void bfq_put_stable_ref(struct bfq_queue *bfqq)
>> +{
>> +bfqq->stable_ref--;
>> +bfq_put_queue(bfqq);
>> +}
>> +
>> static void bfq_put_cooperator(struct bfq_queue *bfqq)
>> {
>>  struct bfq_queue *__bfqq, *next;
>> @@ -5089,6 +5152,17 @@ static void bfq_exit_icq(struct io_cq *icq)
>> {
>>  struct bfq_io_cq *bic = icq_to_bic(icq);
>> 
>> +if (bic->stable_merge_bfqq) {
>> +unsigned long flags;
>> +struct bfq_data *bfqd = bic->stable_merge_bfqq->bfqd;
>> +
>> +if (bfqd)
>> +spin_lock_irqsave(&bfqd->lock, flags);
>> +bfq_put_stable_ref(bic->stable_merge_bfqq);
>> +if (bfqd)
>> +spin_unlock_irqrestore(&bfqd->lock, flags);
>> +}
>> +
> 
> Construct like this are really painful. Just do:
> 
> if (bfqd) {
>   unsigned long flags;
> 
>   spin_lock_irqsave(&bfqd->lock, flags);
>   bfq_put_stable_ref(bic->stable_merge_bfqq);
>   spin_unlock_irqrestore(&bfqd->lock, flags);
> } else {
>   bfq_put_stable_ref(bic->stable_merge_bfqq);
> }
> 
> which is also less likely to cause code analyzer false warnings.

Done, thanks.

> Outside
> of that, it needs a comment on why it's ok NOT to grab the lock when
> bfqd is zero, because that seems counter-intuitive and more a case of
> "well we can't grab a lock for something we don't have". Maybe it's
> because bfqd is no longer visible at this point, and it's ok,

yes

> but it's
> definitely not clear just looking at this patch.

Right, the reason is already reported a few lines above, but not
repeated in this function.  I'll repeat it.


> Even with that, is the
> bfqq visible? Should the ref be atomic, and locking happen further down
> instead?
> 

Since the scheduler is gone, no pending I/O is expected to still
reference bfqq.  I'll write this too in V2.

As I stated in my reply to another comment of yours, I'll submit the
V2 soon, unless I receive a reply before then.

Thanks.
Paolo

> -- 
> Jens Axboe
> 



Re: [PATCH BUGFIX/IMPROVEMENT 1/6] block, bfq: always inject I/O of queues blocked by wakers

2021-02-25 Thread Paolo Valente



> On 26 Jan 2021, at 17:17, Jens Axboe wrote:
> 
> On 1/26/21 3:50 AM, Paolo Valente wrote:
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 445cef9c0bb9..a83149407336 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -4487,9 +4487,15 @@ static struct bfq_queue *bfq_select_queue(struct 
>> bfq_data *bfqd)
>>  bfq_bfqq_busy(bfqq->bic->bfqq[0]) &&
>>  bfqq->bic->bfqq[0]->next_rq ?
>>  bfqq->bic->bfqq[0] : NULL;
>> +struct bfq_queue *blocked_bfqq =
>> +!hlist_empty(&bfqq->woken_list) ?
>> +container_of(bfqq->woken_list.first,
>> + struct bfq_queue,
>> + woken_list_node)
>> +: NULL;
> 
> hlist_first_entry_or_null?
> 

I didn't find any such function.  There is a list_first_entry_or_null,
but it's for circular doubly linked lists.

I'll wait a little bit for your reply, then send a V2 with this patch
unchanged.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



Re: [PATCH 1/2] bfq: remove some useless logic of bfq_update_next_in_service()

2021-02-10 Thread Paolo Valente



> On 10 Feb 2021, at 16:21, Jens Axboe wrote:
> 
> On 2/10/21 8:20 AM, Oleksandr Natalenko wrote:
>> On Wed, Feb 10, 2021 at 12:13:29PM +0100, Paolo Valente wrote:
>>> 
>>> 
>>>> On 29 Jan 2021, at 11:51, Chunguang Xu wrote:
>>>> 
>>>> From: Chunguang Xu 
>>>> 
>>>> The if statement at the end of the function is obviously useless,
>>>> maybe we can delete it.
>>>> 
>>> 
>>> Thanks for spotting this mistake.
>>> 
>>> Acked-by: Paolo Valente 
>>> 
>>>> Signed-off-by: Chunguang Xu 
>>>> ---
>>>> block/bfq-wf2q.c | 3 ---
>>>> 1 file changed, 3 deletions(-)
>>>> 
>>>> diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
>>>> index 26776bd..070e34a 100644
>>>> --- a/block/bfq-wf2q.c
>>>> +++ b/block/bfq-wf2q.c
>>>> @@ -137,9 +137,6 @@ static bool bfq_update_next_in_service(struct 
>>>> bfq_sched_data *sd,
>>>> 
>>>>sd->next_in_service = next_in_service;
>>>> 
>>>> -  if (!next_in_service)
>>>> -  return parent_sched_may_change;
>>>> -
>> 
>> Unless I'm missing something, this has already been fixed here:
>> 
>> https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.12/block&id=1a23e06cdab2be07cbda460c6417d7de564c48e6
> 
> Yep indeed.
> 

I seem to remember this patch as well, but my memory is rather weak.

> -- 
> Jens Axboe



Re: [PATCH 2/2] bfq: amend the function name of bfq_may_expire_for_budg_timeout()

2021-02-10 Thread Paolo Valente



> On 29 Jan 2021, at 11:51, Chunguang Xu wrote:
> 
> From: Chunguang Xu 
> 
> The function name bfq_may_expire_for_budg_timeout() may be misspelled,
> try to fix it.
> 

Ok for me to make this name longer.

Thanks,
Paolo

> Signed-off-by: Chunguang Xu 
> ---
> block/bfq-iosched.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 9e4eb0f..4f40c61 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4061,7 +4061,7 @@ static bool bfq_bfqq_budget_timeout(struct bfq_queue 
> *bfqq)
>  * condition does not hold, or if the queue is slow enough to deserve
>  * only to be kicked off for preserving a high throughput.
>  */
> -static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
> +static bool bfq_may_expire_for_budget_timeout(struct bfq_queue *bfqq)
> {
>   bfq_log_bfqq(bfqq->bfqd, bfqq,
>   "may_budget_timeout: wait_request %d left %d timeout %d",
> @@ -4350,7 +4350,7 @@ static struct bfq_queue *bfq_select_queue(struct 
> bfq_data *bfqd)
>* on the case where bfq_bfqq_must_idle() returns true, in
>* bfq_completed_request().
>*/
> - if (bfq_may_expire_for_budg_timeout(bfqq) &&
> + if (bfq_may_expire_for_budget_timeout(bfqq) &&
>   !bfq_bfqq_must_idle(bfqq))
>   goto expire;
> 
> @@ -5706,7 +5706,7 @@ static void bfq_completed_request(struct bfq_queue 
> *bfqq, struct bfq_data *bfqd)
>* of its reserved service guarantees.
>*/
>   return;
> - } else if (bfq_may_expire_for_budg_timeout(bfqq))
> + } else if (bfq_may_expire_for_budget_timeout(bfqq))
>   bfq_bfqq_expire(bfqd, bfqq, false,
>   BFQQE_BUDGET_TIMEOUT);
>   else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
> -- 
> 1.8.3.1
> 



Re: [PATCH 1/2] bfq: remove some useless logic of bfq_update_next_in_service()

2021-02-10 Thread Paolo Valente



> On 29 Jan 2021, at 11:51, Chunguang Xu wrote:
> 
> From: Chunguang Xu 
> 
> The if statement at the end of the function is obviously useless,
> maybe we can delete it.
> 

Thanks for spotting this mistake.

Acked-by: Paolo Valente 

> Signed-off-by: Chunguang Xu 
> ---
> block/bfq-wf2q.c | 3 ---
> 1 file changed, 3 deletions(-)
> 
> diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
> index 26776bd..070e34a 100644
> --- a/block/bfq-wf2q.c
> +++ b/block/bfq-wf2q.c
> @@ -137,9 +137,6 @@ static bool bfq_update_next_in_service(struct 
> bfq_sched_data *sd,
> 
>   sd->next_in_service = next_in_service;
> 
> - if (!next_in_service)
> - return parent_sched_may_change;
> -
>   return parent_sched_may_change;
> }
> 
> -- 
> 1.8.3.1
> 



Re: [PATCH BUGFIX/IMPROVEMENT 2/6] block, bfq: put reqs of waker and woken in dispatch list

2021-02-09 Thread Paolo Valente



> On 5 Feb 2021, at 11:16, Paolo Valente wrote:
> 
> 
> 
>> On 3 Feb 2021, at 12:43, Jan Kara wrote:
>> 
>> On Thu 28-01-21 18:54:05, Paolo Valente wrote:
>>> 
>>> 
>>>> On 26 Jan 2021, at 17:18, Jens Axboe wrote:
>>>> 
>>>> On 1/26/21 3:50 AM, Paolo Valente wrote:
>>>>> Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
>>>>> this happens, the only active bfq_queues are bfqq and either its waker
>>>>> bfq_queue or one of its woken bfq_queues, then there is no point in
>>>>> queueing this new I/O request in bfqq for service. In fact, the
>>>>> in-service queue and bfqq agree on serving this new I/O request as
>>>>> soon as possible. So this commit puts this new I/O request directly
>>>>> into the dispatch list.
>>>>> 
>>>>> Tested-by: Jan Kara 
>>>>> Signed-off-by: Paolo Valente 
>>>>> ---
>>>>> block/bfq-iosched.c | 17 -
>>>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>>> 
>>>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>>>>> index a83149407336..e5b83910fbe0 100644
>>>>> --- a/block/bfq-iosched.c
>>>>> +++ b/block/bfq-iosched.c
>>>>> @@ -5640,7 +5640,22 @@ static void bfq_insert_request(struct 
>>>>> blk_mq_hw_ctx *hctx, struct request *rq,
>>>>> 
>>>>>   spin_lock_irq(&bfqd->lock);
>>>>>   bfqq = bfq_init_rq(rq);
>>>>> - if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
>>>>> +
>>>>> + /*
>>>>> +  * Additional case for putting rq directly into the dispatch
>>>>> +  * queue: the only active bfq_queues are bfqq and either its
>>>>> +  * waker bfq_queue or one of its woken bfq_queues. In this
>>>>> +  * case, there is no point in queueing rq in bfqq for
>>>>> +  * service. In fact, the in-service queue and bfqq agree on
>>>>> +  * serving this new I/O request as soon as possible.
>>>>> +  */
>>>>> + if (!bfqq ||
>>>>> + (bfqq != bfqd->in_service_queue &&
>>>>> +  bfqd->in_service_queue != NULL &&
>>>>> +  bfq_tot_busy_queues(bfqd) == 1 + bfq_bfqq_busy(bfqq) &&
>>>>> +  (bfqq->waker_bfqq == bfqd->in_service_queue ||
>>>>> +   bfqd->in_service_queue->waker_bfqq == bfqq)) ||
>>>>> + at_head || blk_rq_is_passthrough(rq)) {
>>>>>   if (at_head)
>>>>>   list_add(&rq->queuelist, &bfqd->dispatch);
>>>>>   else
>>>>> 
>>>> 
>>>> This is unreadable... Just seems like you are piling heuristics in to
>>>> catch some case, and it's neither readable nor clean.
>>>> 
>>> 
>>> Yeah, these comments inappropriately assume that the reader knows the
>>> waker mechanism in depth.  And they do not stress at all how important
>>> this improvement is.
>>> 
>>> I'll do my best to improve these comments.
>>> 
>>> To try to do a better job, let me also explain the matter early here.
>>> Maybe you or others can give me some early feedback (or just tell me
>>> to proceed).
>>> 
>>> This change is one of the main improvements that boosted
>>> throughput in Jan's tests.  Here is the rationale:
>>> - consider a bfq_queue, say Q1, detected as a waker of another
>>> bfq_queue, say Q2
>>> - by definition of a waker, Q1 blocks the I/O of Q2, i.e., some I/O of
>>> of Q1 needs to be completed for new I/O of Q1 to arrive.  A notable
>> ^^ Q2?
>> 
> 
> Yes, thank you!
> 
> (after this interaction, I'll fix and improve all this description,
> according to your comments)
> 
>>> example is journald
>>> - so, Q1 and Q2 are in any respect two cooperating processes: if the
>>> service of Q1's I/O is delayed, Q2 can only suffer from it.
>>> Conversely, if Q2's I/O is delayed, the purpose of Q1 is just defeated.
>> 
>> What do you exactly mean by this last sentence?
> 
> By definition of waker, the purpose of Q1's I/O is doing what needs to
> be done, so that new Q2's I/O can finally be issued.  Delaying Q2's I/O
> is the opposite of this goal.
> 
>>

Re: [PATCH BUGFIX/IMPROVEMENT 2/6] block, bfq: put reqs of waker and woken in dispatch list

2021-02-05 Thread Paolo Valente



> On 3 Feb 2021, at 12:43, Jan Kara wrote:
> 
> On Thu 28-01-21 18:54:05, Paolo Valente wrote:
>> 
>> 
>>> On 26 Jan 2021, at 17:18, Jens Axboe wrote:
>>> 
>>> On 1/26/21 3:50 AM, Paolo Valente wrote:
>>>> Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
>>>> this happens, the only active bfq_queues are bfqq and either its waker
>>>> bfq_queue or one of its woken bfq_queues, then there is no point in
>>>> queueing this new I/O request in bfqq for service. In fact, the
>>>> in-service queue and bfqq agree on serving this new I/O request as
>>>> soon as possible. So this commit puts this new I/O request directly
>>>> into the dispatch list.
>>>> 
>>>> Tested-by: Jan Kara 
>>>> Signed-off-by: Paolo Valente 
>>>> ---
>>>> block/bfq-iosched.c | 17 -
>>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>>>> index a83149407336..e5b83910fbe0 100644
>>>> --- a/block/bfq-iosched.c
>>>> +++ b/block/bfq-iosched.c
>>>> @@ -5640,7 +5640,22 @@ static void bfq_insert_request(struct blk_mq_hw_ctx 
>>>> *hctx, struct request *rq,
>>>> 
>>>>spin_lock_irq(&bfqd->lock);
>>>>bfqq = bfq_init_rq(rq);
>>>> -  if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
>>>> +
>>>> +  /*
>>>> +   * Additional case for putting rq directly into the dispatch
>>>> +   * queue: the only active bfq_queues are bfqq and either its
>>>> +   * waker bfq_queue or one of its woken bfq_queues. In this
>>>> +   * case, there is no point in queueing rq in bfqq for
>>>> +   * service. In fact, the in-service queue and bfqq agree on
>>>> +   * serving this new I/O request as soon as possible.
>>>> +   */
>>>> +  if (!bfqq ||
>>>> +  (bfqq != bfqd->in_service_queue &&
>>>> +   bfqd->in_service_queue != NULL &&
>>>> +   bfq_tot_busy_queues(bfqd) == 1 + bfq_bfqq_busy(bfqq) &&
>>>> +   (bfqq->waker_bfqq == bfqd->in_service_queue ||
>>>> +bfqd->in_service_queue->waker_bfqq == bfqq)) ||
>>>> +  at_head || blk_rq_is_passthrough(rq)) {
>>>>if (at_head)
>>>>list_add(&rq->queuelist, &bfqd->dispatch);
>>>>else
>>>> 
>>> 
>>> This is unreadable... Just seems like you are piling heuristics in to
>>> catch some case, and it's neither readable nor clean.
>>> 
>> 
>> Yeah, these comments inappropriately assume that the reader knows the
>> waker mechanism in depth.  And they do not stress at all how important
>> this improvement is.
>> 
>> I'll do my best to improve these comments.
>> 
>> To try to do a better job, let me also explain the matter early here.
>> Maybe you or others can give me some early feedback (or just tell me
>> to proceed).
>> 
>> This change is one of the main improvements that boosted
>> throughput in Jan's tests.  Here is the rationale:
>> - consider a bfq_queue, say Q1, detected as a waker of another
>>  bfq_queue, say Q2
>> - by definition of a waker, Q1 blocks the I/O of Q2, i.e., some I/O of
>>  of Q1 needs to be completed for new I/O of Q1 to arrive.  A notable
>  ^^ Q2?
> 

Yes, thank you!

(after this interaction, I'll fix and improve all this description,
according to your comments)

>>  example is journald
>> - so, Q1 and Q2 are in any respect two cooperating processes: if the
>>  service of Q1's I/O is delayed, Q2 can only suffer from it.
>>  Conversely, if Q2's I/O is delayed, the purpose of Q1 is just defeated.
> 
> What do you exactly mean by this last sentence?

By definition of waker, the purpose of Q1's I/O is doing what needs to
be done, so that new Q2's I/O can finally be issued.  Delaying Q2's I/O
is the opposite of this goal.

> 
>> - as a consequence if some I/O of Q1/Q2 arrives while Q2/Q1 is the
>>  only queue in service, there is absolutely no point in delaying the
>>  service of such an I/O.  The only possible result is a throughput
>>  loss, detected by Jan's test
> 
> If we are idling at that moment waiting for more IO from the in-service queue,
> I agree.

And I agree too, if the drive has no internal queueing, has no
parallelism or pipeli

Re: [PATCH BUGFIX/IMPROVEMENT 2/6] block, bfq: put reqs of waker and woken in dispatch list

2021-02-03 Thread Paolo Valente



> On 28 Jan 2021, at 18:54, Paolo Valente wrote:
> 
> 
> 
>> On 26 Jan 2021, at 17:18, Jens Axboe wrote:
>> 
>> On 1/26/21 3:50 AM, Paolo Valente wrote:
>>> Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
>>> this happens, the only active bfq_queues are bfqq and either its waker
>>> bfq_queue or one of its woken bfq_queues, then there is no point in
>>> queueing this new I/O request in bfqq for service. In fact, the
>>> in-service queue and bfqq agree on serving this new I/O request as
>>> soon as possible. So this commit puts this new I/O request directly
>>> into the dispatch list.
>>> 
>>> Tested-by: Jan Kara 
>>> Signed-off-by: Paolo Valente 
>>> ---
>>> block/bfq-iosched.c | 17 -
>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>>> index a83149407336..e5b83910fbe0 100644
>>> --- a/block/bfq-iosched.c
>>> +++ b/block/bfq-iosched.c
>>> @@ -5640,7 +5640,22 @@ static void bfq_insert_request(struct blk_mq_hw_ctx 
>>> *hctx, struct request *rq,
>>> 
>>> spin_lock_irq(&bfqd->lock);
>>> bfqq = bfq_init_rq(rq);
>>> -   if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
>>> +
>>> +   /*
>>> +* Additional case for putting rq directly into the dispatch
>>> +* queue: the only active bfq_queues are bfqq and either its
>>> +* waker bfq_queue or one of its woken bfq_queues. In this
>>> +* case, there is no point in queueing rq in bfqq for
>>> +* service. In fact, the in-service queue and bfqq agree on
>>> +* serving this new I/O request as soon as possible.
>>> +*/
>>> +   if (!bfqq ||
>>> +   (bfqq != bfqd->in_service_queue &&
>>> +bfqd->in_service_queue != NULL &&
>>> +bfq_tot_busy_queues(bfqd) == 1 + bfq_bfqq_busy(bfqq) &&
>>> +(bfqq->waker_bfqq == bfqd->in_service_queue ||
>>> + bfqd->in_service_queue->waker_bfqq == bfqq)) ||
>>> +   at_head || blk_rq_is_passthrough(rq)) {
>>> if (at_head)
>>> list_add(&rq->queuelist, &bfqd->dispatch);
>>> else
>>> 
>> 
>> This is unreadable... Just seems like you are piling heuristics in to
>> catch some case, and it's neither readable nor clean.
>> 
> 
> Yeah, these comments inappropriately assume that the reader knows the
> waker mechanism in depth.  And they do not stress at all how important
> this improvement is.
> 
> I'll do my best to improve these comments.
> 
> To try to do a better job, let me also explain the matter early here.
> Maybe you or others can give me some early feedback (or just tell me
> to proceed).
> 
> This change is one of the main improvements that boosted
> throughput in Jan's tests.  Here is the rationale:
> - consider a bfq_queue, say Q1, detected as a waker of another
>  bfq_queue, say Q2
> - by definition of a waker, Q1 blocks the I/O of Q2, i.e., some I/O of
>  of Q1 needs to be completed for new I/O of Q1 to arrive.  A notable
>  example is journald
> - so, Q1 and Q2 are in any respect two cooperating processes: if the
>  service of Q1's I/O is delayed, Q2 can only suffer from it.
>  Conversely, if Q2's I/O is delayed, the purpose of Q1 is just defeated.
> - as a consequence if some I/O of Q1/Q2 arrives while Q2/Q1 is the
>  only queue in service, there is absolutely no point in delaying the
>  service of such an I/O.  The only possible result is a throughput
>  loss, detected by Jan's test
> - so, when the above condition holds, the most effective and efficient
>  action is to put the new I/O directly in the dispatch list
> - as an additional restriction, Q1 and Q2 must be the only busy queues
>  for this commit to put the I/O of Q2/Q1 in the dispatch list.  This is
>  necessary, because, if also other queues are waiting for service, then
>  putting new I/O directly in the dispatch list may evidently cause a
>  violation of service guarantees for the other queues
> 
> If these comments make things clearer, then I'll put them in the
> commit message and the code, and I'll proceed with a V2.
> 

Hi Jens,
may I proceed with a V2?

Thanks,
Paolo

> Thanks,
> Paolo
> 
> 
>> -- 
>> Jens Axboe



Re: [PATCH] Revert "bfq: Fix computation of shallow depth"

2021-02-01 Thread Paolo Valente



> On 1 Feb 2021, at 08:32, Lin Feng wrote:
> 
> Hi, it seems that this patch was blocked by the Linux mailing list servers, so ping
> again.
> 
> Based on 
> https://patchwork.kernel.org/project/linux-block/patch/20201210094433.25491-1-j...@suse.cz/,
> it looks like we have reached a consensus about changing
> bfqd->word_depths[2][2], so now the
> computation code for bfq's word_depths array is not necessary and one
> variable is enough.
> 
> But IMHO async depth limitation for slow drivers is essential, which is what
> we always did in the cfq age.
> 

It is essential.

Thanks,
Paolo

> On 1/29/21 19:18, Lin Feng wrote:
>> This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.
>> bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core
>> sbitmap_get_shallow, which uses just the number to limit the scan depth of
>> each bitmap word, formula:
>> scan_percentage_for_each_word = shallow_depth / (1 << sbitmap->shift) * 100%
>> That means the comments' percentiles 50%, 75%, 18%, 37% of bfq are correct.
>> But after the patch 'bfq: Fix computation of shallow depth', we use
>> sbitmap.depth instead, as an example in the following case:
>> sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
>> The results of the computed bfqd->word_depths[] are {128, 192, 48, 96}, and
>> three of the numbers exceed the core driver's 'sbitmap_word.depth=64', so they
>> limit nothing. Do we really not want to limit depth for such workloads, or do we
>> just want to bump up the percentiles to 100%?
>> Please correct me if I miss something, thanks.
>> Signed-off-by: Lin Feng 
>> ---
>>  block/bfq-iosched.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 9e4eb0fc1c16..9e81d1052091 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -6332,13 +6332,13 @@ static unsigned int bfq_update_depths(struct 
>> bfq_data *bfqd,
>>   * limit 'something'.
>>   */
>>  /* no more than 50% of tags for async I/O */
>> -bfqd->word_depths[0][0] = max(bt->sb.depth >> 1, 1U);
>> +bfqd->word_depths[0][0] = max((1U << bt->sb.shift) >> 1, 1U);
>>  /*
>>   * no more than 75% of tags for sync writes (25% extra tags
>>   * w.r.t. async I/O, to prevent async I/O from starving sync
>>   * writes)
>>   */
>> -bfqd->word_depths[0][1] = max((bt->sb.depth * 3) >> 2, 1U);
>> +bfqd->word_depths[0][1] = max(((1U << bt->sb.shift) * 3) >> 2, 1U);
>>  /*
>>   * In-word depths in case some bfq_queue is being weight-
>> @@ -6348,9 +6348,9 @@ static unsigned int bfq_update_depths(struct bfq_data 
>> *bfqd,
>>   * shortage.
>>   */
>>  /* no more than ~18% of tags for async I/O */
>> -bfqd->word_depths[1][0] = max((bt->sb.depth * 3) >> 4, 1U);
>> +bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U);
>>  /* no more than ~37% of tags for sync writes (~20% extra tags) */
>> -bfqd->word_depths[1][1] = max((bt->sb.depth * 6) >> 4, 1U);
>> +bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U);
>>  for (i = 0; i < 2; i++)
>>  for (j = 0; j < 2; j++)
> 



Re: [PATCH] Revert "bfq: Fix computation of shallow depth"

2021-02-01 Thread Paolo Valente



> On 29 Jan 2021, at 12:18, Lin Feng wrote:
> 
> This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.
> 
> bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core
> sbitmap_get_shallow, which uses just the number to limit the scan depth of
> each bitmap word, formula:
> scan_percentage_for_each_word = shallow_depth / (1 << sbitmap->shift) * 100%
> 
> That means the comments' percentiles 50%, 75%, 18%, 37% of bfq are correct.
> But after the patch 'bfq: Fix computation of shallow depth', we use
> sbitmap.depth instead, as an example in the following case:
> 
> sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
> The results of the computed bfqd->word_depths[] are {128, 192, 48, 96}, and
> three of the numbers exceed the core driver's 'sbitmap_word.depth=64', so they
> limit nothing. Do we really not want to limit depth for such workloads, or do we
> just want to bump up the percentiles to 100%?
> 
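
A quick illustrative check of the numbers quoted above (userspace C, assuming
sbitmap.depth = 256 and shift = 6, i.e. a per-word depth of 64):

#include <stdio.h>

int main(void)
{
	unsigned int depth = 256, shift = 6;
	unsigned int w = 1U << shift;		/* per-word depth = 64 */

	/* scaling the whole map depth (the commit being reverted) */
	printf("%u %u %u %u\n", depth >> 1, (depth * 3) >> 2,
	       (depth * 3) >> 4, (depth * 6) >> 4);	/* 128 192 48 96 */

	/* scaling the per-word depth (the formula the revert restores) */
	printf("%u %u %u %u\n", w >> 1, (w * 3) >> 2,
	       (w * 3) >> 4, (w * 6) >> 4);		/* 32 48 12 24 */
	return 0;
}

The first line reproduces the {128, 192, 48, 96} quoted above; three of the four
values exceed the per-word limit of 64 and therefore constrain nothing.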

Bumping to 100% would be a mistake.

Thanks,
Paolo

> Please correct me if I miss something, thanks.
> 
> Signed-off-by: Lin Feng 
> ---
> block/bfq-iosched.c | 8 
> 1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 9e4eb0fc1c16..9e81d1052091 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -6332,13 +6332,13 @@ static unsigned int bfq_update_depths(struct bfq_data 
> *bfqd,
>* limit 'something'.
>*/
>   /* no more than 50% of tags for async I/O */
> - bfqd->word_depths[0][0] = max(bt->sb.depth >> 1, 1U);
> + bfqd->word_depths[0][0] = max((1U << bt->sb.shift) >> 1, 1U);
>   /*
>* no more than 75% of tags for sync writes (25% extra tags
>* w.r.t. async I/O, to prevent async I/O from starving sync
>* writes)
>*/
> - bfqd->word_depths[0][1] = max((bt->sb.depth * 3) >> 2, 1U);
> + bfqd->word_depths[0][1] = max(((1U << bt->sb.shift) * 3) >> 2, 1U);
> 
>   /*
>* In-word depths in case some bfq_queue is being weight-
> @@ -6348,9 +6348,9 @@ static unsigned int bfq_update_depths(struct bfq_data 
> *bfqd,
>* shortage.
>*/
>   /* no more than ~18% of tags for async I/O */
> - bfqd->word_depths[1][0] = max((bt->sb.depth * 3) >> 4, 1U);
> + bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U);
>   /* no more than ~37% of tags for sync writes (~20% extra tags) */
> - bfqd->word_depths[1][1] = max((bt->sb.depth * 6) >> 4, 1U);
> + bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U);
> 
>   for (i = 0; i < 2; i++)
>   for (j = 0; j < 2; j++)
> -- 
> 2.25.4
> 



Re: [PATCH BUGFIX/IMPROVEMENT 2/6] block, bfq: put reqs of waker and woken in dispatch list

2021-01-28 Thread Paolo Valente



> On 26 Jan 2021, at 17:18, Jens Axboe wrote:
> 
> On 1/26/21 3:50 AM, Paolo Valente wrote:
>> Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
>> this happens, the only active bfq_queues are bfqq and either its waker
>> bfq_queue or one of its woken bfq_queues, then there is no point in
>> queueing this new I/O request in bfqq for service. In fact, the
>> in-service queue and bfqq agree on serving this new I/O request as
>> soon as possible. So this commit puts this new I/O request directly
>> into the dispatch list.
>> 
>> Tested-by: Jan Kara 
>> Signed-off-by: Paolo Valente 
>> ---
>> block/bfq-iosched.c | 17 -
>> 1 file changed, 16 insertions(+), 1 deletion(-)
>> 
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index a83149407336..e5b83910fbe0 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -5640,7 +5640,22 @@ static void bfq_insert_request(struct blk_mq_hw_ctx 
>> *hctx, struct request *rq,
>> 
>>  spin_lock_irq(>lock);
>>  bfqq = bfq_init_rq(rq);
>> -if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
>> +
>> +/*
>> + * Additional case for putting rq directly into the dispatch
>> + * queue: the only active bfq_queues are bfqq and either its
>> + * waker bfq_queue or one of its woken bfq_queues. In this
>> + * case, there is no point in queueing rq in bfqq for
>> + * service. In fact, the in-service queue and bfqq agree on
>> + * serving this new I/O request as soon as possible.
>> + */
>> +if (!bfqq ||
>> +(bfqq != bfqd->in_service_queue &&
>> + bfqd->in_service_queue != NULL &&
>> + bfq_tot_busy_queues(bfqd) == 1 + bfq_bfqq_busy(bfqq) &&
>> + (bfqq->waker_bfqq == bfqd->in_service_queue ||
>> +  bfqd->in_service_queue->waker_bfqq == bfqq)) ||
>> +at_head || blk_rq_is_passthrough(rq)) {
>>  if (at_head)
>>  list_add(>queuelist, >dispatch);
>>  else
>> 
> 
> This is unreadable... Just seems like you are piling heuristics in to
> catch some case, and it's neither readable nor clean.
> 

Yeah, these comments inappropriately assume that the reader knows the
waker mechanism in depth.  And they do not stress at all how important
this improvement is.

I'll do my best to improve these comments.

To try to do a better job, let me also explain the matter early here.
Maybe you or others can give me some early feedback (or just tell me
to proceed).

This change is one of the main improvements that boosted
throughput in Jan's tests.  Here is the rationale:
- consider a bfq_queue, say Q1, detected as a waker of another
  bfq_queue, say Q2
- by definition of a waker, Q1 blocks the I/O of Q2, i.e., some I/O
  of Q1 needs to be completed for new I/O of Q2 to arrive.  A notable
  example is journald
- so, Q1 and Q2 are in any respect two cooperating processes: if the
  service of Q1's I/O is delayed, Q2 can only suffer from it.
  Conversely, if Q2's I/O is delayed, the purpose of Q1 is just defeated.
- as a consequence if some I/O of Q1/Q2 arrives while Q2/Q1 is the
  only queue in service, there is absolutely no point in delaying the
  service of such an I/O.  The only possible result is a throughput
  loss, detected by Jan's test
- so, when the above condition holds, the most effective and efficient
  action is to put the new I/O directly in the dispatch list
- as an additional restriction, Q1 and Q2 must be the only busy queues
  for this commit to put the I/O of Q2/Q1 in the dispatch list.  This is
  necessary, because, if also other queues are waiting for service, then
  putting new I/O directly in the dispatch list may evidently cause a
  violation of service guarantees for the other queues

If these comments make things clearer, then I'll put them in the
commit message and the code, and I'll proceed with a V2.
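
Just to show one possible way to make the new condition more readable, here is
a sketch that factors it out into a helper. The helper name is made up; the
checks are exactly those in the patch above:

/*
 * Readability sketch only (hypothetical helper name): true if rq can go
 * straight into the dispatch list because bfqq and the in-service queue
 * are the only busy queues and one is the waker of the other.
 */
static bool bfq_only_waker_and_woken_busy(struct bfq_data *bfqd,
                                          struct bfq_queue *bfqq)
{
        struct bfq_queue *in_serv = bfqd->in_service_queue;

        if (!in_serv || bfqq == in_serv)
                return false;

        /* bfqq and the in-service queue must be the only busy queues */
        if (bfq_tot_busy_queues(bfqd) != 1 + bfq_bfqq_busy(bfqq))
                return false;

        /* and one of the two must be the waker of the other */
        return bfqq->waker_bfqq == in_serv || in_serv->waker_bfqq == bfqq;
}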

Thanks,
Paolo


> -- 
> Jens Axboe
> 



[PATCH BUGFIX/IMPROVEMENT 2/6] block, bfq: put reqs of waker and woken in dispatch list

2021-01-26 Thread Paolo Valente
Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
this happens, the only active bfq_queues are bfqq and either its waker
bfq_queue or one of its woken bfq_queues, then there is no point in
queueing this new I/O request in bfqq for service. In fact, the
in-service queue and bfqq agree on serving this new I/O request as
soon as possible. So this commit puts this new I/O request directly
into the dispatch list.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a83149407336..e5b83910fbe0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5640,7 +5640,22 @@ static void bfq_insert_request(struct blk_mq_hw_ctx 
*hctx, struct request *rq,
 
spin_lock_irq(>lock);
bfqq = bfq_init_rq(rq);
-   if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
+
+   /*
+* Additional case for putting rq directly into the dispatch
+* queue: the only active bfq_queues are bfqq and either its
+* waker bfq_queue or one of its woken bfq_queues. In this
+* case, there is no point in queueing rq in bfqq for
+* service. In fact, the in-service queue and bfqq agree on
+* serving this new I/O request as soon as possible.
+*/
+   if (!bfqq ||
+   (bfqq != bfqd->in_service_queue &&
+bfqd->in_service_queue != NULL &&
+bfq_tot_busy_queues(bfqd) == 1 + bfq_bfqq_busy(bfqq) &&
+(bfqq->waker_bfqq == bfqd->in_service_queue ||
+ bfqd->in_service_queue->waker_bfqq == bfqq)) ||
+   at_head || blk_rq_is_passthrough(rq)) {
if (at_head)
list_add(>queuelist, >dispatch);
else
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 0/6] block, bfq: third and last batch of fixes and improvements

2021-01-26 Thread Paolo Valente
Hi,
here's batch 3/3.

Thanks,
Paolo

Paolo Valente (6):
  block, bfq: always inject I/O of queues blocked by wakers
  block, bfq: put reqs of waker and woken in dispatch list
  block, bfq: make shared queues inherit wakers
  block, bfq: fix weight-raising resume with !low_latency
  block, bfq: keep shared queues out of the waker mechanism
  block, bfq: merge bursts of newly-created queues

 block/bfq-cgroup.c  |   2 +
 block/bfq-iosched.c | 362 +---
 block/bfq-iosched.h |  15 ++
 block/bfq-wf2q.c|   8 +
 4 files changed, 365 insertions(+), 22 deletions(-)

--
2.20.1


[PATCH BUGFIX/IMPROVEMENT 3/6] block, bfq: make shared queues inherit wakers

2021-01-26 Thread Paolo Valente
Consider a bfq_queue bfqq that is about to be merged with another
bfq_queue new_bfqq. The processes associated with bfqq are cooperators
of the processes associated with new_bfqq. So, if bfqq has a waker,
then it is reasonable (and beneficial for throughput) to assume that
all these processes will be happy to let bfqq's waker freely inject
I/O when they have no I/O. So this commit makes new_bfqq inherit
bfqq's waker.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index e5b83910fbe0..c5bda33c0923 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2819,6 +2819,29 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq 
*bic,
bfq_mark_bfqq_IO_bound(new_bfqq);
bfq_clear_bfqq_IO_bound(bfqq);
 
+   /*
+* The processes associated with bfqq are cooperators of the
+* processes associated with new_bfqq. So, if bfqq has a
+* waker, then assume that all these processes will be happy
+* to let bfqq's waker freely inject I/O when they have no
+* I/O.
+*/
+   if (bfqq->waker_bfqq && !new_bfqq->waker_bfqq &&
+   bfqq->waker_bfqq != new_bfqq) {
+   new_bfqq->waker_bfqq = bfqq->waker_bfqq;
+   new_bfqq->tentative_waker_bfqq = NULL;
+
+   /*
+* If the waker queue disappears, then
+* new_bfqq->waker_bfqq must be reset. So insert
+* new_bfqq into the woken_list of the waker. See
+* bfq_check_waker for details.
+*/
+   hlist_add_head(_bfqq->woken_list_node,
+  _bfqq->waker_bfqq->woken_list);
+
+   }
+
/*
 * If bfqq is weight-raised, then let new_bfqq inherit
 * weight-raising. To reduce false positives, neglect the case
@@ -6276,7 +6299,7 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
if (likely(!new_queue)) {
/* If the queue was seeky for too long, break it apart. */
if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
-   bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+   struct bfq_queue *old_bfqq = bfqq;
 
/* Update bic before losing reference to bfqq */
if (bfq_bfqq_in_large_burst(bfqq))
@@ -6285,11 +6308,24 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
bfqq = bfq_split_bfqq(bic, bfqq);
split = true;
 
-   if (!bfqq)
+   if (!bfqq) {
bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
 true, is_sync,
 NULL);
-   else
+   bfqq->waker_bfqq = old_bfqq->waker_bfqq;
+   bfqq->tentative_waker_bfqq = NULL;
+
+   /*
+* If the waker queue disappears, then
+* new_bfqq->waker_bfqq must be
+* reset. So insert new_bfqq into the
+* woken_list of the waker. See
+* bfq_check_waker for details.
+*/
+   if (bfqq->waker_bfqq)
+   hlist_add_head(>woken_list_node,
+  
>waker_bfqq->woken_list);
+   } else
bfqq_already_existing = true;
}
}
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 5/6] block, bfq: keep shared queues out of the waker mechanism

2021-01-26 Thread Paolo Valente
Shared queues are likely to receive I/O at a high rate. This may
deceptively let them be considered as wakers of other queues. But a
false waker will unjustly steal bandwidth from its supposedly woken
queue. So also considering shared queues in the waking mechanism may
cause more control problems than throughput benefits. This commit
keeps shared queues out of the waker-detection mechanism.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0c7e203085f1..23d0dd7bd90f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5825,7 +5825,17 @@ static void bfq_completed_request(struct bfq_queue 
*bfqq, struct bfq_data *bfqd)
1UL<<(BFQ_RATE_SHIFT - 10))
bfq_update_rate_reset(bfqd, NULL);
bfqd->last_completion = now_ns;
-   bfqd->last_completed_rq_bfqq = bfqq;
+   /*
+* Shared queues are likely to receive I/O at a high
+* rate. This may deceptively let them be considered as wakers
+* of other queues. But a false waker will unjustly steal
+* bandwidth to its supposedly woken queue. So considering
+* also shared queues in the waking mechanism may cause more
+* control troubles than throughput benefits. Then do not set
+* last_completed_rq_bfqq to bfqq if bfqq is a shared queue.
+*/
+   if (!bfq_bfqq_coop(bfqq))
+   bfqd->last_completed_rq_bfqq = bfqq;
 
/*
 * If we are waiting to discover whether the request pattern
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 4/6] block, bfq: fix weight-raising resume with !low_latency

2021-01-26 Thread Paolo Valente
When the low_latency tunable is off, bfq_queues must not start to be
weight-raised. Unfortunately, this may mistakenly happen when the
state of a previously weight-raised bfq_queue is resumed after a queue
split. This commit fixes this error.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c5bda33c0923..0c7e203085f1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1010,7 +1010,7 @@ static void
 bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
  struct bfq_io_cq *bic, bool bfq_already_existing)
 {
-   unsigned int old_wr_coeff = bfqq->wr_coeff;
+   unsigned int old_wr_coeff = 1;
bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
 
if (bic->saved_has_short_ttime)
@@ -1031,7 +1031,13 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
bfqq->ttime = bic->saved_ttime;
bfqq->io_start_time = bic->saved_io_start_time;
bfqq->tot_idle_time = bic->saved_tot_idle_time;
-   bfqq->wr_coeff = bic->saved_wr_coeff;
+   /*
+* Restore weight coefficient only if low_latency is on
+*/
+   if (bfqd->low_latency) {
+   old_wr_coeff = bfqq->wr_coeff;
+   bfqq->wr_coeff = bic->saved_wr_coeff;
+   }
bfqq->service_from_wr = bic->saved_service_from_wr;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
bfqq->last_wr_start_finish = bic->saved_last_wr_start_finish;
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 6/6] block, bfq: merge bursts of newly-created queues

2021-01-26 Thread Paolo Valente
Many throughput-sensitive workloads are made of several parallel I/O
flows, with all flows generated by the same application, or more
generically by the same task (e.g., system boot). The most
counterproductive action with these workloads is plugging I/O dispatch
when one of the bfq_queues associated with these flows remains
temporarily empty.

To avoid this plugging, BFQ has been using a burst-handling mechanism
for years now. This mechanism has proven effective for throughput, and
not detrimental for service guarantees. This commit pushes this
mechanism a little bit further, basing on the following two facts.

First, all the I/O flows of the same application or task contribute
to the execution/completion of that common application or task. So the
performance figures that matter are total throughput of the flows and
task-wide I/O latency.  In particular, these flows do not need to be
protected from each other, in terms of individual bandwidth or
latency.

Second, the above fact holds regardless of the number of flows.

Putting these two facts together, this commit stably merges the
bfq_queues associated with these I/O flows, i.e., with the processes
that generate these I/O flows, regardless of how many processes are
involved.

To decide whether a set of bfq_queues is actually associated with the
I/O flows of a common application or task, and to merge these queues
stably, this commit operates as follows: given a bfq_queue, say Q2,
currently being created, and the last bfq_queue, say Q1, created
before Q2, Q2 is merged stably with Q1 if
- very little time has elapsed since when Q1 was created
- Q2 has the same ioprio as Q1
- Q2 belongs to the same group as Q1
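
A sketch of the decision just described follows; the helper name, the field
names and the time threshold are illustrative assumptions, not the exact code
of this patch:

/*
 * Illustrative sketch only: Q2 (bfqq) is stably merged with Q1
 * (last_created) when Q1 was created very recently and the two queues
 * have the same ioprio and belong to the same group.
 */
static bool bfq_may_stably_merge(struct bfq_queue *last_created,
                                 struct bfq_queue *bfqq,
                                 unsigned long max_creation_interval)
{
        return last_created &&
               time_before(bfqq->creation_time,
                           last_created->creation_time +
                           max_creation_interval) &&
               bfqq->new_ioprio == last_created->new_ioprio &&
               bfqq_group(bfqq) == bfqq_group(last_created);
}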

Merging bfq_queues also reduces scheduling overhead. A fio test with
ten random readers on /dev/nullb shows a throughput boost of 40%, with
a quadcore. Since BFQ's execution time amounts to ~50% of the total
per-request processing time, the above throughput boost implies that
BFQ's overhead is reduced by more than 50%.
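
For the record, the arithmetic behind that last claim: a 40% throughput boost
means the total per-request time drops to about 1/1.4, i.e., ~71% of its
previous value; with the non-BFQ half (~50% of the original time) unchanged,
BFQ's share falls from ~50% to ~21% of the original per-request time, i.e., it
shrinks by roughly 57%.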

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-cgroup.c  |   2 +
 block/bfq-iosched.c | 249 ++--
 block/bfq-iosched.h |  15 +++
 3 files changed, 256 insertions(+), 10 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index b791e2041e49..e2f14508f2d6 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -547,6 +547,8 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
 
entity->orig_weight = entity->weight = entity->new_weight = d->weight;
entity->my_sched_data = >sched_data;
+   entity->last_bfqq_created = NULL;
+
bfqg->my_entity = entity; /*
   * the root_group's will be set to NULL
   * in bfq_init_queue()
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 23d0dd7bd90f..f4e23e0ced74 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1073,7 +1073,7 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
 static int bfqq_process_refs(struct bfq_queue *bfqq)
 {
return bfqq->ref - bfqq->allocated - bfqq->entity.on_st_or_in_serv -
-   (bfqq->weight_counter != NULL);
+   (bfqq->weight_counter != NULL) - bfqq->stable_ref;
 }
 
 /* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */
@@ -2625,6 +2625,11 @@ static bool bfq_may_be_close_cooperator(struct bfq_queue 
*bfqq,
return true;
 }
 
+static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
+struct bfq_queue *bfqq);
+
+static void bfq_put_stable_ref(struct bfq_queue *bfqq);
+
 /*
  * Attempt to schedule a merge of bfqq with the currently in-service
  * queue or with a close queue among the scheduled queues.  Return
@@ -2647,10 +2652,49 @@ static bool bfq_may_be_close_cooperator(struct 
bfq_queue *bfqq,
  */
 static struct bfq_queue *
 bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-void *io_struct, bool request)
+void *io_struct, bool request, struct bfq_io_cq *bic)
 {
struct bfq_queue *in_service_bfqq, *new_bfqq;
 
+   /*
+* Check delayed stable merge for rotational or non-queueing
+* devs. For this branch to be executed, bfqq must not be
+* currently merged with some other queue (i.e., bfqq->bic
+* must be non null). If we considered also merged queues,
+* then we should also check whether bfqq has already been
+* merged with bic->stable_merge_bfqq. But this would be
+* costly and complicated.
+*/
+   if (unlikely(!bfqd->nonrot_with_queueing)) {
+   if (bic->stable_merge_bfqq &&
+   !bfq_bfqq_just_created(bfqq) &&
+   time_is_after_jiffies(bfqq->split_time +
+  

[PATCH BUGFIX/IMPROVEMENT 1/6] block, bfq: always inject I/O of queues blocked by wakers

2021-01-26 Thread Paolo Valente
Suppose that I/O dispatch is plugged, to wait for new I/O for the
in-service bfq-queue, say bfqq.  Suppose then that there is a further
bfq_queue woken by bfqq, and that this woken queue has pending I/O. A
woken queue does not steal bandwidth from bfqq, because it remains
soon without I/O if bfqq is not served. So there is virtually no risk
of loss of bandwidth for bfqq if this woken queue has I/O dispatched
while bfqq is waiting for new I/O. In contrast, this extra I/O
injection boosts throughput. This commit performs this extra
injection.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 32 +++-
 block/bfq-wf2q.c|  8 
 2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 445cef9c0bb9..a83149407336 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4487,9 +4487,15 @@ static struct bfq_queue *bfq_select_queue(struct 
bfq_data *bfqd)
bfq_bfqq_busy(bfqq->bic->bfqq[0]) &&
bfqq->bic->bfqq[0]->next_rq ?
bfqq->bic->bfqq[0] : NULL;
+   struct bfq_queue *blocked_bfqq =
+   !hlist_empty(>woken_list) ?
+   container_of(bfqq->woken_list.first,
+struct bfq_queue,
+woken_list_node)
+   : NULL;
 
/*
-* The next three mutually-exclusive ifs decide
+* The next four mutually-exclusive ifs decide
 * whether to try injection, and choose the queue to
 * pick an I/O request from.
 *
@@ -4522,7 +4528,15 @@ static struct bfq_queue *bfq_select_queue(struct 
bfq_data *bfqd)
 * next bfqq's I/O is brought forward dramatically,
 * for it is not blocked for milliseconds.
 *
-* The third if checks whether bfqq is a queue for
+* The third if checks whether there is a queue woken
+* by bfqq, and currently with pending I/O. Such a
+* woken queue does not steal bandwidth from bfqq,
+* because it remains soon without I/O if bfqq is not
+* served. So there is virtually no risk of loss of
+* bandwidth for bfqq if this woken queue has I/O
+* dispatched while bfqq is waiting for new I/O.
+*
+* The fourth if checks whether bfqq is a queue for
 * which it is better to avoid injection. It is so if
 * bfqq delivers more throughput when served without
 * any further I/O from other queues in the middle, or
@@ -4542,11 +4556,11 @@ static struct bfq_queue *bfq_select_queue(struct 
bfq_data *bfqd)
 * bfq_update_has_short_ttime(), it is rather likely
 * that, if I/O is being plugged for bfqq and the
 * waker queue has pending I/O requests that are
-* blocking bfqq's I/O, then the third alternative
+* blocking bfqq's I/O, then the fourth alternative
 * above lets the waker queue get served before the
 * I/O-plugging timeout fires. So one may deem the
 * second alternative superfluous. It is not, because
-* the third alternative may be way less effective in
+* the fourth alternative may be way less effective in
 * case of a synchronization. For two main
 * reasons. First, throughput may be low because the
 * inject limit may be too low to guarantee the same
@@ -4555,7 +4569,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data 
*bfqd)
 * guarantees (the second alternative unconditionally
 * injects a pending I/O request of the waker queue
 * for each bfq_dispatch_request()). Second, with the
-* third alternative, the duration of the plugging,
+* fourth alternative, the duration of the plugging,
 * i.e., the time before bfqq finally receives new I/O,
 * may not be minimized, because the waker queue may
 * happen to be served only after other queues.
@@ -4573,6 +4587,14 @@ static struct bfq_queue *bfq_select_queue(struct 
bfq_data *bfqd)
   bfq_bfqq_budget_left(bfqq->waker_bfqq)
)
bfqq = bfqq->waker_bfqq;
+   else if (blocked_bfqq &&
+  bfq_bfqq_busy(blocked_bfqq) &&
+  blocked_bfqq->next_rq &&
+  bfq_serv_to_charge(blocked_bfqq->next_rq,
+ blocke

[PATCH BUGFIX/IMPROVEMENT 4/6] block, bfq: save also weight-raised service on queue merging

2021-01-26 Thread Paolo Valente
To prevent weight-raising information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 ++
 block/bfq-iosched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e5242b2788a..56ad6067d41d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1029,6 +1029,7 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
bfqq->io_start_time = bic->saved_io_start_time;
bfqq->tot_idle_time = bic->saved_tot_idle_time;
bfqq->wr_coeff = bic->saved_wr_coeff;
+   bfqq->service_from_wr = bic->saved_service_from_wr;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
bfqq->last_wr_start_finish = bic->saved_last_wr_start_finish;
bfqq->wr_cur_max_time = bic->saved_wr_cur_max_time;
@@ -2775,6 +2776,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
bic->saved_wr_coeff = bfqq->wr_coeff;
bic->saved_wr_start_at_switch_to_srt =
bfqq->wr_start_at_switch_to_srt;
+   bic->saved_service_from_wr = bfqq->service_from_wr;
bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish;
bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
}
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index c913b06016b3..d15299d59f89 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -440,6 +440,7 @@ struct bfq_io_cq {
 */
unsigned long saved_wr_coeff;
unsigned long saved_last_wr_start_finish;
+   unsigned long saved_service_from_wr;
unsigned long saved_wr_start_at_switch_to_srt;
unsigned int saved_wr_cur_max_time;
struct bfq_ttime saved_ttime;
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 3/6] block, bfq: fix switch back from soft-rt weight-raising

2021-01-26 Thread Paolo Valente
A bfq_queue may happen to be deemed as soft real-time while it is
still enjoying interactive weight-raising. If this happens because of
a false positive, then the bfq_queue is likely to lose its soft
real-time status soon. Upon losing such a status, the bfq_queue must
get back its interactive weight-raising, if its interactive period is
not over yet. But this case is not handled. This commit corrects this
error.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6a02a12ff553..9e5242b2788a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5293,8 +5293,26 @@ bfq_update_io_seektime(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
 
if (bfqq->wr_coeff > 1 &&
bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
-   BFQQ_TOTALLY_SEEKY(bfqq))
-   bfq_bfqq_end_wr(bfqq);
+   BFQQ_TOTALLY_SEEKY(bfqq)) {
+   if (time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt +
+  bfq_wr_duration(bfqd))) {
+   /*
+* In soft_rt weight raising with the
+* interactive-weight-raising period
+* elapsed (so no switch back to
+* interactive weight raising).
+*/
+   bfq_bfqq_end_wr(bfqq);
+   } else { /*
+ * stopping soft_rt weight raising
+ * while still in interactive period,
+ * switch back to interactive weight
+ * raising
+ */
+   switch_back_to_interactive_wr(bfqq, bfqd);
+   bfqq->entity.prio_changed = 1;
+   }
+   }
 }
 
 static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 6/6] block, bfq: make waker-queue detection more robust

2021-01-26 Thread Paolo Valente
In the presence of many parallel I/O flows, the detection of waker
bfq_queues suffers from false positives. This commit addresses this
issue by making the filtering of actual wakers more selective. In more
detail, a candidate waker must be found to meet the waker requirements
three times before being promoted to an actual waker.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 211 +---
 block/bfq-iosched.h |   7 +-
 2 files changed, 108 insertions(+), 110 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index e56ee60df014..445cef9c0bb9 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -158,7 +158,6 @@ BFQ_BFQQ_FNS(in_large_burst);
 BFQ_BFQQ_FNS(coop);
 BFQ_BFQQ_FNS(split_coop);
 BFQ_BFQQ_FNS(softrt_update);
-BFQ_BFQQ_FNS(has_waker);
 #undef BFQ_BFQQ_FNS\
 
 /* Expiration time of sync (0) and async (1) requests, in ns. */
@@ -1905,6 +1904,107 @@ static void bfq_update_io_intensity(struct bfq_queue 
*bfqq, u64 now_ns)
}
 }
 
+/*
+ * Detect whether bfqq's I/O seems synchronized with that of some
+ * other queue, i.e., whether bfqq, after remaining empty, happens to
+ * receive new I/O only right after some I/O request of the other
+ * queue has been completed. We call waker queue the other queue, and
+ * we assume, for simplicity, that bfqq may have at most one waker
+ * queue.
+ *
+ * A remarkable throughput boost can be reached by unconditionally
+ * injecting the I/O of the waker queue, every time a new
+ * bfq_dispatch_request happens to be invoked while I/O is being
+ * plugged for bfqq.  In addition to boosting throughput, this
+ * unblocks bfqq's I/O, thereby improving bandwidth and latency for
+ * bfqq. Note that these same results may be achieved with the general
+ * injection mechanism, but less effectively. For details on this
+ * aspect, see the comments on the choice of the queue for injection
+ * in bfq_select_queue().
+ *
+ * Turning back to the detection of a waker queue, a queue Q is deemed
+ * as a waker queue for bfqq if, for three consecutive times, bfqq
+ * happens to become non empty right after a request of Q has been
+ * completed. In particular, on the first time, Q is tentatively set
+ * as a candidate waker queue, while on the third consecutive time
+ * that Q is detected, the field waker_bfqq is set to Q, to confirm
+ * that Q is a waker queue for bfqq. These detection steps are
+ * performed only if bfqq has a long think time, so as to make it more
+ * likely that bfqq's I/O is actually being blocked by a
+ * synchronization. This last filter, plus the above three-times
+ * requirement, make false positives less likely.
+ *
+ * NOTE
+ *
+ * The sooner a waker queue is detected, the sooner throughput can be
+ * boosted by injecting I/O from the waker queue. Fortunately,
+ * detection is likely to be actually fast, for the following
+ * reasons. While blocked by synchronization, bfqq has a long think
+ * time. This implies that bfqq's inject limit is at least equal to 1
+ * (see the comments in bfq_update_inject_limit()). So, thanks to
+ * injection, the waker queue is likely to be served during the very
+ * first I/O-plugging time interval for bfqq. This triggers the first
+ * step of the detection mechanism. Thanks again to injection, the
+ * candidate waker queue is then likely to be confirmed no later than
+ * during the next I/O-plugging interval for bfqq.
+ *
+ * ISSUE
+ *
+ * On queue merging all waker information is lost.
+ */
+void bfq_check_waker(struct bfq_data *bfqd, struct bfq_queue *bfqq, u64 now_ns)
+{
+   if (!bfqd->last_completed_rq_bfqq ||
+   bfqd->last_completed_rq_bfqq == bfqq ||
+   bfq_bfqq_has_short_ttime(bfqq) ||
+   now_ns - bfqd->last_completion >= 4 * NSEC_PER_MSEC ||
+   bfqd->last_completed_rq_bfqq == bfqq->waker_bfqq)
+   return;
+
+   if (bfqd->last_completed_rq_bfqq !=
+   bfqq->tentative_waker_bfqq) {
+   /*
+* First synchronization detected with a
+* candidate waker queue, or with a different
+* candidate waker queue from the current one.
+*/
+   bfqq->tentative_waker_bfqq =
+   bfqd->last_completed_rq_bfqq;
+   bfqq->num_waker_detections = 1;
+   } else /* Same tentative waker queue detected again */
+   bfqq->num_waker_detections++;
+
+   if (bfqq->num_waker_detections == 3) {
+   bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq;
+   bfqq->tentative_waker_bfqq = NULL;
+
+   /*
+* If the waker queue disappears, then
+* bfqq->waker_bfqq must be reset. To
+* this goal, we maintain in each
+* waker queue a list, woken_list, of
+* all the queue

[PATCH BUGFIX/IMPROVEMENT 5/6] block, bfq: save also injection state on queue merging

2021-01-26 Thread Paolo Valente
To prevent injection information from being lost on bfq_queue merging,
the injection state of the bfq_queue (the last_serv_time_ns,
inject_limit and decrease_time_jif fields in the diff below) must also
be saved and restored when the bfq_queue is merged and split,
respectively.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 8 
 block/bfq-iosched.h | 5 +
 2 files changed, 13 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 56ad6067d41d..e56ee60df014 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1024,6 +1024,10 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
else
bfq_clear_bfqq_IO_bound(bfqq);
 
+   bfqq->last_serv_time_ns = bic->saved_last_serv_time_ns;
+   bfqq->inject_limit = bic->saved_inject_limit;
+   bfqq->decrease_time_jif = bic->saved_decrease_time_jif;
+
bfqq->entity.new_weight = bic->saved_weight;
bfqq->ttime = bic->saved_ttime;
bfqq->io_start_time = bic->saved_io_start_time;
@@ -2748,6 +2752,10 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
if (!bic)
return;
 
+   bic->saved_last_serv_time_ns = bfqq->last_serv_time_ns;
+   bic->saved_inject_limit = bfqq->inject_limit;
+   bic->saved_decrease_time_jif = bfqq->decrease_time_jif;
+
bic->saved_weight = bfqq->entity.orig_weight;
bic->saved_ttime = bfqq->ttime;
bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index d15299d59f89..3f350fa3c5fd 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -444,6 +444,11 @@ struct bfq_io_cq {
unsigned long saved_wr_start_at_switch_to_srt;
unsigned int saved_wr_cur_max_time;
struct bfq_ttime saved_ttime;
+
+   /* Save also injection state */
+   u64 saved_last_serv_time_ns;
+   unsigned int saved_inject_limit;
+   unsigned long saved_decrease_time_jif;
 };
 
 /**
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 2/6] block, bfq: re-evaluate convenience of I/O plugging on rq arrivals

2021-01-26 Thread Paolo Valente
Upon an I/O-dispatch attempt, BFQ may detect that it was better to
plug I/O dispatch, and to wait for a new request to arrive for the
currently in-service queue. But the arrival of a new request for an
empty bfq_queue, and thus the switch from idle to busy of the
bfq_queue, may cause the scenario to change, and make plugging no
longer needed for service guarantees, or more convenient for
throughput. In this case, keeping I/O-dispatch plugged would certainly
lower throughput.

To address this issue, this commit re-evaluates plugging upon such an
idle-to-busy switch, and stops plugging I/O dispatch when plugging is
no longer beneficial.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index db393f5d70ba..6a02a12ff553 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1649,6 +1649,8 @@ static bool bfq_bfqq_higher_class_or_weight(struct 
bfq_queue *bfqq,
return bfqq_weight > in_serv_weight;
 }
 
+static bool bfq_better_to_idle(struct bfq_queue *bfqq);
+
 static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 struct bfq_queue *bfqq,
 int old_wr_coeff,
@@ -1750,10 +1752,10 @@ static void bfq_bfqq_handle_idle_busy_switch(struct 
bfq_data *bfqd,
bfq_add_bfqq_busy(bfqd, bfqq);
 
/*
-* Expire in-service queue only if preemption may be needed
-* for guarantees. In particular, we care only about two
-* cases. The first is that bfqq has to recover a service
-* hole, as explained in the comments on
+* Expire in-service queue if preemption may be needed for
+* guarantees or throughput. As for guarantees, we care
+* explicitly about two cases. The first is that bfqq has to
+* recover a service hole, as explained in the comments on
 * bfq_bfqq_update_budg_for_activation(), i.e., that
 * bfqq_wants_to_preempt is true. However, if bfqq does not
 * carry time-critical I/O, then bfqq's bandwidth is less
@@ -1780,11 +1782,23 @@ static void bfq_bfqq_handle_idle_busy_switch(struct 
bfq_data *bfqd,
 * timestamps of the in-service queue would need to be
 * updated, and this operation is quite costly (see the
 * comments on bfq_bfqq_update_budg_for_activation()).
+*
+* As for throughput, we ask bfq_better_to_idle() whether we
+* still need to plug I/O dispatching. If bfq_better_to_idle()
+* says no, then plugging is not needed any longer, either to
+* boost throughput or to perserve service guarantees. Then
+* the best option is to stop plugging I/O, as not doing so
+* would certainly lower throughput. We may end up in this
+* case if: (1) upon a dispatch attempt, we detected that it
+* was better to plug I/O dispatch, and to wait for a new
+* request to arrive for the currently in-service queue, but
+* (2) this switch of bfqq to busy changes the scenario.
 */
if (bfqd->in_service_queue &&
((bfqq_wants_to_preempt &&
  bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) ||
-bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue)) &&
+bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue) ||
+!bfq_better_to_idle(bfqd->in_service_queue)) &&
next_queue_may_preempt(bfqd))
bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
false, BFQQE_PREEMPTED);
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 0/6] block, bfq: second batch of fixes and improvements

2021-01-26 Thread Paolo Valente
Hi,
here's batch 2/3.

Thanks,
Paolo

Paolo Valente (6):
  block, bfq: replace mechanism for evaluating I/O intensity
  block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
  block, bfq: fix switch back from soft-rt weight-raising
  block, bfq: save also weight-raised service on queue merging
  block, bfq: save also injection state on queue merging
  block, bfq: make waker-queue detection more robust

 block/bfq-iosched.c | 328 ++--
 block/bfq-iosched.h |  29 ++--
 2 files changed, 214 insertions(+), 143 deletions(-)

--
2.20.1


[PATCH BUGFIX/IMPROVEMENT 1/6] block, bfq: replace mechanism for evaluating I/O intensity

2021-01-26 Thread Paolo Valente
Some BFQ mechanisms make their decisions for a bfq_queue based also on
whether the bfq_queue is I/O bound. In this respect, the current logic
for evaluating whether a bfq_queue is I/O bound is rather rough. This
commit replaces this logic with a more effective one.

The new logic measures the percentage of time during which a bfq_queue
is active, and marks the bfq_queue as I/O bound if this percentage is
above a fixed threshold.
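
For reference, the fixed threshold in the patch below corresponds to being
busy for at least ~80% of the observation window. A minimal user-space sketch
of the same test, with made-up sample values:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        /* made-up sample values; in the patch these are ns counters */
        unsigned long long tot_io_time = 100;   /* observation window so far */
        unsigned long long tot_idle_time = 15;  /* time spent with no I/O */

        /* idle * 5 > total  <=>  idle > 20%  <=>  busy < 80% */
        bool io_bound = !(tot_idle_time * 5 > tot_io_time);

        printf("idle %.0f%% of the window -> %sI/O bound\n",
               100.0 * tot_idle_time / tot_io_time, io_bound ? "" : "not ");
        return 0;
}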

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 63 +++--
 block/bfq-iosched.h | 16 ++--
 2 files changed, 52 insertions(+), 27 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c045613ce927..db393f5d70ba 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1026,6 +1026,8 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
 
bfqq->entity.new_weight = bic->saved_weight;
bfqq->ttime = bic->saved_ttime;
+   bfqq->io_start_time = bic->saved_io_start_time;
+   bfqq->tot_idle_time = bic->saved_tot_idle_time;
bfqq->wr_coeff = bic->saved_wr_coeff;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
bfqq->last_wr_start_finish = bic->saved_last_wr_start_finish;
@@ -1721,17 +1723,6 @@ static void bfq_bfqq_handle_idle_busy_switch(struct 
bfq_data *bfqd,
 
bfq_clear_bfqq_just_created(bfqq);
 
-
-   if (!bfq_bfqq_IO_bound(bfqq)) {
-   if (arrived_in_time) {
-   bfqq->requests_within_timer++;
-   if (bfqq->requests_within_timer >=
-   bfqd->bfq_requests_within_timer)
-   bfq_mark_bfqq_IO_bound(bfqq);
-   } else
-   bfqq->requests_within_timer = 0;
-   }
-
if (bfqd->low_latency) {
if (unlikely(time_is_after_jiffies(bfqq->split_time)))
/* wraparound */
@@ -1865,6 +1856,36 @@ static void bfq_reset_inject_limit(struct bfq_data *bfqd,
bfqq->decrease_time_jif = jiffies;
 }
 
+static void bfq_update_io_intensity(struct bfq_queue *bfqq, u64 now_ns)
+{
+   u64 tot_io_time = now_ns - bfqq->io_start_time;
+
+   if (RB_EMPTY_ROOT(>sort_list) && bfqq->dispatched == 0)
+   bfqq->tot_idle_time +=
+   now_ns - bfqq->ttime.last_end_request;
+
+   if (unlikely(bfq_bfqq_just_created(bfqq)))
+   return;
+
+   /*
+* Must be busy for at least about 80% of the time to be
+* considered I/O bound.
+*/
+   if (bfqq->tot_idle_time * 5 > tot_io_time)
+   bfq_clear_bfqq_IO_bound(bfqq);
+   else
+   bfq_mark_bfqq_IO_bound(bfqq);
+
+   /*
+* Keep an observation window of at most 200 ms in the past
+* from now.
+*/
+   if (tot_io_time > 200 * NSEC_PER_MSEC) {
+   bfqq->io_start_time = now_ns - (tot_io_time>>1);
+   bfqq->tot_idle_time >>= 1;
+   }
+}
+
 static void bfq_add_request(struct request *rq)
 {
struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -1872,6 +1893,7 @@ static void bfq_add_request(struct request *rq)
struct request *next_rq, *prev;
unsigned int old_wr_coeff = bfqq->wr_coeff;
bool interactive = false;
+   u64 now_ns = ktime_get_ns();
 
bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
bfqq->queued[rq_is_sync(rq)]++;
@@ -1934,7 +1956,7 @@ static void bfq_add_request(struct request *rq)
 */
if (bfqd->last_completed_rq_bfqq &&
!bfq_bfqq_has_short_ttime(bfqq) &&
-   ktime_get_ns() - bfqd->last_completion <
+   now_ns - bfqd->last_completion <
4 * NSEC_PER_MSEC) {
if (bfqd->last_completed_rq_bfqq != bfqq &&
bfqd->last_completed_rq_bfqq !=
@@ -2051,6 +2073,9 @@ static void bfq_add_request(struct request *rq)
}
}
 
+   if (bfq_bfqq_sync(bfqq))
+   bfq_update_io_intensity(bfqq, now_ns);
+
elv_rb_add(>sort_list, rq);
 
/*
@@ -2712,6 +2737,8 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
bic->saved_ttime = bfqq->ttime;
bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
+   bic->saved_io_start_time = bfqq->io_start_time;
+   bic->saved_tot_idle_time = bfqq->tot_idle_time;
bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
bic->was_in_burst_list = !hlist_unhashed(>burst_list_node);
if (unlikely(bfq_bfqq_just_created(bfqq) &

Re: linux-next: Fixes tag needs some work in the block tree

2021-01-25 Thread Paolo Valente



> Il giorno 25 gen 2021, alle ore 10:40, Stephen Rothwell 
>  ha scritto:
> 
> Hi all,
> 
> In commit
> 
>  d4fc3640ff36 ("block, bfq: set next_rq to waker_bfqq->next_rq in waker 
> injection")
> 
> Fixes tag
> 
>  Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject 
> their I/O")
> 
> has these problem(s):
> 
>  - Target SHA1 does not exist
> 
> Maybe you meant
> 
> Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject 
> their I/O")
> 

Hi Jens,
how to proceed in such a case (with patches already applied by you)?
Shall I send you a v2 with only this change?

Thanks,
Paolo

> -- 
> Cheers,
> Stephen Rothwell



[PATCH BUGFIX/IMPROVEMENT 3/6] block, bfq: increase time window for waker detection

2021-01-22 Thread Paolo Valente
Tests on slower machines showed the current window to be way too
small. This commit increases it.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index fdc5e163b2fe..43e2c39cf7b5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1931,7 +1931,7 @@ static void bfq_add_request(struct request *rq)
if (bfqd->last_completed_rq_bfqq &&
!bfq_bfqq_has_short_ttime(bfqq) &&
ktime_get_ns() - bfqd->last_completion <
-   200 * NSEC_PER_USEC) {
+   4 * NSEC_PER_MSEC) {
if (bfqd->last_completed_rq_bfqq != bfqq &&
bfqd->last_completed_rq_bfqq !=
bfqq->waker_bfqq) {
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 2/6] block, bfq: set next_rq to waker_bfqq->next_rq in waker injection

2021-01-22 Thread Paolo Valente
From: Jia Cheng Hu 

Since commit c5089591c3ba ("block, bfq: detect wakers and
unconditionally inject their I/O"), when the in-service bfq_queue, say
Q, is temporarily empty, BFQ checks whether there are I/O requests to
inject (also) from the waker bfq_queue for Q. To this goal, the value
pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the
current implementation mistakenly looks at bfqq->next_rq, which
instead points to the next request of the currently served queue.

This mistake evidently causes losses of throughput in scenarios with
waker bfq_queues.

This commit corrects this mistake.

Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject 
their I/O")
Signed-off-by: Jia Cheng Hu 
Signed-off-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index eb2ca32d5b63..fdc5e163b2fe 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4499,7 +4499,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data 
*bfqd)
bfqq = bfqq->bic->bfqq[0];
else if (bfq_bfqq_has_waker(bfqq) &&
   bfq_bfqq_busy(bfqq->waker_bfqq) &&
-  bfqq->next_rq &&
+  bfqq->waker_bfqq->next_rq &&
   bfq_serv_to_charge(bfqq->waker_bfqq->next_rq,
  bfqq->waker_bfqq) <=
   bfq_bfqq_budget_left(bfqq->waker_bfqq)
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 1/6] block, bfq: use half slice_idle as a threshold to check short ttime

2021-01-22 Thread Paolo Valente
The value of the I/O plugging (idling) timeout is used also as the
think-time threshold to decide whether a process has a short think
time.  In this respect, a good value of this timeout for rotational
drives is in the order of several ms. Yet, this is often too long a
time interval to be effective as a think-time threshold. This commit
mitigates this problem (by a lot, according to tests), by halving the
threshold.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e4eb0fc1c16..eb2ca32d5b63 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5238,12 +5238,13 @@ static void bfq_update_has_short_ttime(struct bfq_data 
*bfqd,
return;
 
/* Think time is infinite if no process is linked to
-* bfqq. Otherwise check average think time to
-* decide whether to mark as has_short_ttime
+* bfqq. Otherwise check average think time to decide whether
+* to mark as has_short_ttime. To this goal, compare average
+* think time with half the I/O-plugging timeout.
 */
if (atomic_read(>icq.ioc->active_ref) == 0 ||
(bfq_sample_valid(bfqq->ttime.ttime_samples) &&
-bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle))
+bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle>>1))
has_short_ttime = false;
 
state_changed = has_short_ttime != bfq_bfqq_has_short_ttime(bfqq);
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 0/6] block, bfq: first batch of fixes and improvements

2021-01-22 Thread Paolo Valente
Hi,

about nine months ago, Jan (Kara, SUSE) reported a throughput
regression with BFQ. That was the beginning of a fruitful dev
collaboration, which led to 18 new commits. Part are fixes, part are
actual performance improvements.

Jia Cheng Hu (1):
  block, bfq: set next_rq to waker_bfqq->next_rq in waker injection

Paolo Valente (5):
  block, bfq: use half slice_idle as a threshold to check short ttime
  block, bfq: increase time window for waker detection
  block, bfq: do not raise non-default weights
  block, bfq: avoid spurious switches to soft_rt of interactive queues
  block, bfq: do not expire a queue when it is the only busy one

 block/bfq-iosched.c | 100 +++-
 1 file changed, 70 insertions(+), 30 deletions(-)

--
2.20.1


[PATCH BUGFIX/IMPROVEMENT 4/6] block, bfq: do not raise non-default weights

2021-01-22 Thread Paolo Valente
BFQ heuristics try to detect interactive I/O, and raise the weight of
the queues containing such I/O. Yet, if the user also changes the
weight of a queue (i.e., the user changes the ioprio of the process
associated with that queue), then it is most likely better to prevent
BFQ heuristics from silently changing that same weight.
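
For context, the magic number 40 compared against in the hunk below is BFQ's
legacy default weight. A minimal sketch of the ioprio-to-weight mapping,
assuming the usual constants (8 ioprio levels, conversion coefficient 10;
both constants are assumptions here, not taken from this patch):

/*
 * Sketch of BFQ's ioprio -> weight conversion under the assumed
 * constants above; the default BE ioprio 4 then yields weight
 * (8 - 4) * 10 = 40, the value tested in the hunk below.
 */
static inline unsigned short ioprio_to_bfq_weight(int ioprio)
{
        return (8 - ioprio) * 10;
}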

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 43e2c39cf7b5..161badb744d6 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1671,15 +1671,19 @@ static void bfq_bfqq_handle_idle_busy_switch(struct 
bfq_data *bfqd,
 * - it is sync,
 * - it does not belong to a large burst,
 * - it has been idle for enough time or is soft real-time,
-* - is linked to a bfq_io_cq (it is not shared in any sense).
+* - is linked to a bfq_io_cq (it is not shared in any sense),
+* - has a default weight (otherwise we assume the user wanted
+*   to control its weight explicitly)
 */
in_burst = bfq_bfqq_in_large_burst(bfqq);
soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
!BFQQ_TOTALLY_SEEKY(bfqq) &&
!in_burst &&
time_is_before_jiffies(bfqq->soft_rt_next_start) &&
-   bfqq->dispatched == 0;
-   *interactive = !in_burst && idle_for_long_time;
+   bfqq->dispatched == 0 &&
+   bfqq->entity.new_weight == 40;
+   *interactive = !in_burst && idle_for_long_time &&
+   bfqq->entity.new_weight == 40;
wr_or_deserves_wr = bfqd->low_latency &&
(bfqq->wr_coeff > 1 ||
 (bfq_bfqq_sync(bfqq) &&
-- 
2.20.1



[PATCH BUGFIX/IMPROVEMENT 5/6] block, bfq: avoid spurious switches to soft_rt of interactive queues

2021-01-22 Thread Paolo Valente
BFQ tags some bfq_queues as interactive or soft_rt if it deems that
these bfq_queues contain the I/O of, respectively, interactive or soft
real-time applications. BFQ privileges both these special types of
bfq_queues over normal bfq_queues. To privilege a bfq_queue, BFQ
mainly raises the weight of the bfq_queue. In particular, soft_rt
bfq_queues get a higher weight than interactive bfq_queues.

A bfq_queue may turn from interactive to soft_rt. And this leads to a
tricky issue. Soft real-time applications usually start with an
I/O-bound, interactive phase, in which they load themselves into main
memory. BFQ correctly detects this phase, and keeps the bfq_queues
associated with the application in interactive mode for a
while. Problems arise when the I/O pattern of the application finally
switches to soft real-time. One of the conditions for a bfq_queue to
be deemed as soft_rt is that the bfq_queue does not consume too much
bandwidth. But the bfq_queues associated with a soft real-time
application consume as much bandwidth as they can in the loading phase
of the application. So, after the application becomes truly soft
real-time, a lot of time should pass before the average bandwidth
consumed by its bfq_queues finally drops to a value acceptable for
soft_rt bfq_queues. As a consequence, there might be a time gap during
which the application is not privileged at all, because its bfq_queues
are not interactive any longer, but cannot be deemed as soft_rt yet.

To avoid this problem, BFQ pretends that an interactive bfq_queue
consumes zero bandwidth, and allows an interactive bfq_queue to switch
to soft_rt. Yet, this fake zero-bandwidth consumption easily causes
the bfq_queue to often switch to soft_rt deceptively, during its
loading phase. As in soft_rt mode, the bfq_queue gets its bandwidth
correctly computed, and therefore soon switches back to
interactive. Then it switches again to soft_rt, and so on. These
spurious fluctuations usually cause losses of throughput, because they
deceive BFQ's mechanisms for boosting throughput (injection,
I/O-plugging avoidance, ...).

This commit addresses this issue as follows:
1) It does compute actual bandwidth consumption also for interactive
   bfq_queues. This avoids the above false positives.
2) When a bfq_queue switches from interactive to normal mode, the
   consumed bandwidth is reset (forgotten). This allows the
   bfq_queue to enjoy soft_rt very quickly. In particular, two
   alternatives are possible in this switch:
- the bfq_queue still has backlog, and therefore there is a budget
  already scheduled to serve the bfq_queue; in this case, the
  scheduling of the current budget of the bfq_queue is not
  hindered, because only the scheduling of the next budget will
  be affected by the weight drop. After that, if the bfq_queue is
  actually in a soft_rt phase, and becomes empty during the
  service of its current budget, which is the natural behavior of
  a soft_rt bfq_queue, then the bfq_queue will be considered as
  soft_rt when its next I/O arrives. If, in contrast, the
  bfq_queue remains constantly non-empty, then its next budget
  will be scheduled with a low weight, which is the natural
  treatment for an I/O-bound (non soft_rt) bfq_queue.
- the bfq_queue is empty; in this case, the bfq_queue may be
  considered unjustly soft_rt when its new I/O arrives. Yet
  the problem is now much smaller than before, because it is
  unlikely that more than one spurious fluctuation occurs.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 57 +
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 161badb744d6..003c96fa01ad 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2356,6 +2356,24 @@ static void bfq_requests_merged(struct request_queue *q, 
struct request *rq,
 /* Must be called with bfqq != NULL */
 static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
 {
+   /*
+* If bfqq has been enjoying interactive weight-raising, then
+* reset soft_rt_next_start. We do it for the following
+* reason. bfqq may have been conveying the I/O needed to load
+* a soft real-time application. Such an application actually
+* exhibits a soft real-time I/O pattern after it finishes
+* loading, and finally starts doing its job. But, if bfqq has
+* been receiving a lot of bandwidth so far (likely to happen
+* on a fast device), then soft_rt_next_start now contains a
+* high value that. So, without this reset, bfqq would be
+* prevented from being possibly considered as soft_rt for a
+* very long time.
+*/
+
+   if (bfqq->wr_cur_max_time !=
+   bfqq->bfqd->bfq_wr_rt_max_time)
+   bfqq->soft_rt_next_start = jiffies;
+
if (bfq_bf

[PATCH BUGFIX/IMPROVEMENT 6/6] block, bfq: do not expire a queue when it is the only busy one

2021-01-22 Thread Paolo Valente
This commit preserves I/O-dispatch plugging for a special symmetric
case that may suddenly turn into asymmetric: the case where only one
bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not
cause any harm to any other queues in terms of service guarantees. In
contrast, it avoids the following unlucky sequence of events: (1) bfqq
is expired, (2) a new queue with a lower weight than bfqq becomes busy
(or more queues), (3) the new queue is served until a new request
arrives for bfqq, (4) when bfqq is finally served, there are so many
requests of the new queue in the drive that the pending requests for
bfqq take a lot of time to be served. In particular, event (2) may
cause even already-dispatched requests of bfqq to be delayed inside
the drive. So, to avoid this series of events, the scenario is
preventively declared as asymmetric also if bfqq is the only busy
queue. By doing so, I/O-dispatch plugging is performed for bfqq.

Tested-by: Jan Kara 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 003c96fa01ad..c045613ce927 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3464,20 +3464,38 @@ static void bfq_dispatch_remove(struct request_queue 
*q, struct request *rq)
  * order until all the requests already queued in the device have been
  * served. The last sub-condition commented above somewhat mitigates
  * this problem for weight-raised queues.
+ *
+ * However, as an additional mitigation for this problem, we preserve
+ * plugging for a special symmetric case that may suddenly turn into
+ * asymmetric: the case where only bfqq is busy. In this case, not
+ * expiring bfqq does not cause any harm to any other queues in terms
+ * of service guarantees. In contrast, it avoids the following unlucky
+ * sequence of events: (1) bfqq is expired, (2) a new queue with a
+ * lower weight than bfqq becomes busy (or more queues), (3) the new
+ * queue is served until a new request arrives for bfqq, (4) when bfqq
+ * is finally served, there are so many requests of the new queue in
+ * the drive that the pending requests for bfqq take a lot of time to
+ * be served. In particular, event (2) may case even already
+ * dispatched requests of bfqq to be delayed, inside the drive. So, to
+ * avoid this series of events, the scenario is preventively declared
+ * as asymmetric also if bfqq is the only busy queues
  */
 static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
 struct bfq_queue *bfqq)
 {
+   int tot_busy_queues = bfq_tot_busy_queues(bfqd);
+
/* No point in idling for bfqq if it won't get requests any longer */
if (unlikely(!bfqq_process_refs(bfqq)))
return false;
 
return (bfqq->wr_coeff > 1 &&
(bfqd->wr_busy_queues <
-bfq_tot_busy_queues(bfqd) ||
+tot_busy_queues ||
 bfqd->rq_in_driver >=
 bfqq->dispatched + 4)) ||
-   bfq_asymmetric_scenario(bfqd, bfqq);
+   bfq_asymmetric_scenario(bfqd, bfqq) ||
+   tot_busy_queues == 1;
 }
 
 static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-- 
2.20.1



Re: [PATCH BUGFIX/IMPROVEMENT 0/6] block, bfq: first batch of fixes and improvements

2021-01-22 Thread Paolo Valente



> Il giorno 22 gen 2021, alle ore 19:19, Paolo Valente 
>  ha scritto:
> 
> Hi,
> 
> about nine months ago, Jan (Kara, SUSE) reported a throughput
> regression with BFQ. That was the beginning of a fruitful dev
> collaboration, which led to 18 new commits. Part are fixes, part are
> actual performance improvements.
> 

The cover letter was not complete, sorry. Here is the missing piece:

Given the high number of commits, and the size of a few of them, I've
opted for splitting their submission into three batches. This is the
first batch.

Thanks,
Paolo

> Jia Cheng Hu (1):
>  block, bfq: set next_rq to waker_bfqq->next_rq in waker injection
> 
> Paolo Valente (5):
>  block, bfq: use half slice_idle as a threshold to check short ttime
>  block, bfq: increase time window for waker detection
>  block, bfq: do not raise non-default weights
>  block, bfq: avoid spurious switches to soft_rt of interactive queues
>  block, bfq: do not expire a queue when it is the only busy one
> 
> block/bfq-iosched.c | 100 +++-
> 1 file changed, 70 insertions(+), 30 deletions(-)
> 
> --
> 2.20.1



Re: [PATCH] bfq: don't check active group if bfq.weight is not changed

2021-01-22 Thread Paolo Valente



> Il giorno 14 gen 2021, alle ore 13:24, Yu Kuai  ha 
> scritto:
> 
> Now group scheduling in BFQ depends on the check for active groups,
> but in most cases group scheduling is not used, and the check for
> active groups causes bfq_asymmetric_scenario() and its caller
> bfq_better_to_idle() to always return true, so throughput
> is impacted when the workload does not need idling (e.g. random rw).
> 
> To fix that, add checks in bfq_io_set_weight_legacy() and
> bfq_pd_init() to detect whether group scheduling is actually used
> (i.e., a non-default weight is set). If not, there is no need
> to check for active groups.
> 

Hi,
I do like the goal you want to attain.  Still, I see a problem with
your proposal.  Consider two groups, say A and B.  Suppose that both
have the same, default weight.  Yet, group A generates large I/O
requests, while group B generates small requests.  With your change,
idling would not be performed.  This would cause group A to steal
bandwidth from group B, in proportion to how large its requests are
compared with those of group B.
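
As a rough illustration (the request sizes are made up): with both groups at
the same default weight, if A issues 1 MiB requests and B issues 8 KiB
requests and dispatches simply alternate between the two without idling, A
ends up with roughly 128 times B's bandwidth.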

As a possible solution, maybe we would also need a varied_rq_size
flag, similar to the varied_weights flag?

Thoughts?

Thanks for your contribution,
Paolo

> Signed-off-by: Yu Kuai 
> ---
> block/bfq-cgroup.c  | 14 --
> block/bfq-iosched.c |  8 +++-
> block/bfq-iosched.h | 19 +++
> 3 files changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index b791e2041e49..b4ac42c4bd9f 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -505,12 +505,18 @@ static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t 
> gfp)
>   return >pd;
> }
> 
> +static inline int bfq_dft_weight(void)
> +{
> + return cgroup_subsys_on_dfl(io_cgrp_subsys) ?
> +CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
> +
> +}
> +
> static void bfq_cpd_init(struct blkcg_policy_data *cpd)
> {
>   struct bfq_group_data *d = cpd_to_bfqgd(cpd);
> 
> - d->weight = cgroup_subsys_on_dfl(io_cgrp_subsys) ?
> - CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
> + d->weight = bfq_dft_weight();
> }
> 
> static void bfq_cpd_free(struct blkcg_policy_data *cpd)
> @@ -554,6 +560,9 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
>   bfqg->bfqd = bfqd;
>   bfqg->active_entities = 0;
>   bfqg->rq_pos_tree = RB_ROOT;
> +
> + if (entity->new_weight != bfq_dft_weight())
> + bfqd_enable_active_group_check(bfqd);
> }
> 
> static void bfq_pd_free(struct blkg_policy_data *pd)
> @@ -1013,6 +1022,7 @@ static void bfq_group_set_weight(struct bfq_group 
> *bfqg, u64 weight, u64 dev_wei
>*/
>   smp_wmb();
>   bfqg->entity.prio_changed = 1;
> + bfqd_enable_active_group_check(bfqg->bfqd);
>   }
> }
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 9e4eb0fc1c16..1b695de1df95 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -699,11 +699,8 @@ static bool bfq_asymmetric_scenario(struct bfq_data 
> *bfqd,
>   (bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
>   (bfqd->busy_queues[1] && bfqd->busy_queues[2]);
> 
> - return varied_queue_weights || multiple_classes_busy
> -#ifdef CONFIG_BFQ_GROUP_IOSCHED
> -|| bfqd->num_groups_with_pending_reqs > 0
> -#endif
> - ;
> + return varied_queue_weights || multiple_classes_busy ||
> +bfqd_has_active_group(bfqd);
> }
> 
> /*
> @@ -6472,6 +6469,7 @@ static int bfq_init_queue(struct request_queue *q, 
> struct elevator_type *e)
> 
>   bfqd->queue_weights_tree = RB_ROOT_CACHED;
>   bfqd->num_groups_with_pending_reqs = 0;
> + bfqd->check_active_group = false;
> 
>   INIT_LIST_HEAD(>active_list);
>   INIT_LIST_HEAD(>idle_list);
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index 703895224562..216509013012 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -524,6 +524,8 @@ struct bfq_data {
> 
>   /* true if the device is non rotational and performs queueing */
>   bool nonrot_with_queueing;
> + /* true if need to check num_groups_with_pending_reqs */
> + bool check_active_group;
> 
>   /*
>* Maximum number of requests in driver in the last
> @@ -1066,6 +1068,17 @@ static inline void bfq_pid_to_str(int pid, char *str, 
> int len)
> }
> 
> #ifdef CONFIG_BFQ_GROUP_IOSCHED
> +static inline void bfqd_enable_active_group_check(struct bfq_data *bfqd)
> +{
> + cmpxchg_relaxed(>check_active_group, false, true);
> +}
> +
> +static inline bool bfqd_has_active_group(struct bfq_data *bfqd)
> +{
> + return bfqd->check_active_group &&
> +bfqd->num_groups_with_pending_reqs > 0;
> +}
> +
> struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
> 
> #define bfq_log_bfqq(bfqd, bfqq, fmt, args...)do {
> \
> @@ -1085,6 +1098,12 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq);

Re: block, bfq: lockdep circular locking dependency gripe

2020-10-21 Thread Paolo Valente



> Il giorno 20 ott 2020, alle ore 18:54, Jens Axboe  ha 
> scritto:
> 
> On 10/20/20 1:15 AM, Paolo Valente wrote:
>>> Il giorno 20 ott 2020, alle ore 08:15, Mike Galbraith  ha 
>>> scritto:
>>> 
>>> [ 1917.361401] ==
>>> [ 1917.361406] WARNING: possible circular locking dependency detected
>>> [ 1917.361413] 5.9.0.g7cf726a-master #2 Tainted: G S  E
>>> [ 1917.361417] --
>>> [ 1917.361422] kworker/u16:35/15995 is trying to acquire lock:
>>> [ 1917.361428] 89232237f7e0 (>lock){..-.}-{2:2}, at: 
>>> put_io_context+0x30/0x90
>>> [ 1917.361440]
>>>  but task is already holding lock:
>>> [ 1917.361445] 892244d2cc08 (>lock){-.-.}-{2:2}, at: 
>>> bfq_insert_requests+0x89/0x680
>>> [ 1917.361456]
>>>  which lock already depends on the new lock.
>>> 
>>> [ 1917.361463]
>>>  the existing dependency chain (in reverse order) is:
>>> [ 1917.361469]
>>>  -> #1 (>lock){-.-.}-{2:2}:
>>> [ 1917.361479]_raw_spin_lock_irqsave+0x3d/0x50
>>> [ 1917.361484]bfq_exit_icq_bfqq+0x48/0x3f0
>>> [ 1917.361489]bfq_exit_icq+0x13/0x20
>>> [ 1917.361494]put_io_context_active+0x55/0x80
>>> [ 1917.361499]do_exit+0x72c/0xca0
>>> [ 1917.361504]do_group_exit+0x47/0xb0
>>> [ 1917.361508]__x64_sys_exit_group+0x14/0x20
>>> [ 1917.361513]do_syscall_64+0x33/0x40
>>> [ 1917.361518]entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [ 1917.361523]
>>>  -> #0 (>lock){..-.}-{2:2}:
>>> [ 1917.361532]__lock_acquire+0x149d/0x1a70
>>> [ 1917.361537]lock_acquire+0x1a7/0x3b0
>>> [ 1917.361542]_raw_spin_lock_irqsave+0x3d/0x50
>>> [ 1917.361547]put_io_context+0x30/0x90
>>> [ 1917.361552]blk_mq_free_request+0x4f/0x140
>>> [ 1917.361557]blk_attempt_req_merge+0x19/0x30
>>> [ 1917.361563]elv_attempt_insert_merge+0x4f/0x90
>>> [ 1917.361568]blk_mq_sched_try_insert_merge+0x28/0x40
>>> [ 1917.361574]bfq_insert_requests+0x94/0x680
>>> [ 1917.361579]blk_mq_sched_insert_requests+0xd1/0x2a0
>>> [ 1917.361584]blk_mq_flush_plug_list+0x12d/0x240
>>> [ 1917.361589]blk_flush_plug_list+0xb4/0xd0
>>> [ 1917.361594]io_schedule_prepare+0x3c/0x40
>>> [ 1917.361599]io_schedule+0xb/0x40
>>> [ 1917.361604]blk_mq_get_tag+0x13a/0x250
>>> [ 1917.361608]__blk_mq_alloc_request+0x5c/0x130
>>> [ 1917.361613]blk_mq_submit_bio+0xf3/0x770
>>> [ 1917.361618]submit_bio_noacct+0x41e/0x4b0
>>> [ 1917.361622]submit_bio+0x33/0x160
>>> [ 1917.361644]ext4_io_submit+0x49/0x60 [ext4]
>>> [ 1917.361661]ext4_writepages+0x683/0x1070 [ext4]
>>> [ 1917.361667]do_writepages+0x3c/0xe0
>>> [ 1917.361672]__writeback_single_inode+0x62/0x630
>>> [ 1917.361677]writeback_sb_inodes+0x218/0x4d0
>>> [ 1917.361681]__writeback_inodes_wb+0x5f/0xc0
>>> [ 1917.361686]wb_writeback+0x283/0x490
>>> [ 1917.361691]wb_workfn+0x29a/0x670
>>> [ 1917.361696]process_one_work+0x283/0x620
>>> [ 1917.361701]worker_thread+0x39/0x3f0
>>> [ 1917.361706]kthread+0x152/0x170
>>> [ 1917.361711]ret_from_fork+0x1f/0x30
>>> [ 1917.361715]
>>>  other info that might help us debug this:
>>> 
>>> [ 1917.361722]  Possible unsafe locking scenario:
>>> 
>>> [ 1917.361728]CPU0CPU1
>>> [ 1917.361731]
>>> [ 1917.361736]   lock(>lock);
>>> [ 1917.361740]lock(>lock);
>>> [ 1917.361746]lock(>lock);
>>> [ 1917.361752]   lock(>lock);
>>> [ 1917.361757]
>>>   *** DEADLOCK ***
>>> 
>>> [ 1917.361763] 5 locks held by kworker/u16:35/15995:
>>> [ 1917.361767]  #0: 892240c9bd38 
>>> ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x1fa/0x620
>>> [ 1917.361778]  #1: 94569342fe78 
>>> ((work_completion)(&(>dwork)->work)){+.+.}-{0:0}, at: 
>>> process_one_work+0x1fa/0x620
>>> [ 1917.361789]  #2: 8921424

Re: block, bfq: lockdep circular locking dependency gripe

2020-10-20 Thread Paolo Valente
Hi,
that's apparently hard to solve inside bfq. The ioc of the task is being
exited while the task is still inside the code path for having an I/O request
served. Is that normal?

Thanks,
Paolo

> Il giorno 20 ott 2020, alle ore 08:15, Mike Galbraith  ha 
> scritto:
> 
> [ 1917.361401] ==
> [ 1917.361406] WARNING: possible circular locking dependency detected
> [ 1917.361413] 5.9.0.g7cf726a-master #2 Tainted: G S  E
> [ 1917.361417] --
> [ 1917.361422] kworker/u16:35/15995 is trying to acquire lock:
> [ 1917.361428] 89232237f7e0 (>lock){..-.}-{2:2}, at: 
> put_io_context+0x30/0x90
> [ 1917.361440]
>   but task is already holding lock:
> [ 1917.361445] 892244d2cc08 (>lock){-.-.}-{2:2}, at: 
> bfq_insert_requests+0x89/0x680
> [ 1917.361456]
>   which lock already depends on the new lock.
> 
> [ 1917.361463]
>   the existing dependency chain (in reverse order) is:
> [ 1917.361469]
>   -> #1 (>lock){-.-.}-{2:2}:
> [ 1917.361479]_raw_spin_lock_irqsave+0x3d/0x50
> [ 1917.361484]bfq_exit_icq_bfqq+0x48/0x3f0
> [ 1917.361489]bfq_exit_icq+0x13/0x20
> [ 1917.361494]put_io_context_active+0x55/0x80
> [ 1917.361499]do_exit+0x72c/0xca0
> [ 1917.361504]do_group_exit+0x47/0xb0
> [ 1917.361508]__x64_sys_exit_group+0x14/0x20
> [ 1917.361513]do_syscall_64+0x33/0x40
> [ 1917.361518]entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1917.361523]
>   -> #0 (>lock){..-.}-{2:2}:
> [ 1917.361532]__lock_acquire+0x149d/0x1a70
> [ 1917.361537]lock_acquire+0x1a7/0x3b0
> [ 1917.361542]_raw_spin_lock_irqsave+0x3d/0x50
> [ 1917.361547]put_io_context+0x30/0x90
> [ 1917.361552]blk_mq_free_request+0x4f/0x140
> [ 1917.361557]blk_attempt_req_merge+0x19/0x30
> [ 1917.361563]elv_attempt_insert_merge+0x4f/0x90
> [ 1917.361568]blk_mq_sched_try_insert_merge+0x28/0x40
> [ 1917.361574]bfq_insert_requests+0x94/0x680
> [ 1917.361579]blk_mq_sched_insert_requests+0xd1/0x2a0
> [ 1917.361584]blk_mq_flush_plug_list+0x12d/0x240
> [ 1917.361589]blk_flush_plug_list+0xb4/0xd0
> [ 1917.361594]io_schedule_prepare+0x3c/0x40
> [ 1917.361599]io_schedule+0xb/0x40
> [ 1917.361604]blk_mq_get_tag+0x13a/0x250
> [ 1917.361608]__blk_mq_alloc_request+0x5c/0x130
> [ 1917.361613]blk_mq_submit_bio+0xf3/0x770
> [ 1917.361618]submit_bio_noacct+0x41e/0x4b0
> [ 1917.361622]submit_bio+0x33/0x160
> [ 1917.361644]ext4_io_submit+0x49/0x60 [ext4]
> [ 1917.361661]ext4_writepages+0x683/0x1070 [ext4]
> [ 1917.361667]do_writepages+0x3c/0xe0
> [ 1917.361672]__writeback_single_inode+0x62/0x630
> [ 1917.361677]writeback_sb_inodes+0x218/0x4d0
> [ 1917.361681]__writeback_inodes_wb+0x5f/0xc0
> [ 1917.361686]wb_writeback+0x283/0x490
> [ 1917.361691]wb_workfn+0x29a/0x670
> [ 1917.361696]process_one_work+0x283/0x620
> [ 1917.361701]worker_thread+0x39/0x3f0
> [ 1917.361706]kthread+0x152/0x170
> [ 1917.361711]ret_from_fork+0x1f/0x30
> [ 1917.361715]
>   other info that might help us debug this:
> 
> [ 1917.361722]  Possible unsafe locking scenario:
> 
> [ 1917.361728]CPU0CPU1
> [ 1917.361731]
> [ 1917.361736]   lock(>lock);
> [ 1917.361740]lock(>lock);
> [ 1917.361746]lock(>lock);
> [ 1917.361752]   lock(>lock);
> [ 1917.361757]
>*** DEADLOCK ***
> 
> [ 1917.361763] 5 locks held by kworker/u16:35/15995:
> [ 1917.361767]  #0: 892240c9bd38 ((wq_completion)writeback){+.+.}-{0:0}, 
> at: process_one_work+0x1fa/0x620
> [ 1917.361778]  #1: 94569342fe78 
> ((work_completion)(&(>dwork)->work)){+.+.}-{0:0}, at: 
> process_one_work+0x1fa/0x620
> [ 1917.361789]  #2: 8921424ae0e0 (>s_umount_key#39){}-{3:3}, 
> at: trylock_super+0x16/0x50
> [ 1917.361800]  #3: 8921424aaa40 (>s_writepages_rwsem){.+.+}-{0:0}, 
> at: do_writepages+0x3c/0xe0
> [ 1917.361811]  #4: 892244d2cc08 (>lock){-.-.}-{2:2}, at: 
> bfq_insert_requests+0x89/0x680
> [ 1917.361821]
>   stack backtrace:
> [ 1917.361827] CPU: 6 PID: 15995 Comm: kworker/u16:35 Kdump: loaded Tainted: 
> G S  E 5.9.0.g7cf726a-master #2
> [ 1917.361833] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
> 09/23/2013
> [ 1917.361840] Workqueue: writeback wb_workfn (flush-8:32)
> [ 1917.361846] Call Trace:
> [ 1917.361854]  dump_stack+0x77/0x97
> [ 1917.361860]  check_noncircular+0xe7/0x100
> [ 1917.361866]  ? __lock_acquire+0x2ce/0x1a70
> [ 1917.361872]  ? __lock_acquire+0x149d/0x1a70
> [ 1917.361877]  __lock_acquire+0x149d/0x1a70
> [ 1917.361884]  lock_acquire+0x1a7/0x3b0
> [ 

Re: [PATCH] bfq: fix blkio cgroup leakage

2020-07-09 Thread Paolo Valente



> Il giorno 9 lug 2020, alle ore 10:19, Dmitry Monakhov  
> ha scritto:
> 
> Paolo Valente  writes:
> 
>>> Il giorno 8 lug 2020, alle ore 19:48, Dmitry Monakhov  
>>> ha scritto:
>>> 
>>> Paolo Valente  writes:
>>> 
>>>> Hi,
>>>> sorry for the delay.  The commit you propose to drop fix the issues
>>>> reported in [1].
>>>> 
>>>> Such a commit does introduce the leak that you report (thank you for
>>>> spotting it).  Yet, according to the threads mentioned in [1],
>>>> dropping that commit would take us back to those issues.
>>>> 
>>>> Maybe the solution is to fix the unbalance that you spotted?
>>> I'm not quite shure that do I understand which bug was addressed for commit 
>>> db37a34c563b.
>>> AFAIU both bugs mentioned in original patchset was fixed by:
>>> 478de3380 ("block, bfq: deschedule empty bfq_queues not referred by any 
>>> proces")
>>> f718b0932 ( block, bfq: do not plug I/O for bfq_queues with no proc refs)"
>>> 
>>> So I review commit db37a34c563b as independent one.
>>> It introduces extra reference for bfq_groups via bfqg_and_blkg_get(),
>>> but do we actually need it here?
>>> 
>>> #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
>>> bfqd->root_group is holded by bfqd from bfq_init_queue()
>>> other bfq_queue objects are owned by corresponding blkcg from bfq_pd_alloc()
>>> So bfq_queue can not disappear under us.
>>> 
>> 
>> You are right, but incomplete.  No extra ref is needed for an entity
>> that represents a bfq_queue.  And this consideration mistook me before
>> I realized that that commit was needed.  The problem is that an entity
>> may also represent a group of entities.  In that case no reference is
>> taken through any bfq_queue.  The commit you want to remove takes this
>> missing reference.
> Sorry, It looks like I've mistyped sentance above, I ment to say bfq_group.
> So here is my statement corrected:
> #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
> bfqd->root_group is holded by bfqd from bfq_init_queue()
> other *bfq_group* objects are owned by corresponding blkcg, reference get 
> from bfq_pd_alloc()
> So *bfq_group* can not disappear under us.
> 
> So no extra reference is required for entity represents bfq_group. Commit is 
> not required.

No, the entity may remain alive and on some tree after bfq_pd_offline has been 
invoked.

Paolo

>> 
>> Paolo
>> 
>>> #IF CONFIG_BFQ_GROUP_IOSCHED is disabled:
>>> we have only one  bfqd->root_group object which allocated from 
>>> bfq_create_group_hierarch()
>>> and bfqg_and_blkg_get() bfqg_and_blkg_put() are noop
>>> 
>>> Resume: in both cases extra reference is not required, so I continue to
>>> insist that we should revert  commit db37a34c563b because it tries to
>>> solve a non existing issue, but introduce the real one.
>>> 
>>> Please correct me if I'm wrong.
>>>> 
>>>> I'll check it ASAP, unless you do it before me.
>>>> 
>>>> Thanks,
>>>> Paolo
>>>> 
>>>> [1] https://lkml.org/lkml/2020/1/31/94
>>>> 
>>>>> Il giorno 2 lug 2020, alle ore 12:57, Dmitry Monakhov 
>>>>>  ha scritto:
>>>>> 
>>>>> commit db37a34c563b ("block, bfq: get a ref to a group when adding it to 
>>>>> a service tree")
>>>>> introduce leak forbfq_group and blkcg_gq objects because of get/put
>>>>> imbalance. See trace balow:
>>>>> -> blkg_alloc
>>>>> -> bfq_pq_alloc
>>>>>   -> bfqg_get (+1)
>>>>> ->bfq_activate_bfqq
>>>>> ->bfq_activate_requeue_entity
>>>>>  -> __bfq_activate_entity
>>>>> ->bfq_get_entity
>>> ->> ->bfqg_and_blkg_get (+1)  < : Note1
>>>>> ->bfq_del_bfqq_busy
>>>>> ->bfq_deactivate_entity+0x53/0xc0 [bfq]
>>>>>  ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
>>>>>-> bfq_forget_entity(is_in_service = true)
>>>>>entity->on_st_or_in_serv = false   <=== :Note2
>>>>>if (is_in_service)
>>>>>return;  ==> do not touch reference
>>>>> -> blkcg_css_offline
>>>>> -> blkcg_destroy_blkgs
>>>>> -> blkg_destroy
>>>>> -> bfq_pd_offline
>>>>>  -> __bfq_deac

Re: [PATCH] bfq: fix blkio cgroup leakage

2020-07-09 Thread Paolo Valente



> Il giorno 8 lug 2020, alle ore 19:48, Dmitry Monakhov  
> ha scritto:
> 
> Paolo Valente  writes:
> 
>> Hi,
>> sorry for the delay.  The commit you propose to drop fix the issues
>> reported in [1].
>> 
>> Such a commit does introduce the leak that you report (thank you for
>> spotting it).  Yet, according to the threads mentioned in [1],
>> dropping that commit would take us back to those issues.
>> 
>> Maybe the solution is to fix the unbalance that you spotted?
> I'm not quite shure that do I understand which bug was addressed for commit 
> db37a34c563b.
> AFAIU both bugs mentioned in original patchset was fixed by:
> 478de3380 ("block, bfq: deschedule empty bfq_queues not referred by any 
> proces")
> f718b0932 ( block, bfq: do not plug I/O for bfq_queues with no proc refs)"
> 
> So I review commit db37a34c563b as independent one.
> It introduces extra reference for bfq_groups via bfqg_and_blkg_get(),
> but do we actually need it here?
> 
> #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
> bfqd->root_group is holded by bfqd from bfq_init_queue()
> other bfq_queue objects are owned by corresponding blkcg from bfq_pd_alloc()
> So bfq_queue can not disappear under us.
> 

You are right, but incomplete.  No extra ref is needed for an entity
that represents a bfq_queue.  And this very consideration misled me before
I realized that the commit was needed.  The problem is that an entity
may also represent a group of entities.  In that case no reference is
taken through any bfq_queue.  The commit you want to remove takes this
missing reference.
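
Purely as an illustration of this asymmetry, here is a stand-alone toy
model (user-space C, not kernel code): without the disputed commit, the
reference taken in bfq_get_entity() pins only queue entities, while a
group entity gets no reference at all on this path; that is exactly the
gap the commit fills.

#include <stdio.h>

/* toy entity: in BFQ, an entity embeds either a bfq_queue or a bfq_group */
struct toy_entity {
        int is_queue;   /* 1: represents a queue, 0: represents a group */
        int refs;       /* references pinning the containing object */
};

/* rough analogue of bfq_entity_to_bfqq(): "false" means group entity */
static int toy_entity_is_queue(const struct toy_entity *e)
{
        return e->is_queue;
}

/* toy bfq_get_entity(): only queue entities get a reference here */
static void toy_get_entity(struct toy_entity *e)
{
        if (toy_entity_is_queue(e))
                e->refs++;
        /* group entities: no reference taken, must be pinned elsewhere */
}

int main(void)
{
        struct toy_entity queue = { .is_queue = 1, .refs = 0 };
        struct toy_entity group = { .is_queue = 0, .refs = 0 };

        toy_get_entity(&queue);
        toy_get_entity(&group);
        /* prints "queue refs: 1, group refs: 0" */
        printf("queue refs: %d, group refs: %d\n", queue.refs, group.refs);
        return 0;
}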

Paolo

> #IF CONFIG_BFQ_GROUP_IOSCHED is disabled:
> we have only one  bfqd->root_group object which allocated from 
> bfq_create_group_hierarch()
> and bfqg_and_blkg_get() bfqg_and_blkg_put() are noop
> 
> Resume: in both cases extra reference is not required, so I continue to
> insist that we should revert  commit db37a34c563b because it tries to
> solve a non existing issue, but introduce the real one.
> 
> Please correct me if I'm wrong.
>> 
>> I'll check it ASAP, unless you do it before me.
>> 
>> Thanks,
>> Paolo
>> 
>> [1] https://lkml.org/lkml/2020/1/31/94
>> 
>>> Il giorno 2 lug 2020, alle ore 12:57, Dmitry Monakhov  
>>> ha scritto:
>>> 
>>> commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a 
>>> service tree")
>>> introduce leak forbfq_group and blkcg_gq objects because of get/put
>>> imbalance. See trace balow:
>>> -> blkg_alloc
>>>  -> bfq_pq_alloc
>>>-> bfqg_get (+1)
>>> ->bfq_activate_bfqq
>>> ->bfq_activate_requeue_entity
>>>   -> __bfq_activate_entity
>>>  ->bfq_get_entity
> ->> ->bfqg_and_blkg_get (+1)  < : Note1
>>> ->bfq_del_bfqq_busy
>>> ->bfq_deactivate_entity+0x53/0xc0 [bfq]
>>>   ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
>>> -> bfq_forget_entity(is_in_service = true)
>>>  entity->on_st_or_in_serv = false   <=== :Note2
>>>  if (is_in_service)
>>>  return;  ==> do not touch reference
>>> -> blkcg_css_offline
>>> -> blkcg_destroy_blkgs
>>> -> blkg_destroy
>>>  -> bfq_pd_offline
>>>   -> __bfq_deactivate_entity
>>>if (!entity->on_st_or_in_serv) /* true, because (Note2)
>>> return false;
>>> -> bfq_pd_free
>>>   -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
>>> So bfq_group and blkcg_gq  will leak forever, see test-case below.
>>> If fact bfq_group objects reference counting are quite different
>>> from bfq_queue. bfq_groups object are referenced by blkcg_gq via
>>> blkg_policy_data pointer, so  neither nor blkg_get() neither bfqg_get
>>> required here.
>>> 
>>> 
>>> This patch drop commit db37a34c563b ("block, bfq: get a ref to a group when 
>>> adding it to a service tree")
>>> and add corresponding comment.
>>> 
>>> ##TESTCASE_BEGIN:
>>> #!/bin/bash
>>> 
>>> max_iters=${1:-100}
>>> #prep cgroup mounts
>>> mount -t tmpfs cgroup_root /sys/fs/cgroup
>>> mkdir /sys/fs/cgroup/blkio
>>> mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
>>> 
>>> # Prepare blkdev
>>> grep blkio /proc/cgroups
>>> truncate -s 1M img
>>> losetup /dev/loop0 img
>>> echo bfq > /sys/block/loop0/queue/scheduler
>>> 
>>> grep blkio /proc/cgroups
>>> for (

Re: [PATCH] bfq: fix blkio cgroup leakage

2020-07-08 Thread Paolo Valente
Hi,
sorry for the delay.  The commit you propose to drop fixes the issues
reported in [1].

Such a commit does introduce the leak that you report (thank you for
spotting it).  Yet, according to the threads mentioned in [1],
dropping that commit would take us back to those issues.

Maybe the solution is to fix the imbalance that you spotted?

I'll check it ASAP, unless you do it before me.

Thanks,
Paolo

[1] https://lkml.org/lkml/2020/1/31/94

> Il giorno 2 lug 2020, alle ore 12:57, Dmitry Monakhov  
> ha scritto:
> 
> commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a 
> service tree")
> introduce leak forbfq_group and blkcg_gq objects because of get/put
> imbalance. See trace balow:
> -> blkg_alloc
>   -> bfq_pq_alloc
> -> bfqg_get (+1)
> ->bfq_activate_bfqq
>  ->bfq_activate_requeue_entity
>-> __bfq_activate_entity
>   ->bfq_get_entity
> ->bfqg_and_blkg_get (+1)  < : Note1
> ->bfq_del_bfqq_busy
>  ->bfq_deactivate_entity+0x53/0xc0 [bfq]
>->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
>  -> bfq_forget_entity(is_in_service = true)
>entity->on_st_or_in_serv = false   <=== :Note2
>if (is_in_service)
>return;  ==> do not touch reference
> -> blkcg_css_offline
> -> blkcg_destroy_blkgs
>  -> blkg_destroy
>   -> bfq_pd_offline
>-> __bfq_deactivate_entity
> if (!entity->on_st_or_in_serv) /* true, because (Note2)
>   return false;
> -> bfq_pd_free
>-> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
> So bfq_group and blkcg_gq  will leak forever, see test-case below.
> If fact bfq_group objects reference counting are quite different
> from bfq_queue. bfq_groups object are referenced by blkcg_gq via
> blkg_policy_data pointer, so  neither nor blkg_get() neither bfqg_get
> required here.
> 
> 
> This patch drop commit db37a34c563b ("block, bfq: get a ref to a group when 
> adding it to a service tree")
> and add corresponding comment.
> 
> ##TESTCASE_BEGIN:
> #!/bin/bash
> 
> max_iters=${1:-100}
> #prep cgroup mounts
> mount -t tmpfs cgroup_root /sys/fs/cgroup
> mkdir /sys/fs/cgroup/blkio
> mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
> 
> # Prepare blkdev
> grep blkio /proc/cgroups
> truncate -s 1M img
> losetup /dev/loop0 img
> echo bfq > /sys/block/loop0/queue/scheduler
> 
> grep blkio /proc/cgroups
> for ((i=0;i do
>mkdir -p /sys/fs/cgroup/blkio/a
>echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
>dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
>echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
>rmdir /sys/fs/cgroup/blkio/a
>grep blkio /proc/cgroups
> done
> ##TESTCASE_END:
> 
> Signed-off-by: Dmitry Monakhov 
> ---
> block/bfq-cgroup.c  |  2 +-
> block/bfq-iosched.h |  1 -
> block/bfq-wf2q.c| 15 +--
> 3 files changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index 68882b9..b791e20 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -332,7 +332,7 @@ static void bfqg_put(struct bfq_group *bfqg)
>   kfree(bfqg);
> }
> 
> -void bfqg_and_blkg_get(struct bfq_group *bfqg)
> +static void bfqg_and_blkg_get(struct bfq_group *bfqg)
> {
>   /* see comments in bfq_bic_update_cgroup for why refcounting bfqg */
>   bfqg_get(bfqg);
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index cd224aa..7038952 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -986,7 +986,6 @@ struct bfq_group *bfq_find_set_group(struct bfq_data 
> *bfqd,
> struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
> struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
> struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node);
> -void bfqg_and_blkg_get(struct bfq_group *bfqg);
> void bfqg_and_blkg_put(struct bfq_group *bfqg);
> 
> #ifdef CONFIG_BFQ_GROUP_IOSCHED
> diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
> index 34ad095..6a363bb 100644
> --- a/block/bfq-wf2q.c
> +++ b/block/bfq-wf2q.c
> @@ -529,13 +529,14 @@ static void bfq_get_entity(struct bfq_entity *entity)
> {
>   struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
> 
> + /* Grab reference only for bfq_queue's objects, bfq_group ones
> +  * are owned by blkcg_gq
> +  */
>   if (bfqq) {
>   bfqq->ref++;
>   bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
>bfqq, bfqq->ref);
> - } else
> - bfqg_and_blkg_get(container_of(entity, struct bfq_group,
> -entity));
> + }
> }
> 
> /**
> @@ -649,14 +650,8 @@ static void bfq_forget_entity(struct bfq_service_tree 
> *st,
> 
>   entity->on_st_or_in_serv = false;
>   st->wsum -= entity->weight;
> - if (is_in_service)
> - return;
> -
> - if (bfqq)
> + if (bfqq && !is_in_service)
>   bfq_put_queue(bfqq);
> - else
> - bfqg_and_blkg_put(container_of(entity, 

Re: [PATCH 0/2] block, bfq: make bfq disable iocost and present a double interface

2019-10-09 Thread Paolo Valente
Jens, Tejun,
can we proceed with this double-interface solution?

Thanks,
Paolo

> Il giorno 1 ott 2019, alle ore 21:33, Paolo Valente 
>  ha scritto:
> 
> Hi Jens,
> 
> the first patch in this series is Tejun's patch for making BFQ disable
> io.cost. The second patch makes BFQ present both the bfq-prefixes
> parameters and non-prefixed parameters, as suggested by Tejun [1].
> 
> In the first patch I've tried to use macros not to repeat code
> twice. checkpatch complains that these macros should be enclosed in
> parentheses. I don't see how to do it. I'm willing to switch to any
> better solution.
> 
> Thanks,
> Paolo
> 
> [1] https://lkml.org/lkml/2019/9/18/736
> 
> Paolo Valente (1):
>  block, bfq: present a double cgroups interface
> 
> Tejun Heo (1):
>  blkcg: Make bfq disable iocost when enabled
> 
> Documentation/admin-guide/cgroup-v2.rst |   8 +-
> Documentation/block/bfq-iosched.rst |  40 ++--
> block/bfq-cgroup.c  | 260 
> block/bfq-iosched.c |  32 +++
> block/blk-iocost.c  |   5 +-
> include/linux/blk-cgroup.h  |   5 +
> kernel/cgroup/cgroup.c  |   2 +
> 7 files changed, 201 insertions(+), 151 deletions(-)
> 
> --
> 2.20.1



Re: [PATCHSET v3 block/for-linus] IO cost model based work-conserving porportional controller

2019-08-29 Thread Paolo Valente
Hi,
I see an important interface problem.  Userspace has been waiting for
io.weight to eventually become the file name for setting the weight
for the proportional-share policy [1,2].  If you use that name, how
will we solve this?

Thanks,
Paolo

[1] https://github.com/systemd/systemd/issues/7057#issuecomment-522747575
[2] https://github.com/systemd/systemd/pull/13335#issuecomment-523035303

> Il giorno 29 ago 2019, alle ore 00:05, Tejun Heo  ha scritto:
> 
> Changes from v2[2]:
> 
> * Fixed a divide-by-zero bug in current_hweight().
> 
> * pre_start_time and friends renamed to alloc_time and now has its own
>  CONFIG option which is selected by IOCOST.
> 
> Changes from v1[1]:
> 
> * Prerequisite patchsets had cosmetic changes and merged.  Refreshed
>  on top.
> 
> * Renamed from ioweight to iocost.  All source code and tools are
>  updated accordingly.  Control knobs io.weight.qos and
>  io.weight.cost_model are renamed to io.cost.qos and io.cost.model
>  respectively.  This is a more fitting name which won't become a
>  misnomer when, for example, cost based io.max is added.
> 
> * Various bug fixes and improvements.  A few bugs were discovered
>  while testing against high-iops nvme device.  Auto parameter
>  selection improved and verified across different classes of SSDs.
> 
> * Dropped bpf iocost support for now.
> 
> * Added coef generation script.
> 
> * Verified on high-iops nvme device.  Result is included below.
> 
> One challenge of controlling IO resources is the lack of trivially
> observable cost metric.  This is distinguished from CPU and memory
> where wallclock time and the number of bytes can serve as accurate
> enough approximations.
> 
> Bandwidth and iops are the most commonly used metrics for IO devices
> but depending on the type and specifics of the device, different IO
> patterns easily lead to multiple orders of magnitude variations
> rendering them useless for the purpose of IO capacity distribution.
> While on-device time, with a lot of clutches, could serve as a useful
> approximation for non-queued rotational devices, this is no longer
> viable with modern devices, even the rotational ones.
> 
> While there is no cost metric we can trivially observe, it isn't a
> complete mystery.  For example, on a rotational device, seek cost
> dominates while a contiguous transfer contributes a smaller amount
> proportional to the size.  If we can characterize at least the
> relative costs of these different types of IOs, it should be possible
> to implement a reasonable work-conserving proportional IO resource
> distribution.
> 
> This patchset implements IO cost model based work-conserving
> proportional controller.  It currently has a simple linear cost model
> builtin where each IO is classified as sequential or random and given
> a base cost accordingly and additional size-proportional cost is added
> on top.  Each IO is given a cost based on the model and the controller
> issues IOs for each cgroup according to their hierarchical weight.
> 
> By default, the controller adapts its overall IO rate so that it
> doesn't build up buffer bloat in the request_queue layer, which
> guarantees that the controller doesn't lose significant amount of
> total work.  However, this may not provide sufficient differentiation
> as the underlying device may have a deep queue and not be fair in how
> the queued IOs are serviced.  The controller provides extra QoS
> control knobs which allow tightening control feedback loop as
> necessary.
> 
> For more details on the control mechanism, implementation and
> interface, please refer to the comment at the top of
> block/blk-iocost.c and Documentation/admin-guide/cgroup-v2.rst changes
> in the "blkcg: implement blk-iocost" patch.
> 
> Here are some test results.  Each test run goes through the following
> combinations with each combination running for a minute.  All tests
> are performed against regular files on btrfs w/ deadline as the IO
> scheduler.  Random IOs are direct w/ queue depth of 64.  Sequential
> are normal buffered IOs.
> 
>high priority (weight=500)  low priority (weight=100)
> 
>Rand read   None
>ditto   Rand read
>ditto   Seq  read
>ditto   Rand write
>ditto   Seq  write
>Seq  read   None
>ditto   Rand read
>ditto   Seq  read
>ditto   Rand write
>ditto   Seq  write
>Rand write  None
>ditto   Rand read
>ditto   Seq  read
>ditto   Rand write
>ditto   Seq  write
>Seq  write  None
>ditto   Rand read
>ditto 

Re: RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0

2019-08-08 Thread Paolo Valente



> Il giorno 8 ago 2019, alle ore 12:21, Sander Eikelenboom 
>  ha scritto:
> 
> On 08/08/2019 11:10, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 8 ago 2019, alle ore 11:05, Sander Eikelenboom 
>>>  ha scritto:
>>> 
>>> L.S.,
>>> 
>>> While testing a linux 5.3-rc3 kernel on my Xen server I come across the 
>>> splat below when trying to shutdown all the VM's.
>>> This is after the server has ran for a few days without any problem. It 
>>> seems to happen consistently.
>>> 
>>> It seems it's in the same area as dbc3117d4ca9e17819ac73501e914b8422686750, 
>>> but already rc3 incorporates that patch.
>>> 
>>> Any ideas ?
>>> 
>> 
>> Could you try these fixes I proposed yesterday:
>> https://lkml.org/lkml/2019/8/7/536
>> or, on patchwork:
>> https://patchwork.kernel.org/patch/11082247/
>> https://patchwork.kernel.org/patch/11082249/
> 
> Hi Paolo,
> 
> These two above seem to fix the issue !
> So thanks for the swift reply (and the patchwork links for easy
> downloading the patches).
> 
> I will test the third unrelated patch as well, but if you don't hear
> back , it's all good.
> 

Great! Thank you for offering to test the other patch as well. Tested-by tags
are welcome too :)

Thanks,
Paolo

> Thanks again !
> 
> --
> Sander
> 
>> I posted a further fix too, which should be unrelated. But, just in case:
>> https://lkml.org/lkml/2019/8/7/715
>> or, on patchwork:
>> https://patchwork.kernel.org/patch/11082521/
>> 
>> Crossing my fingers (and think you for reporting this),
>> Paolo
>> 
>>> --
>>> Sander
>>> 
>>> 
>>> [80915.716048] BUG: unable to handle page fault for address: 
>>> 1008
>>> [80915.724188] #PF: supervisor write access in kernel mode
>>> [80915.733182] #PF: error_code(0x0002) - not-present page
>>> [80915.741455] PGD 0 P4D 0 
>>> [80915.750538] Oops: 0002 [#1] SMP NOPTI
>>> [80915.758425] CPU: 4 PID: 11407 Comm: 17.hda-2 Tainted: GW 
>>> 5.3.0-rc3-20190807-doflr+ #1
>>> [80915.766137] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>>> V1.8B1 09/13/2010
>>> [80915.773737] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
>>> [80915.781294] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 
>>> f0 01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 
>>> <48> 89 4e 08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
>>> [80915.796792] RSP: e02b:c9000473be28 EFLAGS: 00010006
>>> [80915.804419] RAX: 888070393200 RBX: 888076c4a800 RCX: 
>>> 888076c4a9f8
>>> [80915.810254] device vif17.0 left promiscuous mode
>>> [80915.811906] RDX: 1000 RSI: 1000 RDI: 
>>> 
>>> [80915.811908] RBP: 888077efc398 R08: 0004 R09: 
>>> 81106800
>>> [80915.811909] R10: 88807804ca40 R11: c9000473be31 R12: 
>>> 888005256bf0
>>> [80915.811909] R13:  R14: 888005256800 R15: 
>>> 82a6a3c0
>>> [80915.811919] FS:  7f1c30a8dbc0() GS:88807d50() 
>>> knlGS:
>>> [80915.819456] xen_bridge: port 18(vif17.0) entered disabled state
>>> [80915.826569] CS:  1e030 DS:  ES:  CR0: 80050033
>>> [80915.826571] CR2: 1008 CR3: 5d9d CR4: 
>>> 0660
>>> [80915.826575] Call Trace:
>>> [80915.826592]  bfq_exit_icq+0xe/0x20
>>> [80915.826595]  put_io_context_active+0x52/0x80
>>> [80915.826599]  do_exit+0x774/0xac0
>>> [80915.906037]  ? xen_blkif_be_int+0x30/0x30
>>> [80915.913311]  kthread+0xda/0x130
>>> [80915.920398]  ? kthread_park+0x80/0x80
>>> [80915.927524]  ret_from_fork+0x22/0x40
>>> [80915.934512] Modules linked in:
>>> [80915.941412] CR2: 1008
>>> [80915.948221] ---[ end trace 61315493e0f8ef40 ]---
>>> [80915.954984] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
>>> [80915.961850] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 
>>> f0 01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 
>>> <48> 89 4e 08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
>>> [80915.976124] RSP: e02b:c9000473be28 EFLAGS: 00010006
>>> [80915.983205] RAX: 888070393200 RBX: 888076c4a800 RCX: 
>>> 888076c4a9f8
>>> [80915.990321] RDX: 1000 RSI: 1000 RDI: 
>>> 
>>> [80915.997319] RBP: 888077efc398 R08: 0004 R09: 
>>> 81106800
>>> [80916.004427] R10: 88807804ca40 R11: c9000473be31 R12: 
>>> 888005256bf0
>>> [80916.011525] R13:  R14: 888005256800 R15: 
>>> 82a6a3c0
>>> [80916.018679] FS:  7f1c30a8dbc0() GS:88807d50() 
>>> knlGS:
>>> [80916.025897] CS:  1e030 DS:  ES:  CR0: 80050033
>>> [80916.033116] CR2: 1008 CR3: 5d9d CR4: 
>>> 0660
>>> [80916.040348] Fixing recursive fault but reboot is needed!



Re: RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0

2019-08-08 Thread Paolo Valente



> Il giorno 8 ago 2019, alle ore 11:05, Sander Eikelenboom 
>  ha scritto:
> 
> L.S.,
> 
> While testing a linux 5.3-rc3 kernel on my Xen server I come across the splat 
> below when trying to shutdown all the VM's.
> This is after the server has ran for a few days without any problem. It seems 
> to happen consistently.
> 
> It seems it's in the same area as dbc3117d4ca9e17819ac73501e914b8422686750, 
> but already rc3 incorporates that patch.
> 
> Any ideas ?
> 

Could you try these fixes I proposed yesterday:
https://lkml.org/lkml/2019/8/7/536
or, on patchwork:
https://patchwork.kernel.org/patch/11082247/
https://patchwork.kernel.org/patch/11082249/

I posted a further fix too, which should be unrelated. But, just in case:
https://lkml.org/lkml/2019/8/7/715
or, on patchwork:
https://patchwork.kernel.org/patch/11082521/

Crossing my fingers (and thank you for reporting this),
Paolo

> --
> Sander
> 
> 
> [80915.716048] BUG: unable to handle page fault for address: 1008
> [80915.724188] #PF: supervisor write access in kernel mode
> [80915.733182] #PF: error_code(0x0002) - not-present page
> [80915.741455] PGD 0 P4D 0 
> [80915.750538] Oops: 0002 [#1] SMP NOPTI
> [80915.758425] CPU: 4 PID: 11407 Comm: 17.hda-2 Tainted: GW 
> 5.3.0-rc3-20190807-doflr+ #1
> [80915.766137] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
> V1.8B1 09/13/2010
> [80915.773737] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
> [80915.781294] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 f0 
> 01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 <48> 89 
> 4e 08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
> [80915.796792] RSP: e02b:c9000473be28 EFLAGS: 00010006
> [80915.804419] RAX: 888070393200 RBX: 888076c4a800 RCX: 
> 888076c4a9f8
> [80915.810254] device vif17.0 left promiscuous mode
> [80915.811906] RDX: 1000 RSI: 1000 RDI: 
> 
> [80915.811908] RBP: 888077efc398 R08: 0004 R09: 
> 81106800
> [80915.811909] R10: 88807804ca40 R11: c9000473be31 R12: 
> 888005256bf0
> [80915.811909] R13:  R14: 888005256800 R15: 
> 82a6a3c0
> [80915.811919] FS:  7f1c30a8dbc0() GS:88807d50() 
> knlGS:
> [80915.819456] xen_bridge: port 18(vif17.0) entered disabled state
> [80915.826569] CS:  1e030 DS:  ES:  CR0: 80050033
> [80915.826571] CR2: 1008 CR3: 5d9d CR4: 
> 0660
> [80915.826575] Call Trace:
> [80915.826592]  bfq_exit_icq+0xe/0x20
> [80915.826595]  put_io_context_active+0x52/0x80
> [80915.826599]  do_exit+0x774/0xac0
> [80915.906037]  ? xen_blkif_be_int+0x30/0x30
> [80915.913311]  kthread+0xda/0x130
> [80915.920398]  ? kthread_park+0x80/0x80
> [80915.927524]  ret_from_fork+0x22/0x40
> [80915.934512] Modules linked in:
> [80915.941412] CR2: 1008
> [80915.948221] ---[ end trace 61315493e0f8ef40 ]---
> [80915.954984] RIP: e030:bfq_exit_icq_bfqq+0x147/0x1c0
> [80915.961850] Code: 00 00 00 00 00 00 48 0f ba b0 20 01 00 00 0c 48 8b 88 f0 
> 01 00 00 48 85 c9 74 29 48 8b b0 e8 01 00 00 48 89 31 48 85 f6 74 04 <48> 89 
> 4e 08 48 c7 80 e8 01 00 00 00 00 00 00 48 c7 80 f0 01 00 00
> [80915.976124] RSP: e02b:c9000473be28 EFLAGS: 00010006
> [80915.983205] RAX: 888070393200 RBX: 888076c4a800 RCX: 
> 888076c4a9f8
> [80915.990321] RDX: 1000 RSI: 1000 RDI: 
> 
> [80915.997319] RBP: 888077efc398 R08: 0004 R09: 
> 81106800
> [80916.004427] R10: 88807804ca40 R11: c9000473be31 R12: 
> 888005256bf0
> [80916.011525] R13:  R14: 888005256800 R15: 
> 82a6a3c0
> [80916.018679] FS:  7f1c30a8dbc0() GS:88807d50() 
> knlGS:
> [80916.025897] CS:  1e030 DS:  ES:  CR0: 80050033
> [80916.033116] CR2: 1008 CR3: 5d9d CR4: 
> 0660
> [80916.040348] Fixing recursive fault but reboot is needed!



[BUGFIX 0/1] handle NULL return value by bfq_init_rq()

2019-08-07 Thread Paolo Valente
Hi Jens,
this is a hopefully complete version of the fix proposed by Guenter [1].

Thanks,
Paolo

[1] https://lkml.org/lkml/2019/7/22/824

Paolo Valente (1):
  block, bfq: handle NULL return value by bfq_init_rq()

 block/bfq-iosched.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

--
2.20.1


[PATCH BUGFIX IMPROVEMENT 1/1] block, bfq: check also in-flight I/O in dispatch plugging

2019-07-15 Thread Paolo Valente
Consider a sync bfq_queue Q that remains empty while in service, and
suppose that, when this happens, there is a fair amount of already
in-flight I/O not belonging to Q. This situation is not considered
when deciding whether to plug I/O dispatching (until new I/O arrives
for Q). But it has to be checked, for the following reason.

The drive may decide to serve in-flight non-Q's I/O requests before
Q's ones, thereby delaying the arrival of new I/O requests for Q
(recall that Q is sync). If I/O-dispatching is not plugged, then,
while Q remains empty, a basically uncontrolled amount of I/O from
other queues may be dispatched too, possibly causing the service of
Q's I/O to be delayed even longer in the drive. This problem gets more
and more serious as the speed and the queue depth of the drive grow,
because, as these two quantities grow, the probability to find no
queue busy but many requests in flight grows too.

This commit performs I/O-dispatch plugging in this scenario.  Plugging
minimizes the delay induced by already in-flight I/O, and enables Q to
recover the bandwidth it may lose because of this delay.

As a practical example, under write load on a Samsung SSD 970 PRO,
gnome-terminal starts in 1.5 seconds after this fix, against 15
seconds before the fix (as a reference, gnome-terminal takes about 35
seconds to start with any of the other I/O schedulers).

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 67 +
 1 file changed, 43 insertions(+), 24 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ddf042b36549..c719020fa121 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3744,38 +3744,57 @@ static void bfq_dispatch_remove(struct request_queue 
*q, struct request *rq)
  * there is no active group, then the primary expectation for
  * this device is probably a high throughput.
  *
- * We are now left only with explaining the additional
- * compound condition that is checked below for deciding
- * whether the scenario is asymmetric. To explain this
- * compound condition, we need to add that the function
+ * We are now left only with explaining the two sub-conditions in the
+ * additional compound condition that is checked below for deciding
+ * whether the scenario is asymmetric. To explain the first
+ * sub-condition, we need to add that the function
  * bfq_asymmetric_scenario checks the weights of only
- * non-weight-raised queues, for efficiency reasons (see
- * comments on bfq_weights_tree_add()). Then the fact that
- * bfqq is weight-raised is checked explicitly here. More
- * precisely, the compound condition below takes into account
- * also the fact that, even if bfqq is being weight-raised,
- * the scenario is still symmetric if all queues with requests
- * waiting for completion happen to be
- * weight-raised. Actually, we should be even more precise
- * here, and differentiate between interactive weight raising
- * and soft real-time weight raising.
+ * non-weight-raised queues, for efficiency reasons (see comments on
+ * bfq_weights_tree_add()). Then the fact that bfqq is weight-raised
+ * is checked explicitly here. More precisely, the compound condition
+ * below takes into account also the fact that, even if bfqq is being
+ * weight-raised, the scenario is still symmetric if all queues with
+ * requests waiting for completion happen to be
+ * weight-raised. Actually, we should be even more precise here, and
+ * differentiate between interactive weight raising and soft real-time
+ * weight raising.
+ *
+ * The second sub-condition checked in the compound condition is
+ * whether there is a fair amount of already in-flight I/O not
+ * belonging to bfqq. If so, I/O dispatching is to be plugged, for the
+ * following reason. The drive may decide to serve in-flight
+ * non-bfqq's I/O requests before bfqq's ones, thereby delaying the
+ * arrival of new I/O requests for bfqq (recall that bfqq is sync). If
+ * I/O-dispatching is not plugged, then, while bfqq remains empty, a
+ * basically uncontrolled amount of I/O from other queues may be
+ * dispatched too, possibly causing the service of bfqq's I/O to be
+ * delayed even longer in the drive. This problem gets more and more
+ * serious as the speed and the queue depth of the drive grow,
+ * because, as these two quantities grow, the probability to find no
+ * queue busy but many requests in flight grows too. By contrast,
+ * plugging I/O dispatching minimizes the delay induced by already
+ * in-flight I/O, and enables bfqq to recover the bandwidth it may
+ * lose because of this delay.
  *
  * As a side note, it is worth considering that the above
- * device-idling countermeasures may however fail in the
- * following unlucky scenario: if idling is (correctly)
- * disabled in a time period during which all symmetry
- * sub-conditions hold, and hence the device is allowed to
- * enqueue many requests, but at some later point in time some
- * sub-condition

Re: [PATCH 6/6] block: rename CONFIG_DEBUG_BLK_CGROUP to CONFIG_BFQ_CGROUP_DEBUG

2019-06-07 Thread Paolo Valente


> Il giorno 6 giu 2019, alle ore 12:26, Christoph Hellwig  ha 
> scritto:
> 
> This option is entirely bfq specific, give it an appropinquate name.
> 
> Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all
> the functionality already does so anyway.
> 

Acked-by: Paolo Valente 

> Signed-off-by: Christoph Hellwig 
> ---
> Documentation/block/bfq-iosched.txt  | 12 -
> Documentation/cgroup-v1/blkio-controller.txt | 12 -
> block/Kconfig.iosched|  7 +
> block/bfq-cgroup.c   | 27 ++--
> block/bfq-iosched.c  |  8 +++---
> block/bfq-iosched.h  |  4 +--
> init/Kconfig |  8 --
> 7 files changed, 38 insertions(+), 40 deletions(-)
> 
> diff --git a/Documentation/block/bfq-iosched.txt 
> b/Documentation/block/bfq-iosched.txt
> index 1a0f2ac02eb6..f02163fabf80 100644
> --- a/Documentation/block/bfq-iosched.txt
> +++ b/Documentation/block/bfq-iosched.txt
> @@ -38,13 +38,13 @@ stack). To give an idea of the limits with BFQ, on slow 
> or average
> CPUs, here are, first, the limits of BFQ for three different CPUs, on,
> respectively, an average laptop, an old desktop, and a cheap embedded
> system, in case full hierarchical support is enabled (i.e.,
> -CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not
> +CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
> set (Section 4-2):
> - Intel i7-4850HQ: 400 KIOPS
> - AMD A8-3850: 250 KIOPS
> - ARM CortexTM-A53 Octa-core: 80 KIOPS
> 
> -If CONFIG_DEBUG_BLK_CGROUP is set (and of course full hierarchical
> +If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
> support is enabled), then the sustainable throughput with BFQ
> decreases, because all blkio.bfq* statistics are created and updated
> (Section 4-2). For BFQ, this leads to the following maximum
> @@ -537,19 +537,19 @@ or io.bfq.weight.
> 
> As for cgroups-v1 (blkio controller), the exact set of stat files
> created, and kept up-to-date by bfq, depends on whether
> -CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
> +CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
> the stat files documented in
> Documentation/cgroup-v1/blkio-controller.txt. If, instead,
> -CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
> +CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files
> blkio.bfq.io_service_bytes
> blkio.bfq.io_service_bytes_recursive
> blkio.bfq.io_serviced
> blkio.bfq.io_serviced_recursive
> 
> -The value of CONFIG_DEBUG_BLK_CGROUP greatly influences the maximum
> +The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
> throughput sustainable with bfq, because updating the blkio.bfq.*
> stats is rather costly, especially for some of the stats enabled by
> -CONFIG_DEBUG_BLK_CGROUP.
> +CONFIG_BFQ_CGROUP_DEBUG.
> 
> Parameters to set
> -
> diff --git a/Documentation/cgroup-v1/blkio-controller.txt 
> b/Documentation/cgroup-v1/blkio-controller.txt
> index 673dc34d3f78..47cf84102f88 100644
> --- a/Documentation/cgroup-v1/blkio-controller.txt
> +++ b/Documentation/cgroup-v1/blkio-controller.txt
> @@ -126,7 +126,7 @@ Various user visible config options
> CONFIG_BLK_CGROUP
>   - Block IO controller.
> 
> -CONFIG_DEBUG_BLK_CGROUP
> +CONFIG_BFQ_CGROUP_DEBUG
>   - Debug help. Right now some additional stats file show up in cgroup
> if this option is enabled.
> 
> @@ -246,13 +246,13 @@ Proportional weight policy files
> write, sync or async.
> 
> - blkio.avg_queue_size
> - - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
> + - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
> The average queue size for this cgroup over the entire time of this
> cgroup's existence. Queue size samples are taken each time one of the
> queues of this cgroup gets a timeslice.
> 
> - blkio.group_wait_time
> - - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
> + - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
> This is the amount of time the cgroup had to wait since it became busy
> (i.e., went from 0 to 1 request queued) to get a timeslice for one of
> its queues. This is different from the io_wait_time which is the
> @@ -263,7 +263,7 @@ Proportional weight policy files
> got a timeslice and will not include the current delta.
> 
> - blkio.empty_time
> - - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
> + - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
>

Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-05-22 Thread Paolo Valente


> Il giorno 22 mag 2019, alle ore 12:01, Srivatsa S. Bhat 
>  ha scritto:
> 
> On 5/22/19 2:09 AM, Paolo Valente wrote:
>> 
>> First, thank you very much for testing my patches, and, above all, for
>> sharing those huge traces!
>> 
>> According to the your traces, the residual 20% lower throughput that you
>> record is due to the fact that the BFQ injection mechanism takes a few
>> hundredths of seconds to stabilize, at the beginning of the workload.
>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>> that you see without this new patch.  After that time, there
>> seems to be no loss according to the trace.
>> 
>> The problem is that a loss lasting only a few hundredths of seconds is
>> however not negligible for a write workload that lasts only 3-4
>> seconds.  Could you please try writing a larger file?
>> 
> 
> I tried running dd for longer (about 100 seconds), but still saw around
> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
> mq-deadline and noop.

Ok, then the cause is the periodic reset of the mechanism.

It would be super easy to fill this gap, by just gearing the mechanism
toward a very aggressive injection.  The problem is maintaining
control.  As you can imagine from the performance gap between CFQ (or
BFQ with malfunctioning injection) and BFQ with this fix, it is very
hard to succeed in maximizing the throughput while at the same time
preserving control on per-group I/O.

On the bright side, you might be interested in one of the benefits
that BFQ gives in return for this ~10% loss of throughput, in a
scenario that may be important for you (according to affiliation you
report): from ~500% to ~1000% higher throughput when you have to serve
the I/O of multiple VMs, and to guarantee at least no starvation to
any VM [1].  The same holds with multiple clients or containers, and
in general with any set of entities that may compete for storage.

[1] 
https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/

> But I'm not too worried about that difference.
> 
>> In addition, I wanted to ask you whether you measured BFQ throughput
>> with traces disabled.  This may make a difference.
>> 
> 
> The above result (1.4 MB/s) was obtained with traces disabled.
> 
>> After trying writing a larger file, you can try with low_latency on.
>> On my side, it causes results to become a little unstable across
>> repetitions (which is expected).
>> 
> With low_latency on, I get between 60 KB/s - 100 KB/s.
> 

Gosh, full regression.  Fortunately, it is simply meaningless to use
low_latency in a scenario where the goal is to guarantee per-group
bandwidths.  Low-latency heuristics, to reach their (low-latency)
goals, modify the I/O schedule compared to the best schedule for
honoring group weights and boosting throughput.  So, as recommended in
BFQ documentation, just switch low_latency off if you want to control
I/O with groups.  It may still make sense to leave low_latency on
in some specific case, which I don't want to bother you about.
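
For reference, when BFQ is the active scheduler the knob normally lives
at /sys/block/<dev>/queue/iosched/low_latency, so from a shell it is just
a matter of writing 0 to that file.  The minimal stand-alone C snippet
below does the same thing; the device name is only an example and the
path is an assumption about your setup:

#include <stdio.h>

int main(void)
{
        /* adjust "sda" to the device under test; BFQ must be its scheduler */
        const char *knob = "/sys/block/sda/queue/iosched/low_latency";
        FILE *f = fopen(knob, "w");

        if (!f) {
                perror(knob);
                return 1;
        }
        fputs("0\n", f);        /* 0 = heuristic off, 1 = on (the default) */
        return fclose(f) ? 1 : 0;
}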

However, I feel bad with such a low throughput :)  Would you be so
kind to provide me with a trace?

Thanks,
Paolo

> Regards,
> Srivatsa
> VMware Photon OS





Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-05-21 Thread Paolo Valente


> Il giorno 20 mag 2019, alle ore 12:19, Paolo Valente 
>  ha scritto:
> 
> 
> 
>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat 
>>  ha scritto:
>> 
>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for BFQ,
>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>> for you.
>>> 
>> 
>> Hi Paolo,
>> 
>> Thank you for looking into this!
>> 
>> I just tried current mainline at commit 72cf0b07, but unfortunately
>> didn't see any improvement:
>> 
>> dd if=/dev/zero of=/root/test.img bs=512 count=1 oflag=dsync
>> 
>> With mq-deadline, I get:
>> 
>> 512 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>> 
>> With bfq, I get:
>> 512 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>> 
> 
> Hi Srivatsa,
> thanks for reproducing this on mainline.  I seem to have reproduced a
> bonsai-tree version of this issue.

Hi again Srivatsa,
I've analyzed the trace, and I've found the cause of the loss of
throughput on my side.  To find out whether it is the same cause as
on your side, I've prepared a script that executes your test and takes
a trace during the test.  If ok for you, could you please
- change the value for the DEVS parameter in the attached script, if
  needed
- execute the script
- send me the trace file that the script will leave in your working
dir

Looking forward to your trace,
Paolo



dsync_test.sh
Description: Binary data

>  Before digging into the block
> trace, I'd like to ask you for some feedback.
> 
> First, in my test, the total throughput of the disk happens to be
> about 20 times as high as that enjoyed by dd, regardless of the I/O
> scheduler.  I guess this massive overhead is normal with dsync, but
I'd like to know whether it is about the same on your side.  This will
> help me understand whether I'll actually be analyzing about the same
> problem as yours.
> 
> Second, the commands I used follow.  Do they implement your test case
> correctly?
> 
> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
> [root@localhost tmp]# echo $BASHPID > 
> /sys/fs/cgroup/blkio/testgrp/cgroup.procs
> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=1 
> oflag=dsync
> 1+0 record dentro
> 1+0 record fuori
> 512 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=1 
> oflag=dsync
> 1+0 record dentro
> 1+0 record fuori
> 512 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
> 
> Thanks,
> Paolo
> 
>> Please let me know if any more info about my setup might be helpful.
>> 
>> Thank you!
>> 
>> Regards,
>> Srivatsa
>> VMware Photon OS
>> 
>>> 
>>>> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat 
>>>>  ha scritto:
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>>> running the following command, with the CFQ I/O scheduler:
>>>> 
>>>> dd if=/dev/zero of=/root/test.img bs=512 count=1 oflags=dsync
>>>> 
>>>> Throughput with CFQ: 60 KB/s
>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>> 
>>>> I spent some time looking into it and found that this is caused by the
>>>> undesirable interaction between 4 different components:
>>>> 
>>>> - blkio cgroup controller enabled
>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>> 
>>>> 
>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>> system.slice to run system services (and docker) under it, and a
>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>> dd is invoked, it runs under user.slice.
>>>> 
>>>> The dd command above includes the dsync flag, which performs an
>>>> fdatasync after every write to the output file. Since dd is writing to
>>>> a file on ext4, jbd2 will be active, committing transactions
>>>> c

Re: [PATCH 1/1] block, bfq: delete "bfq" prefix from cgroup filenames

2019-04-08 Thread Paolo Valente



> Il giorno 8 apr 2019, alle ore 17:13, Jens Axboe  ha scritto:
> 
> On 4/8/19 9:08 AM, Johannes Thumshirn wrote:
>> On Mon, Apr 08, 2019 at 09:05:19AM -0600, Jens Axboe wrote:
>>> I did consider that, and that would be doable. But honestly, I'm having a
>>> hard time seeing what issue we are attempting to fix by doing this.
>> 
>> Yeah, I guess the real fix would be to update the documentation and the
>> expectations user-space has. Including eventual re-write of some udev rules 
>> or
>> whatever is facing these files. But to me that sounds more like a systemd or
>> even distro thing than a kernel thing.
> 
> I agree. Trying to force someones hand by renaming isn't going to fix
> anything, but it will potentially cause issues.
> 

Potential issues, versus concrete, big issues that are already with us.  The
proportional-share interface doesn't match the idea people have of it.

I don't want to push for this solution, but we cannot pretend we don't
have a big problem already.  Any solution that could really work is ok
for me, including symlinks.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



Re: [PATCH 1/1] block, bfq: delete "bfq" prefix from cgroup filenames

2019-04-08 Thread Paolo Valente



> Il giorno 8 apr 2019, alle ore 17:05, Jens Axboe  ha scritto:
> 
> On 4/8/19 9:04 AM, Johannes Thumshirn wrote:
>> [+Cc Michal ]
>> On Mon, Apr 08, 2019 at 04:54:39PM +0200, Paolo Valente wrote:
>>> 
>>> 
>>>> Il giorno 8 apr 2019, alle ore 16:49, Johannes Thumshirn 
>>>>  ha scritto:
>>>> 
>>>> On Mon, Apr 08, 2019 at 04:39:35PM +0200, Paolo Valente wrote:
>>>>> From: Angelo Ruocco 
>>>>> 
>>>>> When bfq was merged into mainline, there were two I/O schedulers that
>>>>> implemented the proportional-share policy: bfq for blk-mq and cfq for
>>>>> legacy blk. bfq's interface files in the blkio/io controller have the
>>>>> same names as cfq. But the cgroups interface doesn't allow two
>>>>> entities to use the same name for their files, so for bfq we had to
>>>>> prepend the "bfq" prefix to each of its files. However no legacy code
>>>>> uses these modified file names. This naming also causes confusion, as,
>>>>> e.g., in [1].
>>>>> 
>>>>> Now cfq has gone with legacy blk, so there is no need any longer for
>>>>> these prefixes in (the never used) bfq names. In view of this fact, this
>>>>> commit removes these prefixes, thereby enabling legacy code to truly
>>>>> use the proportional share policy in blk-mq.
>>>>> 
>>>>> [1] https://github.com/systemd/systemd/issues/7057
>>>> 
>>>> Hmm, but isn't this a user-space facing interface and thus some sort of 
>>>> ABI?
>>>> Do you know what's using it and what breaks due to this conversion?
>>>> 
>>> 
>>> Yep, but AFAIK, the problem is exactly the opposite: nobody uses these
>>> names for the proportional-share policy, or wants to use these names.  I'm
>>> CCing Lennart too, in case he has some improbable news on this.
>>> 
>>> So the idea is to align names to what people expect, possibly before
>>> more confusion arises.
>> 
>> OK, crazy idea, not sure if Jens and Tejun will beat me for this, but
>> symlinks?
>> 
>> This way we can a) keep the old files and b) have them point to the new 
>> (a.k.a
>> cfq style) files.
> 
> I did consider that, and that would be doable. But honestly, I'm having a
> hard time seeing what issue we are attempting to fix by doing this.
> 

The problem is that ~100% of people and software believe they are setting
weights, while they actually are not.

Paolo

> -- 
> Jens Axboe



Re: Bisected GFP in bfq_bfqq_expire on v5.1-rc1

2019-04-01 Thread Paolo Valente



> Il giorno 1 apr 2019, alle ore 10:55, Dmitrii Tcvetkov  
> ha scritto:
> 
> On Mon, 1 Apr 2019 09:29:16 +0200
> Paolo Valente  wrote:
>> 
>> 
>>> Il giorno 29 mar 2019, alle ore 15:10, Jens Axboe 
>>> ha scritto:
>>> 
>>> On 3/29/19 7:02 AM, Dmitrii Tcvetkov wrote:
>>>> Hi,
>>>> 
>>>> I got kernel panic since v5.1-rc1 when working with files on block
>>>> device with BFQ scheduler assigned. I didn't find trivial way to
>>>> reproduce the panic but "git checkout origin/linux-5.0.y"
>>>> on linux-stable-rc[1] git repo on btrfs filesystem reproduces the
>>>> problem 100% of the time on my bare-metal machine and in a VM.
>>>> 
>>>> Bisect led me to commit 9dee8b3b057e1 (block, bfq: fix queue
>>>> removal from weights tree). After reverting this commit on top of
>>>> current mainline master(9936328b41ce) I can't reproduce the
>>>> problem.
>>>> 
>>>> dmesg with the panic and bisect log attached.
>>>> 
>>>> [1]
>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
>>> 
>>> Paolo, can you please take a look at this?
>>> 
>>> 
>> 
>> Yep.
>> 
>> Thank you very much, Dmitrii, for also bisecting.  I feel like this
>> failure may be caused by the typo fixed by this patch:
>> https://patchwork.kernel.org/patch/10877113/
>> 
>> Could you please give this fix a try?
> 
> Still reproduces with the patch on top of current mainline
> master(v5.1-rc3).
> 
> Crashes with and without CONFIG_BFQ_GROUP_IOSCHED look same to me.
> Original dmesg was also from kernel with CONFIG_BFQ_GROUP_IOSCHED=n.
> 
> gpf.txt contains crash with the patch and CONFIG_BFQ_GROUP_IOSCHED=n
> gpf-w-bfq-group-iosched.txt - with the patch and CONFIG_BFQ_GROUP_IOSCHED=y
> config.txt - kernel config for the VM with CONFIG_BFQ_GROUP_IOSCHED=n
> 
> 

Ok, thank you. Could you please do a

list *(bfq_bfqq_expire+0x1f3)

for me?

Thanks,
Paolo

> 
> 



Re: [PATCH BUGFIX IMPROVEMENT V2 7/9] block, bfq: print SHARED instead of pid for shared queues in logs

2019-03-11 Thread Paolo Valente



> Il giorno 11 mar 2019, alle ore 10:08, Holger Hoffstätte 
>  ha scritto:
> 
> Hi,
> 
> Guess what - more problems ;-)

The curse of the print SHARED :)

Thank you very much, Holger, for testing what I guiltily did not.  I'll
send a v3 as soon as Francesco fixes this too.

Paolo

> This time when building without CONFIG_BFQ_GROUP_IOSCHED, see below..
> 
> On 3/10/19 7:11 PM, Paolo Valente wrote:
>> From: Francesco Pollicino 
>> The function "bfq_log_bfqq" prints the pid of the process
>> associated with the queue passed as input.
>> Unfortunately, if the queue is shared, then more than one process
>> is associated with the queue. The pid that gets printed in this
>> case is the pid of one of the associated processes.
>> Which process gets printed depends on the exact sequence of merge
>> events the queue underwent. So printing such a pid is rather
>> useless and above all is often rather confusing because it
>> reports a random pid between those of the associated processes.
>> This commit addresses this issue by printing SHARED instead of a pid
>> if the queue is shared.
>> Signed-off-by: Francesco Pollicino 
>> Signed-off-by: Paolo Valente 
>> ---
>>  block/bfq-iosched.c | 10 ++
>>  block/bfq-iosched.h | 23 +++
>>  2 files changed, 29 insertions(+), 4 deletions(-)
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 500b04df9efa..7d95d9c01036 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -2590,6 +2590,16 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct 
>> bfq_io_cq *bic,
>>   *   assignment causes no harm).
>>   */
>>  new_bfqq->bic = NULL;
>> +/*
>> + * If the queue is shared, the pid is the pid of one of the associated
>> + * processes. Which pid depends on the exact sequence of merge events
>> + * the queue underwent. So printing such a pid is useless and confusing
>> + * because it reports a random pid between those of the associated
>> + * processes.
>> + * We mark such a queue with a pid -1, and then print SHARED instead of
>> + * a pid in logging messages.
>> + */
>> +new_bfqq->pid = -1;
>>  bfqq->bic = NULL;
>>  /* release process reference to bfqq */
>>  bfq_put_queue(bfqq);
>> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
>> index 829730b96fb2..6410cc9a064d 100644
>> --- a/block/bfq-iosched.h
>> +++ b/block/bfq-iosched.h
>> @@ -32,6 +32,8 @@
>>  #define BFQ_DEFAULT_GRP_IOPRIO  0
>>  #define BFQ_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
>>  +#define MAX_PID_STR_LENGTH 12
>> +
>>  /*
>>   * Soft real-time applications are extremely more latency sensitive
>>   * than interactive ones. Over-raise the weight of the former to
>> @@ -1016,13 +1018,23 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct 
>> bfq_queue *bfqq);
>>  /* --- end of interface of B-WF2Q+  */
>>/* Logging facilities. */
>> +static void bfq_pid_to_str(int pid, char *str, int len)
>> +{
>> +if (pid != -1)
>> +snprintf(str, len, "%d", pid);
>> +else
>> +snprintf(str, len, "SHARED-");
>> +}
>> +
>>  #ifdef CONFIG_BFQ_GROUP_IOSCHED
>>  struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
>>#define bfq_log_bfqq(bfqd, bfqq, fmt, args...)do {
>> \
>> +char pid_str[MAX_PID_STR_LENGTH];   \
>> +bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);   \
>>  blk_add_cgroup_trace_msg((bfqd)->queue, \
>>  bfqg_to_blkg(bfqq_group(bfqq))->blkcg,  \
>> -"bfq%d%c " fmt, (bfqq)->pid,\
>> +"bfq%s%c " fmt, pid_str,\
>>  bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
>>  } while (0)
>>  @@ -1033,10 +1045,13 @@ struct bfq_group *bfqq_group(struct bfq_queue 
>> *bfqq);
>>#else /* CONFIG_BFQ_GROUP_IOSCHED */
>>  -#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
>> -blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid,   \
>> +#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
>> +char pid_str[MAX_PID_STR_LENGTH];   \
>> +bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);   \
>> +blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str,   \
>>   

[PATCH BUGFIX IMPROVEMENT V2 4/9] block, bfq: do not merge queues on flash storage with queueing

2019-03-10 Thread Paolo Valente
To boost throughput with a set of processes doing interleaved I/O
(i.e., a set of processes whose individual I/O is random, but whose
merged cumulative I/O is sequential), BFQ merges the queues associated
with these processes, i.e., redirects the I/O of these processes into a
common, shared queue. In the shared queue, I/O requests are ordered by
their position on the medium, thus sequential I/O gets dispatched to
the device when the shared queue is served.

Queue merging costs execution time, because, to detect which queues to
merge, BFQ must maintain a list of the head I/O requests of active
queues, ordered by request positions. Measurements showed that this
costs about 10% of BFQ's total per-request processing time.

Request processing time becomes more and more critical as the speed of
the underlying storage device grows. Yet, fortunately, queue merging
is basically useless on the very devices that are so fast as to make
request processing time critical. To reach a high throughput, these
devices must have many requests queued at the same time. But, in this
configuration, the internal scheduling algorithms of these devices do
also the job of queue merging: they reorder requests so as to obtain
as much as possible a sequential I/O pattern. As a consequence, with
processes doing interleaved I/O, the throughput reached by one such
device is likely to be the same, with and without queue merging.

In view of this fact, this commit disables queue merging, and all
related housekeeping, for non-rotational devices with internal
queueing. The total, single-lock-protected, per-request processing
time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(time measured with simple code instrumentation, and using the
throughput-sync.sh script of the S suite [1], in performance-profiling
mode). To put this result into context, the total,
single-lock-protected, per-request execution time of the lightest I/O
scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
~800 LOC, against ~10500 LOC for BFQ).
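
To make the gist of the change easier to follow, here is a minimal,
self-contained sketch of the gating idea. This is not the kernel code:
the struct, the helper name and the way the two device properties are
obtained are assumptions made only for illustration; the flag the patch
actually uses is bfqd->nonrot_with_queueing, visible in the hunks below.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the two device properties this commit cares about. */
struct dev_model {
	bool nonrot;	/* non-rotational (flash) device */
	bool queueing;	/* device keeps many requests in flight internally */
};

/*
 * Queue-merge bookkeeping is worth doing only if the device cannot
 * reorder requests by itself; otherwise its internal scheduler already
 * turns interleaved I/O into a sequential pattern.
 */
static bool merge_bookkeeping_enabled(const struct dev_model *d)
{
	return !(d->nonrot && d->queueing);
}

int main(void)
{
	struct dev_model ssd = { .nonrot = true,  .queueing = true  };
	struct dev_model hdd = { .nonrot = false, .queueing = false };

	printf("SSD with queueing: %d, HDD: %d\n",
	       merge_bookkeeping_enabled(&ssd),
	       merge_bookkeeping_enabled(&hdd));
	return 0;
}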

Disabling merging provides a further, remarkable benefit in terms of
throughput. Merging tends to make many workloads artificially more
uneven, mainly because of shared queues remaining non empty for
incomparably more time than normal queues. So, if, e.g., one of the
queues in a set of merged queues has a higher weight than a normal
queue, then the shared queue may inherit such a high weight and, by
staying almost always active, may force BFQ to perform I/O plugging
most of the time. This evidently makes it harder for BFQ to let the
device reach a high throughput.

As a practical example of this problem, and of the benefits of this
commit, we measured again the throughput in the nasty scenario
considered in previous commit messages: dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes. With
this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
of the other I/O schedulers. As such, this is also likely to be the
maximum possible throughput reachable with this workload on this
device, because I/O is mostly random, and the other schedulers
basically just pass I/O requests to the drive as fast as possible.

[1] https://github.com/Algodev-github/S

Tested-by: Francesco Pollicino 
Signed-off-by: Alessio Masola 
Signed-off-by: Paolo Valente 
---
 block/bfq-cgroup.c  |  3 +-
 block/bfq-iosched.c | 73 +
 block/bfq-iosched.h |  3 ++
 3 files changed, 73 insertions(+), 6 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c6113af31960..2a74a3f2a8f7 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -578,7 +578,8 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue 
*bfqq,
bfqg_and_blkg_get(bfqg);
 
if (bfq_bfqq_busy(bfqq)) {
-   bfq_pos_tree_add_move(bfqd, bfqq);
+   if (unlikely(!bfqd->nonrot_with_queueing))
+   bfq_pos_tree_add_move(bfqd, bfqq);
bfq_activate_bfqq(bfqd, bfqq);
}
 
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 41364c0cca8c..b96be3764b8a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -595,7 +595,16 @@ static bool bfq_too_late_for_merging(struct bfq_queue 
*bfqq)
   bfq_merge_time_limit);
 }
 
-void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+/*
+ * The following function is not marked as __cold because it is
+ * actually cold, but for the same performance goal described in the
+ * comments on the likely() at the beginning of
+ * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
+ * execution time for the case where this function is not invoked, we
+ * had to add an unlikely() in each involved if().
+ */
+void __cold
+bfq_pos_tree_add_m

[PATCH BUGFIX IMPROVEMENT V2 5/9] block, bfq: do not tag totally seeky queues as soft rt

2019-03-10 Thread Paolo Valente
Sync random I/O is likely to be confused with soft real-time I/O,
because it is characterized by limited throughput and apparently
isochronous arrival pattern. To avoid false positives, this commit
prevents bfq_queues containing only random (seeky) I/O from being
tagged as soft real-time.
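
As a toy model of the policy just described (not the macro added by the
patch; the 32-bit history width, the names and the shift-based update
are assumptions used only to illustrate the behavior the commit message
describes):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Each completed request shifts one bit into a per-queue history,
 * 1 meaning "seeky". A queue is considered totally seeky when every
 * recorded request was seeky, and such a queue is excluded from
 * soft real-time weight raising.
 */
static bool totally_seeky(uint32_t seek_history)
{
	return seek_history == UINT32_MAX;
}

int main(void)
{
	uint32_t h = 0;
	int i;

	for (i = 0; i < 32; i++)
		h = (h << 1) | 1;		/* 32 seeky requests in a row */

	printf("%d\n", totally_seeky(h));	/* 1: only seeky I/O seen   */
	printf("%d\n", totally_seeky(h << 1));	/* 0: latest was sequential */
	return 0;
}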

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b96be3764b8a..d34b80e5c47d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -242,6 +242,14 @@ static struct kmem_cache *bfq_pool;
  blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 19)
+/*
+ * Sync random I/O is likely to be confused with soft real-time I/O,
+ * because it is characterized by limited throughput and apparently
+ * isochronous arrival pattern. To avoid false positives, queues
+ * containing only random (seeky) I/O are prevented from being tagged
+ * as soft real-time.
+ */
+#define BFQQ_TOTALLY_SEEKY(bfqq)   (bfqq->seek_history & -1)
 
 /* Min number of samples required to perform peak-rate update */
 #define BFQ_RATE_MIN_SAMPLES   32
@@ -1622,6 +1630,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct 
bfq_data *bfqd,
 */
in_burst = bfq_bfqq_in_large_burst(bfqq);
soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+   !BFQQ_TOTALLY_SEEKY(bfqq) &&
!in_burst &&
time_is_before_jiffies(bfqq->soft_rt_next_start) &&
bfqq->dispatched == 0;
@@ -4816,6 +4825,11 @@ bfq_update_io_seektime(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
 {
bfqq->seek_history <<= 1;
bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
+
+   if (bfqq->wr_coeff > 1 &&
+   bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+   BFQQ_TOTALLY_SEEKY(bfqq))
+   bfq_bfqq_end_wr(bfqq);
 }
 
 static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT V2 8/9] block, bfq: save & resume weight on a queue merge/split

2019-03-10 Thread Paolo Valente
From: Francesco Pollicino 

bfq saves the state of a queue each time a merge occurs, to be
able to resume such a state when the queue is associated again
with its original process, on a split.

Unfortunately bfq does not save & restore also the weight of the
queue. If the weight is not correctly resumed when the queue is
recycled, then the weight of the recycled queue could differ
from the weight of the original queue.

This commit adds the missing save & resume of the weight.

Signed-off-by: Francesco Pollicino 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 ++
 block/bfq-iosched.h | 9 +
 2 files changed, 11 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7d95d9c01036..1712d12340c0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1028,6 +1028,7 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
else
bfq_clear_bfqq_IO_bound(bfqq);
 
+   bfqq->entity.new_weight = bic->saved_weight;
bfqq->ttime = bic->saved_ttime;
bfqq->wr_coeff = bic->saved_wr_coeff;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
@@ -2502,6 +2503,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
if (!bic)
return;
 
+   bic->saved_weight = bfqq->entity.orig_weight;
bic->saved_ttime = bfqq->ttime;
bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 6410cc9a064d..1e34cce59ba7 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -404,6 +404,15 @@ struct bfq_io_cq {
 */
bool was_in_burst_list;
 
+   /*
+* Save the weight when a merge occurs, to be able
+* to restore it in case of split. If the weight is not
+* correctly resumed when the queue is recycled,
+* then the weight of the recycled queue could differ
+* from the weight of the original queue.
+*/
+   unsigned int saved_weight;
+
/*
 * Similar to previous fields: save wr information.
 */
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT V2 6/9] block, bfq: always protect newly-created queues from existing active queues

2019-03-10 Thread Paolo Valente
If many bfq_queues belonging to the same group happen to be created
shortly after each other, then the processes associated with these
queues have typically a common goal. In particular, bursts of queue
creations are usually caused by services or applications that spawn
many parallel threads/processes. Examples are systemd during boot, or
git grep. If there are no other active queues, then, to help these
processes get their job done as soon as possible, the best thing to do
is to reach a high throughput. To this goal, it is usually better to
not grant either weight-raising or device idling to the queues
associated with these processes. And this is exactly what BFQ
currently does.

There is however a drawback: if, in contrast, some other queues are
already active, then the newly created queues must be protected from
the I/O flowing through the already existing queues. In this case, the
best thing to do is the opposite as in the other case: it is much
better to grant weight-raising and device idling to the newly-created
queues, if they deserve it. This commit addresses this issue by doing
so if there are already other active queues.

This change also helps eliminating false positives, which occur when
the newly-created queues do not belong to an actual large burst of
creations, but some background task (e.g., a service) happens to
trigger the creation of new queues in the middle, i.e., very close to
when the victim queues are created. These false positives may cause a
total loss of control over process latencies.
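
A rough, self-contained model of the resulting policy may help (this is
not the kernel code; the function and parameter names are assumptions
for illustration, while the real check can be seen below in the
bfq_reset_burst_list() hunk, through bfq_tot_busy_queues()):

#include <stdbool.h>
#include <stdio.h>

/*
 * A newly created queue is handled as part of a possible burst (and
 * thus denied weight raising and idling) only if every already-active
 * queue belongs to that same burst; otherwise the new queue keeps the
 * protection granted to newly created queues.
 */
static bool deny_protection(unsigned int tot_busy_queues,
			    unsigned int busy_in_same_burst)
{
	return tot_busy_queues <= busy_in_same_burst;
}

int main(void)
{
	/* boot-like burst: all active queues belong to the burst itself */
	printf("%d\n", deny_protection(5, 5));	/* 1: no protection needed */

	/* a victim workload is already running: protect the new queues */
	printf("%d\n", deny_protection(6, 5));	/* 0: grant protection */
	return 0;
}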

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 64 -
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index d34b80e5c47d..500b04df9efa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1075,8 +1075,18 @@ static void bfq_reset_burst_list(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
 
	hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
		hlist_del_init(&item->burst_list_node);
-	hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
-   bfqd->burst_size = 1;
+
+   /*
+* Start the creation of a new burst list only if there is no
+* active queue. See comments on the conditional invocation of
+* bfq_handle_burst().
+*/
+   if (bfq_tot_busy_queues(bfqd) == 0) {
+       hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
+   bfqd->burst_size = 1;
+   } else
+   bfqd->burst_size = 0;
+
bfqd->burst_parent_entity = bfqq->entity.parent;
 }
 
@@ -1132,7 +1142,8 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
  * many parallel threads/processes. Examples are systemd during boot,
  * or git grep. To help these processes get their job done as soon as
  * possible, it is usually better to not grant either weight-raising
- * or device idling to their queues.
+ * or device idling to their queues, unless these queues must be
+ * protected from the I/O flowing through other active queues.
  *
  * In this comment we describe, firstly, the reasons why this fact
  * holds, and, secondly, the next function, which implements the main
@@ -1144,7 +1155,10 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
  * cumulatively served, the sooner the target job of these queues gets
  * completed. As a consequence, weight-raising any of these queues,
  * which also implies idling the device for it, is almost always
- * counterproductive. In most cases it just lowers throughput.
+ * counterproductive, unless there are other active queues to isolate
+ * these new queues from. If there no other active queues, then
+ * weight-raising these new queues just lowers throughput in most
+ * cases.
  *
  * On the other hand, a burst of queue creations may be caused also by
  * the start of an application that does not consist of a lot of
@@ -1178,14 +1192,16 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
  * are very rare. They typically occur if some service happens to
  * start doing I/O exactly when the interactive task starts.
  *
- * Turning back to the next function, it implements all the steps
- * needed to detect the occurrence of a large burst and to properly
- * mark all the queues belonging to it (so that they can then be
- * treated in a different way). This goal is achieved by maintaining a
- * "burst list" that holds, temporarily, the queues that belong to the
- * burst in progress. The list is then used to mark these queues as
- * belonging to a large burst if the burst does become large. The main
- * steps are the following.
+ * Turning back to the next function, it is invoked only if there are
+ * no active queues (apart from active queues that would belong to the
+ * same, possible burst bfqq would belong to), and it implements all
+ * the steps neede

[PATCH BUGFIX IMPROVEMENT V2 9/9] doc, block, bfq: add information on bfq execution time

2019-03-10 Thread Paolo Valente
The execution time of BFQ has been slightly lowered. Report the new
execution time in BFQ documentation.

Signed-off-by: Paolo Valente 
---
 Documentation/block/bfq-iosched.txt | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/Documentation/block/bfq-iosched.txt 
b/Documentation/block/bfq-iosched.txt
index 98a8dd5ee385..1a0f2ac02eb6 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -20,13 +20,26 @@ for that device, by setting low_latency to 0. See Section 3 
for
 details on how to configure BFQ for the desired tradeoff between
 latency and throughput, or on how to maximize throughput.
 
-BFQ has a non-null overhead, which limits the maximum IOPS that a CPU
-can process for a device scheduled with BFQ. To give an idea of the
-limits on slow or average CPUs, here are, first, the limits of BFQ for
-three different CPUs, on, respectively, an average laptop, an old
-desktop, and a cheap embedded system, in case full hierarchical
-support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but
-CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2):
+As every I/O scheduler, BFQ adds some overhead to per-I/O-request
+processing. To give an idea of this overhead, the total,
+single-lock-protected, per-request processing time of BFQ---i.e., the
+sum of the execution times of the request insertion, dispatch and
+completion hooks---is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
+(dated CPU for notebooks; time measured with simple code
+instrumentation, and using the throughput-sync.sh script of the S
+suite [1], in performance-profiling mode). To put this result into
+context, the total, single-lock-protected, per-request execution time
+of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
+us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).
+
+Scheduling overhead further limits the maximum IOPS that a CPU can
+process (already limited by the execution of the rest of the I/O
+stack). To give an idea of the limits with BFQ, on slow or average
+CPUs, here are, first, the limits of BFQ for three different CPUs, on,
+respectively, an average laptop, an old desktop, and a cheap embedded
+system, in case full hierarchical support is enabled (i.e.,
+CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not
+set (Section 4-2):
 - Intel i7-4850HQ: 400 KIOPS
 - AMD A8-3850: 250 KIOPS
 - ARM CortexTM-A53 Octa-core: 80 KIOPS
@@ -566,3 +579,5 @@ applications. Unset this tunable if you need/want to 
control weights.
 Slightly extended version:
 http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
+
+[3] https://github.com/Algodev-github/S
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT V2 7/9] block, bfq: print SHARED instead of pid for shared queues in logs

2019-03-10 Thread Paolo Valente
From: Francesco Pollicino 

The function "bfq_log_bfqq" prints the pid of the process
associated with the queue passed as input.

Unfortunately, if the queue is shared, then more than one process
is associated with the queue. The pid that gets printed in this
case is the pid of one of the associated processes.
Which process gets printed depends on the exact sequence of merge
events the queue underwent. So printing such a pid is rather
useless and above all is often rather confusing because it
reports a random pid between those of the associated processes.

This commit addresses this issue by printing SHARED instead of a pid
if the queue is shared.

Signed-off-by: Francesco Pollicino 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 10 ++
 block/bfq-iosched.h | 23 +++
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 500b04df9efa..7d95d9c01036 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2590,6 +2590,16 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq 
*bic,
 *   assignment causes no harm).
 */
new_bfqq->bic = NULL;
+   /*
+* If the queue is shared, the pid is the pid of one of the associated
+* processes. Which pid depends on the exact sequence of merge events
+* the queue underwent. So printing such a pid is useless and confusing
+* because it reports a random pid between those of the associated
+* processes.
+* We mark such a queue with a pid -1, and then print SHARED instead of
+* a pid in logging messages.
+*/
+   new_bfqq->pid = -1;
bfqq->bic = NULL;
/* release process reference to bfqq */
bfq_put_queue(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 829730b96fb2..6410cc9a064d 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -32,6 +32,8 @@
 #define BFQ_DEFAULT_GRP_IOPRIO 0
 #define BFQ_DEFAULT_GRP_CLASS  IOPRIO_CLASS_BE
 
+#define MAX_PID_STR_LENGTH 12
+
 /*
  * Soft real-time applications are extremely more latency sensitive
  * than interactive ones. Over-raise the weight of the former to
@@ -1016,13 +1018,23 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct 
bfq_queue *bfqq);
 /* --- end of interface of B-WF2Q+  */
 
 /* Logging facilities. */
+static void bfq_pid_to_str(int pid, char *str, int len)
+{
+   if (pid != -1)
+   snprintf(str, len, "%d", pid);
+   else
+   snprintf(str, len, "SHARED-");
+}
+
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do {\
+   char pid_str[MAX_PID_STR_LENGTH];   \
+   bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);   \
blk_add_cgroup_trace_msg((bfqd)->queue, \
bfqg_to_blkg(bfqq_group(bfqq))->blkcg,  \
-   "bfq%d%c " fmt, (bfqq)->pid,\
+   "bfq%s%c " fmt, pid_str,\
bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
 } while (0)
 
@@ -1033,10 +1045,13 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #else /* CONFIG_BFQ_GROUP_IOSCHED */
 
-#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
-   blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid,   \
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do {\
+   char pid_str[MAX_PID_STR_LENGTH];   \
+   bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);   \
+   blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str,   \
bfq_bfqq_sync((bfqq)) ? 'S' : 'A',  \
-   ##args)
+   ##args) \
+} while (0)
 #define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {} while (0)
 
 #endif /* CONFIG_BFQ_GROUP_IOSCHED */
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT V2 3/9] block, bfq: tune service injection basing on request service times

2019-03-10 Thread Paolo Valente
The processes associated with a bfq_queue, say Q, may happen to
generate their cumulative I/O at a lower rate than the rate at which
the device could serve the same I/O. This is rather probable, e.g., if
only one process is associated with Q and the device is an SSD. It
results in Q becoming often empty while in service. If BFQ is not
allowed to switch to another queue when Q becomes empty, then, during
the service of Q, there will be frequent "service holes", i.e., time
intervals during which Q gets empty and the device can only consume
the I/O already queued in its hardware queues. This easily causes
considerable losses of throughput.

To counter this problem, BFQ implements a request injection mechanism,
which tries to fill the above service holes with I/O requests taken
from other bfq_queues. The hard part in this mechanism is finding the
right amount of I/O to inject, so as to both boost throughput and not
break Q's bandwidth and latency guarantees. To this goal, the current
version of this mechanism measures the bandwidth enjoyed by Q while it
is being served, and tries to inject the maximum possible amount of
extra service that does not cause Q's bandwidth to decrease too
much.

This solution has an important shortcoming. For bandwidth measurements
to be stable and reliable, Q must remain in service for a much longer
time than that needed to serve a single I/O request. Unfortunately,
this does not hold with many workloads. This commit addresses this
issue by changing the way the amount of injection allowed is
dynamically computed. It tunes injection as a function of the service
times of single I/O requests of Q, instead of Q's
bandwidth. Single-request service times are evidently meaningful even
if Q gets very few I/O requests completed while it is in service.
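
For readers who prefer code to prose, here is a tiny userspace model of
the limit-update idea (this is not the BFQ implementation: the field
names, the 10% tolerance and the exact update rule are assumptions
chosen only to illustrate the mechanism described above):

#include <stdio.h>

struct inject_state {
	unsigned int limit;		/* max requests injected while Q waits */
	unsigned long long base_ns;	/* baseline service time, no injection */
};

static void update_inject_limit(struct inject_state *s,
				unsigned long long serv_time_ns,
				unsigned int injected_in_flight)
{
	if (s->base_ns == 0 && injected_in_flight == 0) {
		s->base_ns = serv_time_ns;	/* learn the baseline first */
		return;
	}

	/* tolerate up to ~10% inflation of the baseline service time */
	if (serv_time_ns <= s->base_ns + s->base_ns / 10)
		s->limit++;		/* injection is harmless: allow more */
	else if (s->limit > 0)
		s->limit--;		/* injection hurts Q: back off */
}

int main(void)
{
	struct inject_state s = { .limit = 0, .base_ns = 0 };

	update_inject_limit(&s, 80000, 0);	/* baseline: 80 us, no injection */
	update_inject_limit(&s, 85000, 1);	/* +6%: raise the limit */
	update_inject_limit(&s, 120000, 2);	/* +50%: lower the limit */
	printf("inject limit = %u\n", s.limit);	/* prints 0 here */
	return 0;
}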

As a testbed for this new solution, we measured the throughput reached
by BFQ for one of the nastiest workloads and configurations for this
scheduler: the workload generated by the dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes.
With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
a PLEXTOR PX-256M5.

Tested-by: Francesco Pollicino 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 417 
 block/bfq-iosched.h |  51 +++---
 2 files changed, 409 insertions(+), 59 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2be504f25b09..41364c0cca8c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1721,6 +1721,123 @@ static void bfq_add_request(struct request *rq)
bfqq->queued[rq_is_sync(rq)]++;
bfqd->queued++;
 
+   if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) {
+   /*
+* Periodically reset inject limit, to make sure that
+* the latter eventually drops in case workload
+* changes, see step (3) in the comments on
+* bfq_update_inject_limit().
+*/
+   if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+msecs_to_jiffies(1000))) {
+   /* invalidate baseline total service time */
+   bfqq->last_serv_time_ns = 0;
+
+   /*
+* Reset pointer in case we are waiting for
+* some request completion.
+*/
+   bfqd->waited_rq = NULL;
+
+   /*
+* If bfqq has a short think time, then start
+* by setting the inject limit to 0
+* prudentially, because the service time of
+* an injected I/O request may be higher than
+* the think time of bfqq, and therefore, if
+* one request was injected when bfqq remains
+* empty, this injected request might delay
+* the service of the next I/O request for
+* bfqq significantly. In case bfqq can
+* actually tolerate some injection, then the
+* adaptive update will however raise the
+* limit soon. This lucky circumstance holds
+* exactly because bfqq has a short think
+* time, and thus, after remaining empty, is
+* likely to get new I/O enqueued---and then
+* completed---before being expired. This is
+* the very pattern that gives the
+* limit-update algorithm the chance to
+* measure the effect of injection on request
+* service times, and then to update the limit
+ 

[PATCH BUGFIX IMPROVEMENT V2 1/9] block, bfq: increase idling for weight-raised queues

2019-03-10 Thread Paolo Valente
If a sync bfq_queue has a higher weight than some other queue, and
remains temporarily empty while in service, then, to preserve the
bandwidth share of the queue, it is necessary to plug I/O dispatching
until a new request arrives for the queue. In addition, a timeout
needs to be set, to avoid waiting for ever if the process associated
with the queue has actually finished its I/O.

Even with the above timeout, the device is however not fed with new
I/O for a while, if the process has finished its I/O. If this happens
often, then throughput drops and latencies grow. For this reason, the
timeout is kept rather low: 8 ms is the current default.

Unfortunately, such a low value may cause, on the opposite end, a
violation of bandwidth guarantees for a process that happens to issue
new I/O too late. The higher the system load, the higher the
probability that this happens to some process. This is a problem in
scenarios where service guarantees matter more than throughput. One
important case is that of weight-raised queues, which need to be granted a
very high fraction of the bandwidth.

To address this issue, this commit lower-bounds the plugging timeout
for weight-raised queues to 20 ms. This simple change provides
relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
gnome-terminal starts in 0.6 seconds if there is no other I/O in
progress, the same application starts in
- 0.8 seconds, instead of 1.2 seconds, if ten files are being read
  sequentially in parallel
- 1 second, instead of 2 seconds, if, in parallel, five files are
  being read sequentially, and five more files are being written
  sequentially

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 4c592496a16a..eb658de3cc40 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2545,6 +2545,8 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
bfq_symmetric_scenario(bfqd))
sl = min_t(u64, sl, BFQ_MIN_TT);
+   else if (bfqq->wr_coeff > 1)
+   sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
 
bfqd->last_idling_start = ktime_get();
hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT V2 2/9] block, bfq: do not idle for lowest-weight queues

2019-03-10 Thread Paolo Valente
In most cases, it is detrimental for throughput to plug I/O dispatch
when the in-service bfq_queue becomes temporarily empty (plugging is
performed to wait for the possible arrival, soon, of new I/O from the
in-service queue). There is however a case where plugging is needed
for service guarantees. If a bfq_queue, say Q, has a higher weight
than some other active bfq_queue, and is sync, i.e., contains sync
I/O, then, to guarantee that Q does receive a higher share of the
throughput than other lower-weight queues, it is necessary to plug I/O
dispatch when Q remains temporarily empty while being served.

For this reason, BFQ performs I/O plugging when some active bfq_queue
has a higher weight than some other active bfq_queue. But this is
unnecessary overkill. In fact, if the in-service bfq_queue actually
has a weight lower than or equal to the other queues, then the queue
*must not* be guaranteed a higher share of the throughput than the
other queues. So, not plugging I/O cannot cause any harm to the
queue. And can boost throughput.

Taking advantage of this fact, this commit does not plug I/O for sync
bfq_queues with a weight lower than or equal to the weights of the
other queues. Here is an example of the resulting throughput boost
with the dbench workload, which is particularly nasty for BFQ. With
the dbench test in the Phoronix suite, BFQ reaches its lowest total
throughput with 6 clients on a filesystem with journaling, in case the
journaling daemon has a higher weight than normal processes. Before
this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
after this commit it is ~100 MB/sec.
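
In essence, the plugging decision that this commit relaxes can be
modeled as follows (illustrative only, not the kernel function shown in
part below; the two weight parameters are assumptions made for the sake
of the example):

#include <stdbool.h>
#include <stdio.h>

/*
 * Plugging I/O dispatch when the in-service sync queue empties is
 * needed for fairness only if that queue must be guaranteed more
 * service than some other active queue, i.e., only if its weight is
 * strictly higher than the smallest weight among the active queues.
 */
static bool needs_dispatch_plugging(unsigned int in_service_weight,
				    unsigned int min_active_weight)
{
	return in_service_weight > min_active_weight;
}

int main(void)
{
	printf("%d\n", needs_dispatch_plugging(300, 100));	/* 1: plug       */
	printf("%d\n", needs_dispatch_plugging(100, 100));	/* 0: don't plug */
	return 0;
}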

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 204 +---
 block/bfq-iosched.h |   6 +-
 block/bfq-wf2q.c|   2 +-
 3 files changed, 118 insertions(+), 94 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index eb658de3cc40..2be504f25b09 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -629,12 +629,19 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct 
bfq_queue *bfqq)
 }
 
 /*
- * The following function returns true if every queue must receive the
- * same share of the throughput (this condition is used when deciding
- * whether idling may be disabled, see the comments in the function
- * bfq_better_to_idle()).
+ * The following function returns false either if every active queue
+ * must receive the same share of the throughput (symmetric scenario),
+ * or, as a special case, if bfqq must receive a share of the
+ * throughput lower than or equal to the share that every other active
+ * queue must receive.  If bfqq does sync I/O, then these are the only
+ * two cases where bfqq happens to be guaranteed its share of the
+ * throughput even if I/O dispatching is not plugged when bfqq remains
+ * temporarily empty (for more details, see the comments in the
+ * function bfq_better_to_idle()). For this reason, the return value
+ * of this function is used to check whether I/O-dispatch plugging can
+ * be avoided.
  *
- * Such a scenario occurs when:
+ * The above first case (symmetric scenario) occurs when:
  * 1) all active queues have the same weight,
  * 2) all active queues belong to the same I/O-priority class,
  * 3) all active groups at the same level in the groups tree have the same
@@ -654,30 +661,36 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct 
bfq_queue *bfqq)
  * support or the cgroups interface are not enabled, thus no state
  * needs to be maintained in this case.
  */
-static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
+static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
+  struct bfq_queue *bfqq)
 {
+   bool smallest_weight = bfqq &&
+   bfqq->weight_counter &&
+   bfqq->weight_counter ==
+   container_of(
+   rb_first_cached(&bfqd->queue_weights_tree),
+   struct bfq_weight_counter,
+   weights_node);
+
/*
 * For queue weights to differ, queue_weights_tree must contain
 * at least two nodes.
 */
-   bool varied_queue_weights = !RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
-   (bfqd->queue_weights_tree.rb_node->rb_left ||
-bfqd->queue_weights_tree.rb_node->rb_right);
+   bool varied_queue_weights = !smallest_weight &&
+   !RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
+   (bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
+bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
 
bool multiple_classes_busy =
(bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
(bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
(bfqd->busy_queues[1] && bfqd->busy_queue

[PATCH BUGFIX IMPROVEMENT V2 0/9] block, bfq: fix bugs, reduce exec time and boost performance

2019-03-10 Thread Paolo Valente
Hi,
this is the v2 of the series
https://lkml.org/lkml/2019/3/7/461
that fixes some bugs affecting performance, reduces execution time a
little bit, and boosts throughput and responsiveness.

The difference w.r.t. v1 is that Francesco has fixed compilation
issues of patch "block, bfq: print SHARED instead of pid for shared
queues in logs".

I took the opportunity of this v2 to also add BFQ's execution time to
the documentation.

Let me remind again that these patches are meant to be applied on top
of the last series I submitted:
https://lkml.org/lkml/2019/1/29/368

Thanks,
Paolo

Francesco Pollicino (2):
  block, bfq: print SHARED instead of pid for shared queues in logs
  block, bfq: save & resume weight on a queue merge/split

Paolo Valente (7):
  block, bfq: increase idling for weight-raised queues
  block, bfq: do not idle for lowest-weight queues
  block, bfq: tune service injection basing on request service times
  block, bfq: do not merge queues on flash storage with queueing
  block, bfq: do not tag totally seeky queues as soft rt
  block, bfq: always protect newly-created queues from existing active
queues
  doc, block, bfq: add information on bfq execution time

 Documentation/block/bfq-iosched.txt |  29 +-
 block/bfq-cgroup.c  |   3 +-
 block/bfq-iosched.c | 786 +++-
 block/bfq-iosched.h |  92 ++--
 block/bfq-wf2q.c|   2 +-
 5 files changed, 729 insertions(+), 183 deletions(-)

--
2.20.1


[PATCH BUGFIX IMPROVEMENT 5/8] block, bfq: do not tag totally seeky queues as soft rt

2019-03-07 Thread Paolo Valente
Sync random I/O is likely to be confused with soft real-time I/O,
because it is characterized by limited throughput and apparently
isochronous arrival pattern. To avoid false positives, this commit
prevents bfq_queues containing only random (seeky) I/O from being
tagged as soft real-time.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b96be3764b8a..d34b80e5c47d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -242,6 +242,14 @@ static struct kmem_cache *bfq_pool;
  blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 19)
+/*
+ * Sync random I/O is likely to be confused with soft real-time I/O,
+ * because it is characterized by limited throughput and apparently
+ * isochronous arrival pattern. To avoid false positives, queues
+ * containing only random (seeky) I/O are prevented from being tagged
+ * as soft real-time.
+ */
+#define BFQQ_TOTALLY_SEEKY(bfqq)   (bfqq->seek_history & -1)
 
 /* Min number of samples required to perform peak-rate update */
 #define BFQ_RATE_MIN_SAMPLES   32
@@ -1622,6 +1630,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct 
bfq_data *bfqd,
 */
in_burst = bfq_bfqq_in_large_burst(bfqq);
soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+   !BFQQ_TOTALLY_SEEKY(bfqq) &&
!in_burst &&
time_is_before_jiffies(bfqq->soft_rt_next_start) &&
bfqq->dispatched == 0;
@@ -4816,6 +4825,11 @@ bfq_update_io_seektime(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
 {
bfqq->seek_history <<= 1;
bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
+
+   if (bfqq->wr_coeff > 1 &&
+   bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+   BFQQ_TOTALLY_SEEKY(bfqq))
+   bfq_bfqq_end_wr(bfqq);
 }
 
 static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 1/8] block, bfq: increase idling for weight-raised queues

2019-03-07 Thread Paolo Valente
If a sync bfq_queue has a higher weight than some other queue, and
remains temporarily empty while in service, then, to preserve the
bandwidth share of the queue, it is necessary to plug I/O dispatching
until a new request arrives for the queue. In addition, a timeout
needs to be set, to avoid waiting for ever if the process associated
with the queue has actually finished its I/O.

Even with the above timeout, the device is however not fed with new
I/O for a while, if the process has finished its I/O. If this happens
often, then throughput drops and latencies grow. For this reason, the
timeout is kept rather low: 8 ms is the current default.

Unfortunately, such a low value may cause, on the opposite end, a
violation of bandwidth guarantees for a process that happens to issue
new I/O too late. The higher the system load, the higher the
probability that this happens to some process. This is a problem in
scenarios where service guarantees matter more than throughput. One
important case is that of weight-raised queues, which need to be granted a
very high fraction of the bandwidth.

To address this issue, this commit lower-bounds the plugging timeout
for weight-raised queues to 20 ms. This simple change provides
relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
gnome-terminal starts in 0.6 seconds if there is no other I/O in
progress, the same application starts in
- 0.8 seconds, instead of 1.2 seconds, if ten files are being read
  sequentially in parallel
- 1 second, instead of 2 seconds, if, in parallel, five files are
  being read sequentially, and five more files are being written
  sequentially

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 4c592496a16a..eb658de3cc40 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2545,6 +2545,8 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
bfq_symmetric_scenario(bfqd))
sl = min_t(u64, sl, BFQ_MIN_TT);
+   else if (bfqq->wr_coeff > 1)
+   sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
 
bfqd->last_idling_start = ktime_get();
hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 8/8] block, bfq: save & resume weight on a queue merge/split

2019-03-07 Thread Paolo Valente
From: Francesco Pollicino 

bfq saves the state of a queue each time a merge occurs, to be
able to resume such a state when the queue is associated again
with its original process, on a split.

Unfortunately bfq does not save & restore also the weight of the
queue. If the weight is not correctly resumed when the queue is
recycled, then the weight of the recycled queue could differ
from the weight of the original queue.

This commit adds the missing save & resume of the weight.

Signed-off-by: Francesco Pollicino 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 ++
 block/bfq-iosched.h | 9 +
 2 files changed, 11 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7d95d9c01036..1712d12340c0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1028,6 +1028,7 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
else
bfq_clear_bfqq_IO_bound(bfqq);
 
+   bfqq->entity.new_weight = bic->saved_weight;
bfqq->ttime = bic->saved_ttime;
bfqq->wr_coeff = bic->saved_wr_coeff;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
@@ -2502,6 +2503,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
if (!bic)
return;
 
+   bic->saved_weight = bfqq->entity.orig_weight;
bic->saved_ttime = bfqq->ttime;
bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 21ddb19cc322..18cc0e996abf 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -404,6 +404,15 @@ struct bfq_io_cq {
 */
bool was_in_burst_list;
 
+   /*
+* Save the weight when a merge occurs, to be able
+* to restore it in case of split. If the weight is not
+* correctly resumed when the queue is recycled,
+* then the weight of the recycled queue could differ
+* from the weight of the original queue.
+*/
+   unsigned int saved_weight;
+
/*
 * Similar to previous fields: save wr information.
 */
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 7/8] block, bfq: print SHARED instead of pid for shared queues in logs

2019-03-07 Thread Paolo Valente
From: Francesco Pollicino 

The function "bfq_log_bfqq" prints the pid of the process
associated with the queue passed as input.

Unfortunately, if the queue is shared, then more than one process
is associated with the queue. The pid that gets printed in this
case is the pid of one of the associated processes.
Which process gets printed depends on the exact sequence of merge
events the queue underwent. So printing such a pid is rather
useless and above all is often rather confusing because it
reports a random pid between those of the associated processes.

This commit addresses this issue by printing SHARED instead of a pid
if the queue is shared.

Signed-off-by: Francesco Pollicino 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 10 ++
 block/bfq-iosched.h | 18 --
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 500b04df9efa..7d95d9c01036 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2590,6 +2590,16 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq 
*bic,
 *   assignment causes no harm).
 */
new_bfqq->bic = NULL;
+   /*
+* If the queue is shared, the pid is the pid of one of the associated
+* processes. Which pid depends on the exact sequence of merge events
+* the queue underwent. So printing such a pid is useless and confusing
+* because it reports a random pid between those of the associated
+* processes.
+* We mark such a queue with a pid -1, and then print SHARED instead of
+* a pid in logging messages.
+*/
+   new_bfqq->pid = -1;
bfqq->bic = NULL;
/* release process reference to bfqq */
bfq_put_queue(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 829730b96fb2..21ddb19cc322 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -32,6 +32,8 @@
 #define BFQ_DEFAULT_GRP_IOPRIO 0
 #define BFQ_DEFAULT_GRP_CLASS  IOPRIO_CLASS_BE
 
+#define MAX_PID_STR_LENGTH 12
+
 /*
  * Soft real-time applications are extremely more latency sensitive
  * than interactive ones. Over-raise the weight of the former to
@@ -1016,13 +1018,23 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct 
bfq_queue *bfqq);
 /* --- end of interface of B-WF2Q+  */
 
 /* Logging facilities. */
+void bfq_pid_to_str(int pid, char *str, int len)
+{
+   if (pid != -1)
+   snprintf(str, len, "%d", pid);
+   else
+   snprintf(str, len, "SHARED-");
+}
+
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do {\
+   char pid_str[MAX_PID_STR_LENGTH];   \
+   bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);   \
blk_add_cgroup_trace_msg((bfqd)->queue, \
bfqg_to_blkg(bfqq_group(bfqq))->blkcg,  \
-   "bfq%d%c " fmt, (bfqq)->pid,\
+   "bfq%s%c " fmt, pid_str,\
bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
 } while (0)
 
@@ -1034,7 +1046,9 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 #else /* CONFIG_BFQ_GROUP_IOSCHED */
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
-   blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid,   \
+   char pid_str[MAX_PID_STR_LENGTH];   \
+   bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);   \
+   blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str,   \
bfq_bfqq_sync((bfqq)) ? 'S' : 'A',  \
##args)
 #define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {} while (0)
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 2/8] block, bfq: do not idle for lowest-weight queues

2019-03-07 Thread Paolo Valente
In most cases, it is detrimental for throughput to plug I/O dispatch
when the in-service bfq_queue becomes temporarily empty (plugging is
performed to wait for the possible arrival, soon, of new I/O from the
in-service queue). There is however a case where plugging is needed
for service guarantees. If a bfq_queue, say Q, has a higher weight
than some other active bfq_queue, and is sync, i.e., contains sync
I/O, then, to guarantee that Q does receive a higher share of the
throughput than other lower-weight queues, it is necessary to plug I/O
dispatch when Q remains temporarily empty while being served.

For this reason, BFQ performs I/O plugging when some active bfq_queue
has a higher weight than some other active bfq_queue. But this is
unnecessary overkill. In fact, if the in-service bfq_queue actually
has a weight lower than or equal to the other queues, then the queue
*must not* be guaranteed a higher share of the throughput than the
other queues. So, not plugging I/O cannot cause any harm to the
queue. And can boost throughput.

Taking advantage of this fact, this commit does not plug I/O for sync
bfq_queues with a weight lower than or equal to the weights of the
other queues. Here is an example of the resulting throughput boost
with the dbench workload, which is particularly nasty for BFQ. With
the dbench test in the Phoronix suite, BFQ reaches its lowest total
throughput with 6 clients on a filesystem with journaling, in case the
journaling daemon has a higher weight than normal processes. Before
this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
after this commit it is ~100 MB/sec.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 204 +---
 block/bfq-iosched.h |   6 +-
 block/bfq-wf2q.c|   2 +-
 3 files changed, 118 insertions(+), 94 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index eb658de3cc40..2be504f25b09 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -629,12 +629,19 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct 
bfq_queue *bfqq)
 }
 
 /*
- * The following function returns true if every queue must receive the
- * same share of the throughput (this condition is used when deciding
- * whether idling may be disabled, see the comments in the function
- * bfq_better_to_idle()).
+ * The following function returns false either if every active queue
+ * must receive the same share of the throughput (symmetric scenario),
+ * or, as a special case, if bfqq must receive a share of the
+ * throughput lower than or equal to the share that every other active
+ * queue must receive.  If bfqq does sync I/O, then these are the only
+ * two cases where bfqq happens to be guaranteed its share of the
+ * throughput even if I/O dispatching is not plugged when bfqq remains
+ * temporarily empty (for more details, see the comments in the
+ * function bfq_better_to_idle()). For this reason, the return value
+ * of this function is used to check whether I/O-dispatch plugging can
+ * be avoided.
  *
- * Such a scenario occurs when:
+ * The above first case (symmetric scenario) occurs when:
  * 1) all active queues have the same weight,
  * 2) all active queues belong to the same I/O-priority class,
  * 3) all active groups at the same level in the groups tree have the same
@@ -654,30 +661,36 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct 
bfq_queue *bfqq)
  * support or the cgroups interface are not enabled, thus no state
  * needs to be maintained in this case.
  */
-static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
+static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
+  struct bfq_queue *bfqq)
 {
+   bool smallest_weight = bfqq &&
+   bfqq->weight_counter &&
+   bfqq->weight_counter ==
+   container_of(
+   rb_first_cached(&bfqd->queue_weights_tree),
+   struct bfq_weight_counter,
+   weights_node);
+
/*
 * For queue weights to differ, queue_weights_tree must contain
 * at least two nodes.
 */
-   bool varied_queue_weights = !RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
-   (bfqd->queue_weights_tree.rb_node->rb_left ||
-bfqd->queue_weights_tree.rb_node->rb_right);
+   bool varied_queue_weights = !smallest_weight &&
+   !RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
+   (bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
+bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
 
bool multiple_classes_busy =
(bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
(bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
(bfqd->busy_queues[1] && bfqd->busy_queue

[PATCH BUGFIX IMPROVEMENT 4/8] block, bfq: do not merge queues on flash storage with queueing

2019-03-07 Thread Paolo Valente
To boost throughput with a set of processes doing interleaved I/O
(i.e., a set of processes whose individual I/O is random, but whose
merged cumulative I/O is sequential), BFQ merges the queues associated
with these processes, i.e., redirects the I/O of these processes into a
common, shared queue. In the shared queue, I/O requests are ordered by
their position on the medium, thus sequential I/O gets dispatched to
the device when the shared queue is served.

Queue merging costs execution time, because, to detect which queues to
merge, BFQ must maintain a list of the head I/O requests of active
queues, ordered by request positions. Measurements showed that this
costs about 10% of BFQ's total per-request processing time.

Request processing time becomes more and more critical as the speed of
the underlying storage device grows. Yet, fortunately, queue merging
is basically useless on the very devices that are so fast as to make
request processing time critical. To reach a high throughput, these
devices must have many requests queued at the same time. But, in this
configuration, the internal scheduling algorithms of these devices
also do the job of queue merging: they reorder requests so as to
obtain an I/O pattern that is as sequential as possible. As a
consequence, with
processes doing interleaved I/O, the throughput reached by one such
device is likely to be the same, with and without queue merging.
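
The resulting gating condition can be modeled in isolation with a
stand-alone toy sketch. The flag name nonrot_with_queueing mirrors the
field used by the patch below; the struct and helper here are purely
illustrative, not the in-kernel code:

#include <stdbool.h>
#include <stdio.h>

/* Toy model: the two device properties the heuristic relies on. */
struct toy_device {
	bool nonrot;   /* non-rotational (flash) storage     */
	bool hw_tag;   /* internal command queueing detected */
};

/* Merging (and its housekeeping) is worth the cost only when the
 * device cannot reorder requests internally by itself. */
static bool worth_merging_queues(const struct toy_device *d)
{
	bool nonrot_with_queueing = d->nonrot && d->hw_tag;

	return !nonrot_with_queueing;
}

int main(void)
{
	struct toy_device ssd_ncq = { .nonrot = true,  .hw_tag = true };
	struct toy_device rot_hdd = { .nonrot = false, .hw_tag = true };

	printf("SSD with NCQ: merge? %d\n", worth_merging_queues(&ssd_ncq));
	printf("Rotational disk: merge? %d\n", worth_merging_queues(&rot_hdd));
	return 0;
}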

In view of this fact, this commit disables queue merging, and all
related housekeeping, for non-rotational devices with internal
queueing. The total, single-lock-protected, per-request processing
time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(time measured with simple code instrumentation, and using the
throughput-sync.sh script of the S suite [1], in performance-profiling
mode). To put this result into context, the total,
single-lock-protected, per-request execution time of the lightest I/O
scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
~800 LOC, against ~10500 LOC for BFQ).

Disabling merging provides a further, remarkable benefit in terms of
throughput. Merging tends to make many workloads artificially more
uneven, mainly because of shared queues remaining non empty for
incomparably more time than normal queues. So, if, e.g., one of the
queues in a set of merged queues has a higher weight than a normal
queue, then the shared queue may inherit such a high weight and, by
staying almost always active, may force BFQ to perform I/O plugging
most of the time. This evidently makes it harder for BFQ to let the
device reach a high throughput.

As a practical example of this problem, and of the benefits of this
commit, we measured again the throughput in the nasty scenario
considered in previous commit messages: dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes. With
this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
of the other I/O schedulers. As such, this is also likely to be the
maximum possible throughput reachable with this workload on this
device, because I/O is mostly random, and the other schedulers
basically just pass I/O requests to the drive as fast as possible.

[1] https://github.com/Algodev-github/S

Tested-by: Francesco Pollicino 
Signed-off-by: Alessio Masola 
Signed-off-by: Paolo Valente 
---
 block/bfq-cgroup.c  |  3 +-
 block/bfq-iosched.c | 73 +
 block/bfq-iosched.h |  3 ++
 3 files changed, 73 insertions(+), 6 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c6113af31960..2a74a3f2a8f7 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -578,7 +578,8 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue 
*bfqq,
bfqg_and_blkg_get(bfqg);
 
if (bfq_bfqq_busy(bfqq)) {
-   bfq_pos_tree_add_move(bfqd, bfqq);
+   if (unlikely(!bfqd->nonrot_with_queueing))
+   bfq_pos_tree_add_move(bfqd, bfqq);
bfq_activate_bfqq(bfqd, bfqq);
}
 
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 41364c0cca8c..b96be3764b8a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -595,7 +595,16 @@ static bool bfq_too_late_for_merging(struct bfq_queue 
*bfqq)
   bfq_merge_time_limit);
 }
 
-void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+/*
+ * The following function is not marked as __cold because it is
+ * actually cold, but for the same performance goal described in the
+ * comments on the likely() at the beginning of
+ * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
+ * execution time for the case where this function is not invoked, we
+ * had to add an unlikely() in each involved if().
+ */
+void __cold
bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)

[PATCH BUGFIX IMPROVEMENT 6/8] block, bfq: always protect newly-created queues from existing active queues

2019-03-07 Thread Paolo Valente
If many bfq_queues belonging to the same group happen to be created
shortly after each other, then the processes associated with these
queues have typically a common goal. In particular, bursts of queue
creations are usually caused by services or applications that spawn
many parallel threads/processes. Examples are systemd during boot, or
git grep. If there are no other active queues, then, to help these
processes get their job done as soon as possible, the best thing to do
is to reach a high throughput. To this goal, it is usually better to
not grant either weight-raising or device idling to the queues
associated with these processes. And this is exactly what BFQ
currently does.

There is however a drawback: if, in contrast, some other queues are
already active, then the newly created queues must be protected from
the I/O flowing through the already existing queues. In this case, the
best thing to do is the opposite as in the other case: it is much
better to grant weight-raising and device idling to the newly-created
queues, if they deserve it. This commit addresses this issue by doing
so if there are already other active queues.
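
A minimal stand-alone model of this decision is sketched below; the
function name and the parameter are illustrative only, the point is
just the policy described above (protect newly-created queues only if
other, unrelated queues are already doing I/O):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy decision model: queues created in a burst get neither
 * weight-raising nor idling, *unless* other queues are already active
 * and the new queues must be protected from their I/O.
 */
static bool grant_protection(unsigned int busy_queues_outside_burst)
{
	return busy_queues_outside_burst > 0;
}

int main(void)
{
	printf("boot-like burst, idle system: protect? %d\n",
	       grant_protection(0));  /* go for throughput instead      */
	printf("burst while other I/O runs:  protect? %d\n",
	       grant_protection(3));  /* weight-raise/idle if deserved  */
	return 0;
}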

This change also helps eliminate false positives, which occur when
the newly-created queues do not belong to an actual large burst of
creations, but some background task (e.g., a service) happens to
trigger the creation of new queues in the middle, i.e., very close to
when the victim queues are created. These false positives may cause
a total loss of control on process latencies.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 64 -
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index d34b80e5c47d..500b04df9efa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1075,8 +1075,18 @@ static void bfq_reset_burst_list(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
 
hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
hlist_del_init(&item->burst_list_node);
-   hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
-   bfqd->burst_size = 1;
+
+   /*
+* Start the creation of a new burst list only if there is no
+* active queue. See comments on the conditional invocation of
+* bfq_handle_burst().
+*/
+   if (bfq_tot_busy_queues(bfqd) == 0) {
+   hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
+   bfqd->burst_size = 1;
+   } else
+   bfqd->burst_size = 0;
+
bfqd->burst_parent_entity = bfqq->entity.parent;
 }
 
@@ -1132,7 +1142,8 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
  * many parallel threads/processes. Examples are systemd during boot,
  * or git grep. To help these processes get their job done as soon as
  * possible, it is usually better to not grant either weight-raising
- * or device idling to their queues.
+ * or device idling to their queues, unless these queues must be
+ * protected from the I/O flowing through other active queues.
  *
  * In this comment we describe, firstly, the reasons why this fact
  * holds, and, secondly, the next function, which implements the main
@@ -1144,7 +1155,10 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
  * cumulatively served, the sooner the target job of these queues gets
  * completed. As a consequence, weight-raising any of these queues,
  * which also implies idling the device for it, is almost always
- * counterproductive. In most cases it just lowers throughput.
+ * counterproductive, unless there are other active queues to isolate
+ * these new queues from. If there are no other active queues, then
+ * weight-raising these new queues just lowers throughput in most
+ * cases.
  *
  * On the other hand, a burst of queue creations may be caused also by
  * the start of an application that does not consist of a lot of
@@ -1178,14 +1192,16 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
  * are very rare. They typically occur if some service happens to
  * start doing I/O exactly when the interactive task starts.
  *
- * Turning back to the next function, it implements all the steps
- * needed to detect the occurrence of a large burst and to properly
- * mark all the queues belonging to it (so that they can then be
- * treated in a different way). This goal is achieved by maintaining a
- * "burst list" that holds, temporarily, the queues that belong to the
- * burst in progress. The list is then used to mark these queues as
- * belonging to a large burst if the burst does become large. The main
- * steps are the following.
+ * Turning back to the next function, it is invoked only if there are
+ * no active queues (apart from active queues that would belong to the
+ * same, possible burst bfqq would belong to), and it implements all
+ * the steps needed

[PATCH BUGFIX IMPROVEMENT 3/8] block, bfq: tune service injection basing on request service times

2019-03-07 Thread Paolo Valente
The processes associated with a bfq_queue, say Q, may happen to
generate their cumulative I/O at a lower rate than the rate at which
the device could serve the same I/O. This is rather probable, e.g., if
only one process is associated with Q and the device is an SSD. It
results in Q becoming often empty while in service. If BFQ is not
allowed to switch to another queue when Q becomes empty, then, during
the service of Q, there will be frequent "service holes", i.e., time
intervals during which Q gets empty and the device can only consume
the I/O already queued in its hardware queues. This easily causes
considerable losses of throughput.

To counter this problem, BFQ implements a request injection mechanism,
which tries to fill the above service holes with I/O requests taken
from other bfq_queues. The hard part in this mechanism is finding the
right amount of I/O to inject, so as to both boost throughput and not
break Q's bandwidth and latency guarantees. To this goal, the current
version of this mechanism measures the bandwidth enjoyed by Q while it
is being served, and tries to inject the maximum possible amount of
extra service that does not cause Q's bandwidth to decrease too
much.

This solution has an important shortcoming. For bandwidth measurements
to be stable and reliable, Q must remain in service for a much longer
time than that needed to serve a single I/O request. Unfortunately,
this does not hold with many workloads. This commit addresses this
issue by changing the way the amount of injection allowed is
dynamically computed. It tunes injection as a function of the service
times of single I/O requests of Q, instead of Q's
bandwidth. Single-request service times are evidently meaningful even
if Q gets very few I/O requests completed while it is in service.
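
One possible shape of such a controller is sketched below. This is a
deliberately simplified, stand-alone toy: it keeps a baseline service
time measured with no injection and raises or lowers the allowed
amount of injected I/O depending on how much the observed per-request
service time degrades. The names, the 10% margin and the update rule
are illustrative; the actual in-kernel algorithm introduced by this
patch (bfq_update_inject_limit()) is more sophisticated:

#include <stdio.h>

struct toy_inject_state {
	unsigned long long baseline_ns;  /* service time with limit == 0 */
	unsigned int limit;              /* max requests injected        */
};

static void toy_update_limit(struct toy_inject_state *s,
			     unsigned long long observed_ns)
{
	if (s->baseline_ns == 0) {          /* first sample: set baseline */
		s->baseline_ns = observed_ns;
		return;
	}
	if (observed_ns <= s->baseline_ns + s->baseline_ns / 10)
		s->limit++;                 /* no harm seen: allow more   */
	else if (s->limit > 0)
		s->limit--;                 /* service degraded: back off */
}

int main(void)
{
	struct toy_inject_state s = { 0, 0 };
	unsigned long long samples[] = { 100000, 105000, 104000, 140000 };

	for (unsigned int i = 0; i < 4; i++) {
		toy_update_limit(&s, samples[i]);
		printf("sample %llu ns -> limit %u\n", samples[i], s.limit);
	}
	return 0;
}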

As a testbed for this new solution, we measured the throughput reached
by BFQ for one of the nastiest workloads and configurations for this
scheduler: the workload generated by the dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes.
With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
a PLEXTOR PX-256M5.

Tested-by: Francesco Pollicino 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 417 
 block/bfq-iosched.h |  51 +++---
 2 files changed, 409 insertions(+), 59 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2be504f25b09..41364c0cca8c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1721,6 +1721,123 @@ static void bfq_add_request(struct request *rq)
bfqq->queued[rq_is_sync(rq)]++;
bfqd->queued++;
 
+   if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) {
+   /*
+* Periodically reset inject limit, to make sure that
+* the latter eventually drops in case workload
+* changes, see step (3) in the comments on
+* bfq_update_inject_limit().
+*/
+   if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+msecs_to_jiffies(1000))) {
+   /* invalidate baseline total service time */
+   bfqq->last_serv_time_ns = 0;
+
+   /*
+* Reset pointer in case we are waiting for
+* some request completion.
+*/
+   bfqd->waited_rq = NULL;
+
+   /*
+* If bfqq has a short think time, then start
+* by setting the inject limit to 0
+* prudentially, because the service time of
+* an injected I/O request may be higher than
+* the think time of bfqq, and therefore, if
+* one request was injected when bfqq remains
+* empty, this injected request might delay
+* the service of the next I/O request for
+* bfqq significantly. In case bfqq can
+* actually tolerate some injection, then the
+* adaptive update will however raise the
+* limit soon. This lucky circumstance holds
+* exactly because bfqq has a short think
+* time, and thus, after remaining empty, is
+* likely to get new I/O enqueued---and then
+* completed---before being expired. This is
+* the very pattern that gives the
+* limit-update algorithm the chance to
+* measure the effect of injection on request
+* service times, and then to update the limit
+ 

[PATCH BUGFIX IMPROVEMENT 0/8] block, bfq: fix bugs, reduce exec time and boost performance

2019-03-07 Thread Paolo Valente
Hi,
since I didn't make it to submit these ones for 5.1, let me be
early for 5.2 :)

These patches fix some bug affecting performance, reduce execution
time a little bit, and boost throughput and responsiveness.

They are meant to be applied on top of the last series I submitted:
https://lkml.org/lkml/2019/1/29/368

Thanks,
Paolo

Francesco Pollicino (2):
  block, bfq: print SHARED instead of pid for shared queues in logs
  block, bfq: save & resume weight on a queue merge/split

Paolo Valente (6):
  block, bfq: increase idling for weight-raised queues
  block, bfq: do not idle for lowest-weight queues
  block, bfq: tune service injection basing on request service times
  block, bfq: do not merge queues on flash storage with queueing
  block, bfq: do not tag totally seeky queues as soft rt
  block, bfq: always protect newly-created queues from existing active
queues

 block/bfq-cgroup.c  |   3 +-
 block/bfq-iosched.c | 786 
 block/bfq-iosched.h |  87 +++--
 block/bfq-wf2q.c|   2 +-
 4 files changed, 704 insertions(+), 174 deletions(-)

--
2.20.1


[PATCH BUGFIX IMPROVEMENT 03/14] block, bfq: make sure queue budgets are not below service received

2019-01-29 Thread Paolo Valente
With some unlucky sequences of events, the function
bfq_updated_next_req updates the current budget of a bfq_queue to a
lower value than the service received by the queue using such a
budget. Unfortunately, if this happens, then the return value of the
function bfq_bfqq_budget_left becomes inconsistent.  This commit
solves this problem by lower-bounding the budget computed in
bfq_updated_next_req to the service currently charged to the queue.
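
As a quick numeric illustration of the lower bound (stand-alone
sketch; the values and the MAX3 helper are illustrative only):

#include <stdio.h>

#define MAX3(a, b, c) ((a) > (b) ? ((a) > (c) ? (a) : (c)) \
			       : ((b) > (c) ? (b) : (c)))

int main(void)
{
	/* Hypothetical values, in sectors. */
	unsigned long max_budget     = 100; /* current per-queue cap    */
	unsigned long serv_to_charge = 80;  /* charge for the next rq   */
	unsigned long service        = 120; /* service already received */

	/* Without the lower bound the budget could drop to 100 < 120,
	 * making bfq_bfqq_budget_left() inconsistent. */
	unsigned long new_budget = MAX3(max_budget, serv_to_charge, service);

	printf("new budget = %lu sectors\n", new_budget); /* prints 120 */
	return 0;
}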

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9ea2c4f42501..b0e8006475be 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -907,8 +907,10 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 */
return;
 
-   new_budget = max_t(unsigned long, bfqq->max_budget,
-  bfq_serv_to_charge(next_rq, bfqq));
+   new_budget = max_t(unsigned long,
+  max_t(unsigned long, bfqq->max_budget,
+bfq_serv_to_charge(next_rq, bfqq)),
+  entity->service);
if (entity->budget != new_budget) {
entity->budget = new_budget;
bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 12/14] block, bfq: port commit "cfq-iosched: improve hw_tag detection"

2019-01-29 Thread Paolo Valente
The original commit is
commit 1a1238a7dd48 ("cfq-iosched: improve hw_tag detection")
and has the following commit message.

If active queue hasn't enough requests and idle window opens, cfq will not
dispatch sufficient requests to hardware. In such situation, current code
will zero hw_tag. But this is because cfq doesn't dispatch enough requests
instead of hardware queue doesn't work. Don't zero hw_tag in such case.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 48b579032d14..2ab53d93ba12 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4786,6 +4786,8 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx 
*hctx,
 
 static void bfq_update_hw_tag(struct bfq_data *bfqd)
 {
+   struct bfq_queue *bfqq = bfqd->in_service_queue;
+
bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
   bfqd->rq_in_driver);
 
@@ -4801,6 +4803,17 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)
if (bfqd->rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
return;
 
+   /*
+* If active queue hasn't enough requests and can idle, bfq might not
+* dispatch sufficient requests to hardware. Don't zero hw_tag in this
+* case
+*/
+   if (bfqq && bfq_bfqq_has_short_ttime(bfqq) &&
+   bfqq->dispatched + bfqq->queued[0] + bfqq->queued[1] <
+   BFQ_HW_QUEUE_THRESHOLD &&
+   bfqd->rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
+   return;
+
if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
return;
 
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 04/14] block, bfq: remove case of redirected bic from insert_request

2019-01-29 Thread Paolo Valente
Before commit 18e5a57d7987 ("block, bfq: postpone rq preparation to
insert or merge"), the destination queue for a request was chosen by a
different hook than the one that then inserted the request. So,
between the execution of the two hooks, the bic of the process
generating the request could happen to be redirected to a different
bfq_queue. As a consequence, the destination bfq_queue stored in the
request could be wrong. Such an event does not need to be handled any
longer.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b0e8006475be..a9275ed57726 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4633,8 +4633,6 @@ static bool __bfq_insert_request(struct bfq_data *bfqd, 
struct request *rq)
bool waiting, idle_timer_disabled = false;
 
if (new_bfqq) {
-   if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
-   new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
/*
 * Release the request's reference to the old bfqq
 * and make sure one is taken to the shared queue.
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 10/14] block, bfq: fix queue removal from weights tree

2019-01-29 Thread Paolo Valente
bfq maintains an ordered list, through a red-black tree, of unique
weights of active bfq_queues. This list is used to detect whether
there are active queues with differentiated weights. The weight of a
queue is removed from the list when both the following two conditions
become true:
(1) the bfq_queue is flagged as inactive, and
(2) the bfq_queue has no in-flight requests any longer.

Unfortunately, in the rare cases where condition (2) becomes true
before condition (1), the removal fails, because the function to
remove the weight of the queue (bfq_weights_tree_remove) is rightly
invoked in the path that deactivates the bfq_queue, but mistakenly
invoked *before* the function that actually performs the deactivation
(bfq_deactivate_bfqq).

This commit moves the invocation of bfq_weights_tree_remove for
condition (1) to after bfq_deactivate_bfqq. As a consequence of this
move, it is necessary to add a further reference to the queue when the
weight of a queue is added, because the queue might otherwise be freed
before bfq_weights_tree_remove is invoked. This commit adds this
reference and makes all related modifications.
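
The ordering constraint and the extra reference can be modeled in
isolation as follows. This is a user-space sketch only: the types and
helpers are illustrative, while the idea (pin the queue while its
weight sits in the tree, drop the pin on removal) is the one described
above:

#include <stdio.h>

struct toy_queue {
	int ref;
	int has_weight_counter;
};

static void toy_put(struct toy_queue *q)
{
	if (--q->ref == 0)
		printf("queue freed\n");
}

static void toy_weights_tree_add(struct toy_queue *q)
{
	q->has_weight_counter = 1;
	q->ref++;                    /* pin the queue                  */
}

static void toy_weights_tree_remove(struct toy_queue *q)
{
	q->has_weight_counter = 0;
	toy_put(q);                  /* may free q: do not touch after */
}

int main(void)
{
	struct toy_queue q = { .ref = 1, .has_weight_counter = 0 };

	toy_weights_tree_add(&q);
	toy_put(&q);                 /* deactivation drops its own ref */
	toy_weights_tree_remove(&q); /* last ref: prints "queue freed" */
	return 0;
}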

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 17 +
 block/bfq-wf2q.c|  6 +++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 12228af16198..bf585ad29bb5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -754,6 +754,7 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
 
 inc_counter:
bfqq->weight_counter->num_active++;
+   bfqq->ref++;
 }
 
 /*
@@ -778,6 +779,7 @@ void __bfq_weights_tree_remove(struct bfq_data *bfqd,
 
 reset_entity_pointer:
bfqq->weight_counter = NULL;
+   bfq_put_queue(bfqq);
 }
 
 /*
@@ -789,9 +791,6 @@ void bfq_weights_tree_remove(struct bfq_data *bfqd,
 {
struct bfq_entity *entity = bfqq->entity.parent;
 
-   __bfq_weights_tree_remove(bfqd, bfqq,
- &bfqd->queue_weights_tree);
-
for_each_entity(entity) {
struct bfq_sched_data *sd = entity->my_sched_data;
 
@@ -825,6 +824,15 @@ void bfq_weights_tree_remove(struct bfq_data *bfqd,
bfqd->num_groups_with_pending_reqs--;
}
}
+
+   /*
+* Next function is invoked last, because it causes bfqq to be
+* freed if the following holds: bfqq is not in service and
+* has no dispatched request. DO NOT use bfqq after the next
+* function invocation.
+*/
+   __bfq_weights_tree_remove(bfqd, bfqq,
+ &bfqd->queue_weights_tree);
 }
 
 /*
@@ -1020,7 +1028,8 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_data *bfqd,
 
 static int bfqq_process_refs(struct bfq_queue *bfqq)
 {
-   return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
+   return bfqq->ref - bfqq->allocated - bfqq->entity.on_st -
+   (bfqq->weight_counter != NULL);
 }
 
 /* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index ce37d709a34f..63311d1ff1ed 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1673,15 +1673,15 @@ void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
 
bfqd->busy_queues[bfqq->ioprio_class - 1]--;
 
-   if (!bfqq->dispatched)
-   bfq_weights_tree_remove(bfqd, bfqq);
-
if (bfqq->wr_coeff > 1)
bfqd->wr_busy_queues--;
 
bfqg_stats_update_dequeue(bfqq_group(bfqq));
 
bfq_deactivate_bfqq(bfqd, bfqq, true, expiration);
+
+   if (!bfqq->dispatched)
+   bfq_weights_tree_remove(bfqd, bfqq);
 }
 
 /*
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 13/14] block, bfq: do not overcharge writes in asymmetric scenarios

2019-01-29 Thread Paolo Valente
Writes tend to starve reads. bfq counters this problem by overcharging
writes with an inflated service w.r.t. the actual service (number of
sectors written) they receive.

Yet this overcharging is useless, and actually causes unfairness in the
opposite direction, when bfq happens to be enforcing strong I/O
control. bfq does this enforcing when the scenario is asymmetric,
i.e., when some bfq_queue or group of bfq_queues is to be granted a
different bandwidth than some other bfq_queue or group of
bfq_queues. So, in such a scenario, this commit disables write
overcharging.
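
A minimal stand-alone model of the resulting charging rule (the
charge factor and all names here are illustrative, not the in-kernel
values):

#include <stdbool.h>
#include <stdio.h>

#define TOY_ASYNC_CHARGE_FACTOR 3  /* illustrative value */

/*
 * Toy model: async writes are overcharged only when the scenario is
 * symmetric, i.e., when bfq is not enforcing strong per-queue I/O
 * control anyway.
 */
static unsigned long toy_serv_to_charge(unsigned long sectors, bool sync,
					bool weight_raised, bool symmetric)
{
	if (sync || weight_raised || !symmetric)
		return sectors;
	return sectors * TOY_ASYNC_CHARGE_FACTOR;
}

int main(void)
{
	printf("async write, symmetric:  %lu\n",
	       toy_serv_to_charge(8, false, false, true));   /* 24 */
	printf("async write, asymmetric: %lu\n",
	       toy_serv_to_charge(8, false, false, false));  /*  8 */
	return 0;
}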

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2ab53d93ba12..06268449d2ca 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -888,7 +888,8 @@ static struct request *bfq_find_next_rq(struct bfq_data 
*bfqd,
 static unsigned long bfq_serv_to_charge(struct request *rq,
struct bfq_queue *bfqq)
 {
-   if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
+   if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 ||
+   !bfq_symmetric_scenario(bfqq->bfqd))
return blk_rq_sectors(rq);
 
return blk_rq_sectors(rq) * bfq_async_charge_factor;
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 05/14] block, bfq: consider also ioprio classes in symmetry detection

2019-01-29 Thread Paolo Valente
In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs
to be guaranteed a different bandwidth than other bfq_queues or
bfq_groups, these service guarantees can be provided only by plugging
I/O dispatch, completely or partially, when the queue in service
remains temporarily empty. A case where asymmetry is particularly
strong is when some active bfq_queues belong to a higher-priority
class than some other active bfq_queues. Unfortunately, this important
case is not considered at all in the code for detecting asymmetric
scenarios. This commit adds the missing logic.
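
The added sub-condition can be tried out in isolation with the toy
sketch below; busy[] mirrors the per-class bfqd->busy_queues[] array
used by the patch, everything else is illustrative:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy version of the new check: the scenario is asymmetric as soon as
 * queues from more than one I/O-priority class (RT, BE, IDLE) are
 * busy at the same time.
 */
static bool toy_multiple_classes_busy(const unsigned int busy[3])
{
	return (busy[0] && busy[1]) ||
	       (busy[0] && busy[2]) ||
	       (busy[1] && busy[2]);
}

int main(void)
{
	unsigned int only_be[3]   = { 0, 4, 0 };
	unsigned int rt_and_be[3] = { 1, 4, 0 };

	printf("only BE busy:   %d\n", toy_multiple_classes_busy(only_be));
	printf("RT and BE busy: %d\n", toy_multiple_classes_busy(rt_and_be));
	return 0;
}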

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 86 -
 block/bfq-iosched.h |  8 +++--
 block/bfq-wf2q.c| 12 +--
 3 files changed, 59 insertions(+), 47 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a9275ed57726..6bfbfa65610b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -623,26 +623,6 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct 
bfq_queue *bfqq)
bfqq->pos_root = NULL;
 }
 
-/*
- * Tell whether there are active queues with different weights or
- * active groups.
- */
-static bool bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
-{
-   /*
-* For queue weights to differ, queue_weights_tree must contain
-* at least two nodes.
-*/
-   return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
-   (bfqd->queue_weights_tree.rb_node->rb_left ||
-bfqd->queue_weights_tree.rb_node->rb_right)
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
-  ) ||
-   (bfqd->num_groups_with_pending_reqs > 0
-#endif
-  );
-}
-
 /*
  * The following function returns true if every queue must receive the
  * same share of the throughput (this condition is used when deciding
@@ -651,25 +631,48 @@ static bool 
bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
  *
  * Such a scenario occurs when:
  * 1) all active queues have the same weight,
- * 2) all active groups at the same level in the groups tree have the same
- *weight,
+ * 2) all active queues belong to the same I/O-priority class,
  * 3) all active groups at the same level in the groups tree have the same
+ *weight,
+ * 4) all active groups at the same level in the groups tree have the same
  *number of children.
  *
  * Unfortunately, keeping the necessary state for evaluating exactly
  * the last two symmetry sub-conditions above would be quite complex
- * and time consuming.  Therefore this function evaluates, instead,
- * only the following stronger two sub-conditions, for which it is
+ * and time consuming. Therefore this function evaluates, instead,
+ * only the following stronger three sub-conditions, for which it is
  * much easier to maintain the needed state:
  * 1) all active queues have the same weight,
- * 2) there are no active groups.
+ * 2) all active queues belong to the same I/O-priority class,
+ * 3) there are no active groups.
  * In particular, the last condition is always true if hierarchical
  * support or the cgroups interface are not enabled, thus no state
  * needs to be maintained in this case.
  */
 static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
 {
-   return !bfq_varied_queue_weights_or_active_groups(bfqd);
+   /*
+* For queue weights to differ, queue_weights_tree must contain
+* at least two nodes.
+*/
+   bool varied_queue_weights = !RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
+   (bfqd->queue_weights_tree.rb_node->rb_left ||
+bfqd->queue_weights_tree.rb_node->rb_right);
+
+   bool multiple_classes_busy =
+   (bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
+   (bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
+   (bfqd->busy_queues[1] && bfqd->busy_queues[2]);
+
+   /*
+* For queue weights to differ, queue_weights_tree must contain
+* at least two nodes.
+*/
+   return !(varied_queue_weights || multiple_classes_busy
+#ifdef BFQ_GROUP_IOSCHED_ENABLED
+  || bfqd->num_groups_with_pending_reqs > 0
+#endif
+   );
 }
 
 /*
@@ -728,15 +731,14 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
/*
 * In the unlucky event of an allocation failure, we just
 * exit. This will cause the weight of queue to not be
-* considered in bfq_varied_queue_weights_or_active_groups,
-* which, in its turn, causes the scenario to be deemed
-* wrongly symmetric in case bfqq's weight would have been
-* the only weight making the scenario asymmetric.  On the
-* bright side, no unbalance will however occur when bfqq
-* becomes inactive again (the invocation of this function
-* is triggered by

[PATCH BUGFIX IMPROVEMENT 07/14] block, bfq: do not plug I/O of in-service queue when harmful

2019-01-29 Thread Paolo Valente
If the in-service bfq_queue is sync and remains temporarily idle, then
I/O dispatching (from other queues) may be plugged. It may be done for
two reasons: either to boost throughput, or to preserve the bandwidth
share of the in-service queue. In the first case, if the I/O of the
in-service queue, when it finally arrives, consists only of one small
I/O request, then it makes sense to plug even the I/O of the
in-service queue. In fact, serving such a small request immediately is
likely to lower throughput instead of boosting it, whereas waiting a
little bit is likely to let that request grow, thanks to request
merging, and become more profitable in terms of throughput (this is
likely to happen exactly because the I/O of the queue has been
detected to boost throughput).

On the opposite end, if I/O dispatching is being plugged only to
preserve the bandwidth of the in-service queue, then it would be
better not to plug also the I/O of the in-service queue, because such
a plugging is likely to cause only loss of bandwidth for the queue.

Unfortunately, no distinction is made between the two cases, and the
I/O of the in-service queue is always plugged in case just a small I/O
request arrives. This commit draws this missing distinction and does
not perform harmful plugging.
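
The distinction boils down to a three-way condition on the arrival of
a lone small request for the in-service queue; a stand-alone toy
version of it follows (names are illustrative, the real check lives in
bfq_rq_enqueued() in the diff below):

#include <stdbool.h>
#include <stdio.h>

/*
 * Keep waiting (i.e., keep the device plugged) for the small request
 * to grow only if idling was chosen to boost throughput; if idling is
 * only preserving bandwidth guarantees, or the budget timed out,
 * dispatch immediately.
 */
static bool keep_waiting_for_merges(bool small_req,
				    bool idling_boosts_throughput,
				    bool budget_timeout)
{
	return small_req && idling_boosts_throughput && !budget_timeout;
}

int main(void)
{
	printf("small rq, idling for throughput: wait? %d\n",
	       keep_waiting_for_merges(true, true, false));   /* 1 */
	printf("small rq, idling for guarantees: wait? %d\n",
	       keep_waiting_for_merges(true, false, false));  /* 0 */
	return 0;
}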

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2756f4b1432b..a6fe60114ade 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4599,28 +4599,31 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, 
struct bfq_queue *bfqq,
bool budget_timeout = bfq_bfqq_budget_timeout(bfqq);
 
/*
-* There is just this request queued: if the request
-* is small and the queue is not to be expired, then
-* just exit.
+* There is just this request queued: if
+* - the request is small, and
+* - we are idling to boost throughput, and
+* - the queue is not to be expired,
+* then just exit.
 *
 * In this way, if the device is being idled to wait
 * for a new request from the in-service queue, we
 * avoid unplugging the device and committing the
-* device to serve just a small request. On the
-* contrary, we wait for the block layer to decide
-* when to unplug the device: hopefully, new requests
-* will be merged to this one quickly, then the device
-* will be unplugged and larger requests will be
-* dispatched.
+* device to serve just a small request. In contrast
+* we wait for the block layer to decide when to
+* unplug the device: hopefully, new requests will be
+* merged to this one quickly, then the device will be
+* unplugged and larger requests will be dispatched.
 */
-   if (small_req && !budget_timeout)
+   if (small_req && idling_boosts_thr_without_issues(bfqd, bfqq) &&
+   !budget_timeout)
return;
 
/*
-* A large enough request arrived, or the queue is to
-* be expired: in both cases disk idling is to be
-* stopped, so clear wait_request flag and reset
-* timer.
+* A large enough request arrived, or idling is being
+* performed to preserve service guarantees, or
+* finally the queue is to be expired: in all these
+* cases disk idling is to be stopped, so clear
+* wait_request flag and reset timer.
 */
bfq_clear_bfqq_wait_request(bfqq);
hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 08/14] block, bfq: unconditionally plug I/O in asymmetric scenarios

2019-01-29 Thread Paolo Valente
bfq detects the creation of multiple bfq_queues shortly after each
other, namely a burst of queue creations in the terminology used in
the code. If the burst is large, then no queue in the burst is granted
- either I/O-dispatch plugging when the queue remains temporarily
  idle while in service;
- or weight raising, because it causes even longer plugging.

In fact, such a plugging tends to lower throughput, while these bursts
are typically due to applications or services that spawn multiple
processes, to reach a common goal as soon as possible. Examples are a
"git grep" or the booting of a system.

Unfortunately, disabling plugging may cause a loss of service
guarantees in asymmetric scenarios, i.e., if queue weights are
differentiated or if more than one group is active.

This commit addresses this issue by no longer disabling I/O-dispatch
plugging for queues in large bursts.
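
In a nutshell, the change can be summarised by the toy comparison
below, which only models the "idling needed for service guarantees"
sub-condition; all names are illustrative, and the in-kernel check in
the diff below covers several further sub-conditions:

#include <stdbool.h>
#include <stdio.h>

static bool old_needed(bool asymmetric, bool in_large_burst)
{
	return asymmetric && !in_large_burst;
}

static bool new_needed(bool asymmetric, bool in_large_burst)
{
	(void)in_large_burst;   /* burst membership no longer vetoes plugging */
	return asymmetric;
}

int main(void)
{
	printf("asymmetric + large burst: old=%d new=%d\n",
	       old_needed(true, true), new_needed(true, true));
	return 0;
}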

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 346 +---
 1 file changed, 165 insertions(+), 181 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a6fe60114ade..c1bb5e5fcdc4 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3479,191 +3479,175 @@ static bool idling_boosts_thr_without_issues(struct 
bfq_data *bfqd,
bfqd->wr_busy_queues == 0;
 }
 
+/*
+ * There is a case where idling must be performed not for
+ * throughput concerns, but to preserve service guarantees.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, entails loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happen
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired
+ * device-throughput distribution only in a completely
+ * symmetric scenario where:
+ * (i)  each of these processes must get the same throughput as
+ *  the others;
+ * (ii) the I/O of each process has the same properties, in
+ *  terms of locality (sequential or random), direction
+ *  (reads or writes), request sizes, greediness
+ *  (from I/O-bound to sporadic), and so on.
+ * In fact, in such a scenario, the drive tends to treat
+ * the requests of each of these processes in about the same
+ * way as the requests of the others, and thus to provide
+ * each of these processes with about the same throughput
+ * (which is exactly the desired throughput distribution). In
+ * contrast, in any asymmetric scenario, device idling is
+ * certainly needed to guarantee that bfqq receives its
+ * assigned fraction of the device throughput (see [1] for
+ * details).
+ * The problem is that idling may significantly reduce
+ * throughput with certain combinations of types of I/O and
+ * devices. An important example is sync random I/O, on flash
+ * storage with command queueing. So, unless bfqq falls in the
+ * above cases where idling also boosts throughput, it would
+ * be important to check conditions (i) and (ii) accurately,
+ * so as to avoid idling when not strictly needed for service
+ * guarantees.
+ *
+ * Unfortunately, it is extremely difficult to thoroughly
+ * check condition (ii). And, in case there are active groups,
+ * it becomes very difficult to check condition (i) too. In
+ * fact, if there are active groups, then, for condition (i)
+ * to become false, it is enough that an active group contains
+ * more active processes or sub-groups than some other active
+ * group. More precisely, for condition (i) to hold because of
+ * such a group, it is not even necessary that the group is
+ * (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still
+ * have some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed
+ * their share of the throughput even after being
+ * dispatched. In this respect, it is easy to show that, if a
+ * group frequently becomes inactive while still having
+ * in-flight requests, and if, when this happens, the group is
+ * not considered in the calculation of whether the scenario
+ * is asymmetric, then the group may fail to be guaranteed its
+ * fair share of the throughput (basically because idling may
+ * not be performed for the descendant processes of the group,
+ * but it had to be)

[PATCH BUGFIX IMPROVEMENT 09/14] block, bfq: fix sequential rq detection in rate estimation

2019-01-29 Thread Paolo Valente
In bfq_update_peak_rate, to check whether an I/O request rq is
sequential, only the seek distance of rq w.r.t. the last request
dispatched is controlled. This is not sufficient for non-rotational
storage, where the size of rq is at least as relevant. This commit
adds the missing control.
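
A minimal stand-alone version of the refined test is sketched below;
the thresholds are illustrative placeholders, not the in-kernel
constants, and the macro in the diff below is the authoritative form:

#include <stdbool.h>
#include <stdio.h>

#define TOY_SEEK_THR        1024  /* illustrative, in sectors */
#define TOY_SECT_THR_NONROT 64    /* illustrative, in sectors */

/*
 * A request is seeky if the seek distance is large AND, on
 * non-rotational storage, the request is also small (a large request
 * is still worth treating as sequential there).
 */
static bool toy_rq_seeky(unsigned long long sdist, unsigned int sectors,
			 bool nonrot)
{
	return sdist > TOY_SEEK_THR &&
	       (!nonrot || sectors < TOY_SECT_THR_NONROT);
}

int main(void)
{
	printf("big jump, large rq, SSD: seeky? %d\n",
	       toy_rq_seeky(1 << 20, 256, true));   /* 0 */
	printf("big jump, small rq, SSD: seeky? %d\n",
	       toy_rq_seeky(1 << 20, 8, true));     /* 1 */
	return 0;
}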

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c1bb5e5fcdc4..12228af16198 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -235,6 +235,11 @@ static struct kmem_cache *bfq_pool;
 
 #define BFQQ_SEEK_THR  (sector_t)(8 * 100)
 #define BFQQ_SECT_THR_NONROT   (sector_t)(2 * 32)
+#define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
+   (get_sdist(last_pos, rq) >  \
+BFQQ_SEEK_THR &&   \
+(!blk_queue_nonrot(bfqd->queue) || \
+ blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 19)
 
@@ -2754,7 +2759,7 @@ static void bfq_update_peak_rate(struct bfq_data *bfqd, 
struct request *rq)
 
if ((bfqd->rq_in_driver > 0 ||
now_ns - bfqd->last_completion < BFQ_MIN_TT)
-&& get_sdist(bfqd->last_position, rq) < BFQQ_SEEK_THR)
+   && !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))
bfqd->sequential_samples++;
 
bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);
@@ -4511,10 +4516,7 @@ bfq_update_io_seektime(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
   struct request *rq)
 {
bfqq->seek_history <<= 1;
-   bfqq->seek_history |=
-   get_sdist(bfqq->last_request_pos, rq) > BFQQ_SEEK_THR &&
-   (!blk_queue_nonrot(bfqd->queue) ||
-blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT);
+   bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
 }
 
 static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 11/14] block, bfq: reduce threshold for detecting command queueing

2019-01-29 Thread Paolo Valente
bfq borrowed from cfq a simple heuristic for detecting whether the
drive performs command queueing: check whether the average number of
in-flight requests is above a given threshold. Unfortunately this
heuristic does fail to detect queueing (on drives with queueing) if
processes doing I/O are few and issue I/O with a low depth.

To reduce false negatives, this commit lowers the threshold.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bf585ad29bb5..48b579032d14 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -230,7 +230,7 @@ static struct kmem_cache *bfq_pool;
 #define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
 
 /* hw_tag detection: parallel requests threshold and min samples needed. */
-#define BFQ_HW_QUEUE_THRESHOLD 4
+#define BFQ_HW_QUEUE_THRESHOLD 3
 #define BFQ_HW_QUEUE_SAMPLES   32
 
 #define BFQQ_SEEK_THR  (sector_t)(8 * 100)
@@ -4798,7 +4798,7 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)
 * sum is not exact, as it's not taking into account deactivated
 * requests.
 */
-   if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+   if (bfqd->rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
return;
 
if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 06/14] block, bfq: split function bfq_better_to_idle

2019-01-29 Thread Paolo Valente
This is a preparatory commit for commits that need to check only one
of the two main reasons for idling. This change should also improve
the quality of the code a little bit, by splitting a function that
contains very long, non-trivial and loosely related comments.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 155 +++-
 1 file changed, 82 insertions(+), 73 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6bfbfa65610b..2756f4b1432b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3404,53 +3404,13 @@ static bool bfq_may_expire_for_budg_timeout(struct 
bfq_queue *bfqq)
bfq_bfqq_budget_timeout(bfqq);
 }
 
-/*
- * For a queue that becomes empty, device idling is allowed only if
- * this function returns true for the queue. As a consequence, since
- * device idling plays a critical role in both throughput boosting and
- * service guarantees, the return value of this function plays a
- * critical role in both these aspects as well.
- *
- * In a nutshell, this function returns true only if idling is
- * beneficial for throughput or, even if detrimental for throughput,
- * idling is however necessary to preserve service guarantees (low
- * latency, desired throughput distribution, ...). In particular, on
- * NCQ-capable devices, this function tries to return false, so as to
- * help keep the drives' internal queues full, whenever this helps the
- * device boost the throughput without causing any service-guarantee
- * issue.
- *
- * In more detail, the return value of this function is obtained by,
- * first, computing a number of boolean variables that take into
- * account throughput and service-guarantee issues, and, then,
- * combining these variables in a logical expression. Most of the
- * issues taken into account are not trivial. We discuss these issues
- * individually while introducing the variables.
- */
-static bool bfq_better_to_idle(struct bfq_queue *bfqq)
+static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
+struct bfq_queue *bfqq)
 {
-   struct bfq_data *bfqd = bfqq->bfqd;
bool rot_without_queueing =
!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
bfqq_sequential_and_IO_bound,
-   idling_boosts_thr, idling_boosts_thr_without_issues,
-   idling_needed_for_service_guarantees,
-   asymmetric_scenario;
-
-   if (bfqd->strict_guarantees)
-   return true;
-
-   /*
-* Idling is performed only if slice_idle > 0. In addition, we
-* do not idle if
-* (a) bfqq is async
-* (b) bfqq is in the idle io prio class: in this case we do
-* not idle because we want to minimize the bandwidth that
-* queues in this class can steal to higher-priority queues
-*/
-   if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
-   bfq_class_idle(bfqq))
-   return false;
+   idling_boosts_thr;
 
bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);
@@ -3482,8 +3442,7 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 bfqq_sequential_and_IO_bound);
 
/*
-* The value of the next variable,
-* idling_boosts_thr_without_issues, is equal to that of
+* The return value of this function is equal to that of
 * idling_boosts_thr, unless a special case holds. In this
 * special case, described below, idling may cause problems to
 * weight-raised queues.
@@ -3500,32 +3459,35 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 * which enqueue several requests in advance, and further
 * reorder internally-queued requests.
 *
-* For this reason, we force to false the value of
-* idling_boosts_thr_without_issues if there are weight-raised
-* busy queues. In this case, and if bfqq is not weight-raised,
-* this guarantees that the device is not idled for bfqq (if,
-* instead, bfqq is weight-raised, then idling will be
-* guaranteed by another variable, see below). Combined with
-* the timestamping rules of BFQ (see [1] for details), this
-* behavior causes bfqq, and hence any sync non-weight-raised
-* queue, to get a lower number of requests served, and thus
-* to ask for a lower number of requests from the request
-* pool, before the busy weight-raised queues get served
-* again. This often mitigates starvation problems in the
-* presence of heavy write workloads and NCQ, thereby
-* guaranteeing a higher application and system responsiveness
-* in these hostile scenarios.
+* For this reason, we force to false the return value if
+   

[PATCH BUGFIX IMPROVEMENT 02/14] block, bfq: avoid selecting a queue w/o budget

2019-01-29 Thread Paolo Valente
To boost throughput on devices with internal queueing and in scenarios
where device idling is not strictly needed, bfq immediately starts
serving a new bfq_queue if the in-service bfq_queue remains without
pending I/O, even if new I/O may arrive soon for the latter
queue. Then, if such I/O actually arrives soon, bfq preempts the new
in-service bfq_queue so as to give the previous queue a chance to go
on being served (in case the previous queue should actually be the one
to be served, according to its timestamps).

However, the in-service bfq_queue, say Q, may also be left without
further budget while it still has pending I/O. Since bfq changes budgets
dynamically to fit the needs of bfq_queues, this happens more often
than one may expect. If this happens, then there is no point in trying
to go on serving Q when new I/O arrives for it soon: Q would be
expired immediately after being selected for service. This would only
cause useless overhead. This commit avoids such a useless selection.
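
The extended check can be written down in isolation as the toy helper
below (names are illustrative; the real compound condition is in the
diff that follows):

#include <stdbool.h>
#include <stdio.h>

/*
 * It is worth keeping the expired in-service queue as a preemption
 * candidate only if, besides having its new I/O arrive in time, the
 * queue still has some budget left; otherwise it would be expired
 * again right after being selected.
 */
static bool toy_worth_preempting_for(bool non_blocking_wait_rq,
				     bool arrived_in_time,
				     long budget_left)
{
	return non_blocking_wait_rq && arrived_in_time && budget_left > 0;
}

int main(void)
{
	printf("in time, budget left: %d\n",
	       toy_worth_preempting_for(true, true, 64));  /* 1 */
	printf("in time, no budget:   %d\n",
	       toy_worth_preempting_for(true, true, 0));   /* 0 */
	return 0;
}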

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c7a4a15c7c19..9ea2c4f42501 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1380,7 +1380,15 @@ static bool bfq_bfqq_update_budg_for_activation(struct 
bfq_data *bfqd,
 {
struct bfq_entity *entity = &bfqq->entity;
 
-   if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time) {
+   /*
+* In the next compound condition, we check also whether there
+* is some budget left, because otherwise there is no point in
+* trying to go on serving bfqq with this same budget: bfqq
+* would be expired immediately after being selected for
+* service. This would only cause useless overhead.
+*/
+   if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time &&
+   bfq_bfqq_budget_left(bfqq) > 0) {
/*
 * We do not clear the flag non_blocking_wait_rq here, as
 * the latter is used in bfq_activate_bfqq to signal
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 14/14] block, bfq: fix in-service-queue check for queue merging

2019-01-29 Thread Paolo Valente
When a new I/O request arrives for a bfq_queue, say Q, bfq checks
whether that request is close to
(a) the head request of some other queue waiting to be served, or
(b) the last request dispatched for the in-service queue (in case Q
itself is not the in-service queue)

If a queue, say Q2, is found for which the above condition holds, then
bfq merges Q and Q2, to hopefully get a more sequential I/O in the
resulting merged queue, and thus a possibly higher throughput.

Case (b) is checked by comparing the new request for Q with the last
request dispatched, assuming that the latter necessarily belonged to
the in-service queue. Unfortunately, this assumption is no longer
always correct, since commit d0edc2473be9 ("block, bfq: inject
other-queue I/O into seeky idle queues on NCQ flash").

When the assumption does not hold, queues that must not be merged may
be merged, causing unexpected loss of control on per-queue service
guarantees.

This commit solves this problem by adding an extra field, which stores
the actual last request dispatched for the in-service queue, and by
using this new field to correctly check case (b).
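
A minimal sketch of the bookkeeping added by this patch follows; apart
from in_serv_last_pos, which is the field introduced below, the types
and helper names are illustrative only:

#include <stdio.h>

/*
 * Besides last_position (position reached by the last request
 * dispatched, from any queue), also track in_serv_last_pos, updated
 * only when the dispatched request belongs to the in-service queue.
 * The closeness check for case (b) then uses the latter.
 */
struct toy_sched {
	int in_service_queue;
	unsigned long long last_position;
	unsigned long long in_serv_last_pos;
};

static void toy_account_dispatch(struct toy_sched *s, int queue,
				 unsigned long long end_pos)
{
	s->last_position = end_pos;
	if (queue == s->in_service_queue)
		s->in_serv_last_pos = end_pos;
}

int main(void)
{
	struct toy_sched s = { .in_service_queue = 1 };

	toy_account_dispatch(&s, 1, 1000);  /* in-service queue          */
	toy_account_dispatch(&s, 2, 9000);  /* injected I/O, other queue */

	printf("last_position=%llu in_serv_last_pos=%llu\n",
	       s.last_position, s.in_serv_last_pos);  /* 9000 vs 1000 */
	return 0;
}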

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 5 -
 block/bfq-iosched.h | 3 +++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 06268449d2ca..4c592496a16a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2251,7 +2251,8 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
 
if (in_service_bfqq && in_service_bfqq != bfqq &&
likely(in_service_bfqq != >oom_bfqq) &&
-   bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
+   bfq_rq_close_to_sector(io_struct, request,
+  bfqd->in_serv_last_pos) &&
bfqq->entity.parent == in_service_bfqq->entity.parent &&
bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) {
new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
@@ -2791,6 +2792,8 @@ static void bfq_update_peak_rate(struct bfq_data *bfqd, 
struct request *rq)
bfq_update_rate_reset(bfqd, rq);
 update_last_values:
bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+   if (RQ_BFQQ(rq) == bfqd->in_service_queue)
+   bfqd->in_serv_last_pos = bfqd->last_position;
bfqd->last_dispatch = now_ns;
 }
 
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 30be669be465..062e1c4787f4 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -538,6 +538,9 @@ struct bfq_data {
/* on-disk position of the last served request */
sector_t last_position;
 
+   /* position of the last served request for the in-service queue */
+   sector_t in_serv_last_pos;
+
/* time of last request completion (ns) */
u64 last_completion;
 
-- 
2.20.1



[PATCH BUGFIX IMPROVEMENT 00/14] batch of patches for next linux release

2019-01-29 Thread Paolo Valente
Hi,
this batch of patches provides fixes and improvements for throughput
and latency. Every patch has been under test for at least one month,
some patches for much longer.

Thanks,
Paolo

Paolo Valente (14):
  block, bfq: do not consider interactive queues in srt filtering
  block, bfq: avoid selecting a queue w/o budget
  block, bfq: make sure queue budgets are not below service received
  block, bfq: remove case of redirected bic from insert_request
  block, bfq: consider also ioprio classes in symmetry detection
  block, bfq: split function bfq_better_to_idle
  block, bfq: do not plug I/O of in-service queue when harmful
  block, bfq: unconditionally plug I/O in asymmetric scenarios
  block, bfq: fix sequential rq detection in rate estimation
  block, bfq: fix queue removal from weights tree
  block, bfq: reduce threshold for detecting command queueing
  block, bfq: port commit "cfq-iosched: improve hw_tag detection"
  block, bfq: do not overcharge writes in asymmetric scenarios
  block, bfq: fix in-service-queue check for queue merging

 block/bfq-iosched.c | 705 
 block/bfq-iosched.h |  11 +-
 block/bfq-wf2q.c|  18 +-
 3 files changed, 400 insertions(+), 334 deletions(-)

--
2.20.1


[PATCH BUGFIX IMPROVEMENT 01/14] block, bfq: do not consider interactive queues in srt filtering

2019-01-29 Thread Paolo Valente
The speed at which a bfq_queue receives I/O is one of the parameters
by which bfq decides whether the queue is soft real-time (i.e.,
whether the queue contains the I/O of a soft real-time
application). In particular, when a bfq_queue remains without
outstanding I/O requests, bfq computes the minimum time instant, named
soft_rt_next_start, at which the next request of the queue may arrive
for the queue to be deemed as soft real time.

Unfortunately this filtering may cause problems with a queue in
interactive weight raising. In fact, such a queue may be conveying the
I/O needed to load a soft real-time application. The latter will
actually exhibit a soft real-time I/O pattern after it finally starts
doing its job. But, if soft_rt_next_start is updated for an
interactive bfq_queue, and the queue has received a lot of service
before remaining with no outstanding request (likely to happen on a
fast device), then soft_rt_next_start is assigned such a high value
that, for a very long time, the queue is prevented from being possibly
considered as soft real time.

This commit removes the updating of soft_rt_next_start for bfq_queues
in interactive weight raising.
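
The condition under which soft_rt_next_start still gets recomputed can
be modeled with the toy helper below (in the kernel, interactive
weight raising is detected by comparing bfqq->wr_coeff with the base
coefficient bfqd->bfq_wr_coeff, as in the diff that follows; the names
here are illustrative):

#include <stdbool.h>
#include <stdio.h>

static bool toy_update_soft_rt(unsigned int dispatched,
			       bool interactive_weight_raised)
{
	/* Recompute only if the queue has no outstanding I/O and is
	 * not in interactive weight raising. */
	return dispatched == 0 && !interactive_weight_raised;
}

int main(void)
{
	printf("idle, not weight-raised: %d\n", toy_update_soft_rt(0, false));
	printf("idle, interactive wr:    %d\n", toy_update_soft_rt(0, true));
	return 0;
}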

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 39 +--
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index cd307767a134..c7a4a15c7c19 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3274,16 +3274,32 @@ void bfq_bfqq_expire(struct bfq_data *bfqd,
 * requests, then the request pattern is isochronous
 * (see the comments on the function
 * bfq_bfqq_softrt_next_start()). Thus we can compute
-* soft_rt_next_start. If, instead, the queue still
-* has outstanding requests, then we have to wait for
-* the completion of all the outstanding requests to
-* discover whether the request pattern is actually
-* isochronous.
+* soft_rt_next_start. And we do it, unless bfqq is in
+* interactive weight raising. We do not do it in the
+* latter subcase, for the following reason. bfqq may
+* be conveying the I/O needed to load a soft
+* real-time application. Such an application will
+* actually exhibit a soft real-time I/O pattern after
+* it finally starts doing its job. But, if
+* soft_rt_next_start is computed here for an
+* interactive bfqq, and bfqq had received a lot of
+* service before remaining with no outstanding
+* request (likely to happen on a fast device), then
+* soft_rt_next_start would be assigned such a high
+* value that, for a very long time, bfqq would be
+* prevented from being possibly considered as soft
+* real time.
+*
+* If, instead, the queue still has outstanding
+* requests, then we have to wait for the completion
+* of all the outstanding requests to discover whether
+* the request pattern is actually isochronous.
 */
-   if (bfqq->dispatched == 0)
+   if (bfqq->dispatched == 0 &&
+   bfqq->wr_coeff != bfqd->bfq_wr_coeff)
bfqq->soft_rt_next_start =
bfq_bfqq_softrt_next_start(bfqd, bfqq);
-   else {
+   else if (bfqq->dispatched > 0) {
/*
 * Schedule an update of soft_rt_next_start to when
 * the task may be discovered to be isochronous.
@@ -4834,11 +4850,14 @@ static void bfq_completed_request(struct bfq_queue 
*bfqq, struct bfq_data *bfqd)
 * isochronous, and both requisites for this condition to hold
 * are now satisfied, then compute soft_rt_next_start (see the
 * comments on the function bfq_bfqq_softrt_next_start()). We
-* schedule this delayed check when bfqq expires, if it still
-* has in-flight requests.
+* do not compute soft_rt_next_start if bfqq is in interactive
+* weight raising (see the comments in bfq_bfqq_expire() for
+* an explanation). We schedule this delayed update when bfqq
+* expires, if it still has in-flight requests.
 */
if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
-   RB_EMPTY_ROOT(>sort_list))
+   RB_EMPTY_ROOT(>sort_list) &&
+   bfqq->wr_coeff != bfqd->bfq_wr_coeff)
bfqq->soft_rt_next_start =
bfq_bfqq_softrt_next_start(bfqd, bfqq);
 
-- 
2.20.1



Re: [PATCH BUGFIX RFC 0/2] reverting two commits causing freezes

2019-01-18 Thread Paolo Valente



> On 18 Jan 2019, at 14:35, Jens Axboe  wrote:
> 
> On 1/18/19 4:52 AM, Paolo Valente wrote:
>> Hi Jens,
>> a user reported a warning, followed by freezes, in case he increases
>> nr_requests to more than 64 [1]. After reproducing the issues, I
>> reverted the commit f0635b8a416e ("bfq: calculate shallow depths at
>> init time"), plus the related commit bd7d4ef6a4c9 ("bfq-iosched:
>> remove unused variable"). The problem went away.
> 
> For reverts, please put the justification into the actual revert
> commit. With this series, if applied as-is, we'd have two patches
> in the tree that just says "revert X" without any hint as to why
> that was done.
> 

I forgot to say explicitly that these patches were meant only to give
you and anybody else something concrete to test and check.

With me you're as safe as houses, in terms of amount of comments in
final patches :)

>> Maybe the assumption in commit f0635b8a416e ("bfq: calculate shallow
>> depths at init time") does not hold true?
> 
> It apparently doesn't! But let's try and figure this out instead of
> blindly reverting it.

Totally agree.

> OK, I think I see it. For the sched_tags
> case, when we grow the requests, we allocate a new set. Hence any
> old cache would be stale at that point.
> 

ok

> How about something like this? It still keeps the code of having
> to update this out of the hot IO path, and only calls it when we
> actually change the depths.
> 

Looks rather clean and efficient.

> Totally untested...
> 

It seems to work here too.

Thanks,
Paolo

> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index cd307767a134..b09589915667 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -5342,7 +5342,7 @@ static unsigned int bfq_update_depths(struct bfq_data *bfqd,
>   return min_shallow;
> }
> 
> -static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
> +static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
> {
>   struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>   struct blk_mq_tags *tags = hctx->sched_tags;
> @@ -5350,6 +5350,11 @@ static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
> 
>   min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
>   sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
> +}
> +
> +static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
> +{
> + bfq_depth_updated(hctx);
>   return 0;
> }
> 
> @@ -5772,6 +5777,7 @@ static struct elevator_type iosched_bfq_mq = {
>   .requests_merged= bfq_requests_merged,
>   .request_merged = bfq_request_merged,
>   .has_work   = bfq_has_work,
> + .depth_updated  = bfq_depth_updated,
>   .init_hctx  = bfq_init_hctx,
>   .init_sched = bfq_init_queue,
>   .exit_sched = bfq_exit_queue,
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ba37b9e15e9..a047b297ade5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3101,6 +3101,8 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>   }
>   if (ret)
>   break;
> + if (q->elevator && q->elevator->type->ops.depth_updated)
> + q->elevator->type->ops.depth_updated(hctx);
>   }
> 
>   if (!ret)
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index 2e9e2763bf47..6e8bc53740f0 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -31,6 +31,7 @@ struct elevator_mq_ops {
>   void (*exit_sched)(struct elevator_queue *);
>   int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
>   void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
> + void (*depth_updated)(struct blk_mq_hw_ctx *);
> 
>   bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
>   bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
> 
> -- 
> Jens Axboe
> 



Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-18 Thread Paolo Valente



> On 18 Jan 2019, at 17:35, Josef Bacik wrote:
> 
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
>> This is a redesign of my old cgroup-io-throttle controller:
>> https://lwn.net/Articles/330531/
>> 
>> I'm resuming this old patch to point out a problem that I think is still
>> not solved completely.
>> 
>> = Problem =
>> 
>> The io.max controller works really well at limiting synchronous I/O
>> (READs), but a lot of I/O requests are initiated outside the context of
>> the process that is ultimately responsible for its creation (e.g.,
>> WRITEs).
>> 
>> Throttling at the block layer in some cases is too late and we may end
>> up slowing down processes that are not responsible for the I/O that
>> is being processed at that level.
> 
> How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> properly.  So if you dirty a bunch of pages, they are associated with your
> cgroup, and then writeback happens and it's done in the writeback thread
> associated with your cgroup and then that is throttled.  Then you are
> throttled at balance_dirty_pages() because the writeout is taking longer.
> 

IIUC, Andrea described this problem: certain processes in a certain group
dirty a lot of pages, causing write back to start.  Then some other blameless
process in the same group experiences very high latency, in spite of
the fact that it has to do little I/O.

Does your blk_cgroup_congested() stuff solve this issue?

Or maybe I simply didn't get what Andrea meant at all :)

Thanks,
Paolo

> I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> clearly tie IO to the thing generating the IO, such as readahead and such.  If
> you are running into this case that may be something worth using.  Course it
> only works for io.latency now but there's no reason you can't add support
> to it for io.max or whatever.
> 
>> 
>> = Proposed solution =
>> 
>> The main idea of this controller is to split I/O measurement and I/O
>> throttling: I/O is measured at the block layer for READS, at page cache
>> (dirty pages) for WRITEs, and processes are limited while they're
>> generating I/O at the VFS level, based on the measured I/O.
>> 
> 
> This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> looking into that route and simply changing the existing io controller you are
> using to take advantage of that so it will actually throttle things.  Then 
> just
> sprinkle it around the areas where we indirectly generate IO.  Thanks,
> 
> Josef
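
For reference, the usage pattern Josef describes above would look roughly
like the sketch below. Only blk_cgroup_congested() is a real helper; the
surrounding function, the request type and submit_optional_io() are
hypothetical placeholders used purely for illustration:

#include <linux/blk-cgroup.h>	/* for blk_cgroup_congested() */

struct hypothetical_request;	/* placeholder type, not a real kernel struct */
void submit_optional_io(struct hypothetical_request *rq);	/* hypothetical helper */

/*
 * Sketch only: before issuing optional IO on behalf of the current task
 * (a readahead-like path), back off if the task's blkcg is already being
 * throttled instead of piling more work onto it.
 */
static void maybe_issue_optional_io(struct hypothetical_request *rq)
{
	if (blk_cgroup_congested())
		return;

	submit_optional_io(rq);
}

As Josef notes, today this check only reflects io.latency throttling, but the
same call site could be reused if io.max (or another controller) were wired
into it.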



[PATCH BUGFIX RFC 0/2] reverting two commits causing freezes

2019-01-18 Thread Paolo Valente
Hi Jens,
a user reported a warning, followed by freezes, in case he increases
nr_requests to more than 64 [1]. After reproducing the issues, I
reverted the commit f0635b8a416e ("bfq: calculate shallow depths at
init time"), plus the related commit bd7d4ef6a4c9 ("bfq-iosched:
remove unused variable"). The problem went away.

Maybe the assumption in commit f0635b8a416e ("bfq: calculate shallow
depths at init time") does not hold true?

Thanks,
Paolo

[1] https://bugzilla.kernel.org/show_bug.cgi?id=200813

Paolo Valente (2):
  Revert "bfq-iosched: remove unused variable"
  Revert "bfq: calculate shallow depths at init time"

 block/bfq-iosched.c | 116 ++--
 block/bfq-iosched.h |   6 +++
 2 files changed, 63 insertions(+), 59 deletions(-)

--
2.20.1

