Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq

2017-03-19 Thread Paolo Valente

> Il giorno 18 mar 2017, alle ore 13:46, Bart Van Assche 
>  ha scritto:
> 
> On Sat, 2017-03-18 at 18:09 +0100, Linus Walleij wrote:
>> On Sat, Mar 18, 2017 at 11:52 AM, Paolo Valente
>>  wrote:
>>>> Il giorno 14 mar 2017, alle ore 16:32, Bart Van Assche 
>>>>  ha scritto:
>>>> (...) what should
>>>> a developer do who only has access to a small subset of all the storage
>>>> devices that are supported by the Linux kernel and hence who can not
>>>> run the benchmark against every supported storage device?
>> 
>> Don't we use the community for that? We are dependent on people
>> downloading and testing our code eventually, I mean sure it's good if
>> we make some reasonable effort to test changes we do, but we are
>> only humans, and we get corrected by the experience of other humans.
> 
> Hello Linus,
> 
> Do you mean relying on the community to test other storage devices before
> or after a patch is upstream? Relying on the community to file bug reports
> after a patch is upstream would be wrong. The Linux kernel should not be
> used for experiments. As you know patches that are sent upstream should
> not introduce regressions.
> 
> My primary concern about BFQ is that it is a very complicated I/O scheduler
> and also that the concepts used internally in that I/O scheduler are far
> away from the concepts we are used to when reasoning about I/O devices.

Hi Bart,
could you elaborate a little bit more on this?  To hopefully help you
highlight where the problem is, here is a summary of what the patches
introduce.

1.  BFQ engine.  This initial piece of code has been obtained mainly
by copying CFQ verbatim, replacing all cfq_ prefixes with bfq_,
replacing the round-robin algorithm at the heart of CFQ with WF2Q+, a
well-known and widely studied variant of the classical WFQ algorithm,
and, finally, by adapting the code around the new engine to accommodate
the latter.  In particular, budgets, measured in number of sectors, are
used instead of time slices, to achieve bandwidth fairness (a toy
sketch of this budget-based service is shown right after this list).

2. Support for cgroups and hierarchical scheduling.

3.  Heuristics to improve service quality and boost throughput.  These
additional pieces are introduced and documented one by one.  The most
complex are: improving responsiveness by privileging the I/O of
interactive applications; improving audio/video playback/streaming by
privileging their I/O; boosting throughput with interleaved I/O (such
as KVM I/O) by merging the queues associated with the processes doing
such I/O; and boosting throughput for applications that span several
processes.
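
As a concrete illustration of the budget-based service mentioned in
point 1, here is a toy sketch (purely illustrative, not the actual BFQ
code): a queue is charged by the number of sectors it is served, and
is expired when its budget is exhausted.

struct toy_queue {
	unsigned int budget;	/* sectors the queue may still consume */
	unsigned int weight;	/* share of the device bandwidth */
};

/* Charge the in-service queue for a just-completed request, and report
 * whether the queue has exhausted its budget and must be expired. */
static bool toy_charge_service(struct toy_queue *q, unsigned int sectors)
{
	if (sectors >= q->budget) {
		q->budget = 0;
		return true;	/* expire the queue */
	}
	q->budget -= sectors;
	return false;
}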

Which of these contributions contain deviations from the I/O concepts
you are used to, and what are these deviations?

Thanks,
Paolo


> I'm
> concerned that this will make the BFQ I/O scheduler hard to maintain.
> 
> Bart.



Re: [PATCH RFC 10/14] block, bfq: add Early Queue Merge (EQM)

2017-03-15 Thread Paolo Valente

> Il giorno 07 mar 2017, alle ore 18:44, Jens Axboe  ha 
> scritto:
> 
> On 03/04/2017 09:01 AM, Paolo Valente wrote:
>> @@ -560,6 +600,15 @@ struct bfq_data {
>>  struct bfq_io_cq *bio_bic;
>>  /* bfqq associated with the task issuing current bio for merging */
>>  struct bfq_queue *bio_bfqq;
>> +
>> +/*
>> + * io context to put right after bfqd->lock is released. This
>> + * filed is used to perform put_io_context, when needed, to
>> + * after the scheduler lock has been released, and thus
>> + * prevent an ioc->lock from being possibly taken while the
>> + * scheduler lock is being held.
>> + */
>> +struct io_context *ioc_to_put;
>> };
> 
> The logic around this is nasty, effectively you end up having locking
> around sections of code instead of structures, which is never a good
> idea.
> 
> The helper functions for unlocking and dropping the ioc add to the mess
> as well.
> 

Hi Jens,
fortunately I seem to have found and fixed the bug causing the failure
you reported in one of your previous emails, so I've started addressing
the issue you raise here.  But your suggestion below raised doubts
that I was not able to resolve.  So I'm bailing out and asking for help.

> Can't we simply pass back a pointer to an ioc to free? That should be
> possible, given that we must have grabbed the bfqd lock ourselves
> further up in the call chain. So we _know_ that we'll drop it later on.
> If that wasn't the case, the existing logic wouldn't work.
> 

One of the two functions that discover that an ioc has to be freed,
namely __bfq_bfqd_reset_in_service, is invoked at the end of several
relatively long chains of function invocations.  The heads of these
chains take and release the scheduler lock.  One example is:

bfq_dispatch_request -> __bfq_dispatch_request -> bfq_select_queue -> 
bfq_bfqq_expire  -> __bfq_bfqq_expire -> __bfq_bfqd_reset_in_service

To implement your proposal, all the functions involved in these chains
would have to be extended to pass back the ioc to put.  The resulting,
heavy version of the code seems really inadvisable, and prone to errors
whenever one modifies or extends such a chain.

So I have certainly misunderstood something.  As usual, to help you
help me more quickly, here is a summary of what I have understood on
this matter.

1.  For a similar, if not exactly the same, lock-nesting issue related
to io-context putting, deferred work is used.  Deferred work is probably
used for other reasons as well, but it certainly solves this issue too.

2.  My solution (which I'm not defending; I'm just trying to
understand) addresses the same issue: it puts the io context only
after the scheduler lock has been released.  But it does so with no
work-queueing overhead.  Instead of queueing work, it 'queues' the ioc
to put, and puts it right after releasing the scheduler lock.

Where is my mistake?  And what is the correct interpretation of your
proposal to pass back the pointer (instead of storing it in a field of
the device data structure)?
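
To make the comparison concrete, here is a toy sketch of the two
alternatives as I understand them (types and names are purely
illustrative, not the actual BFQ code):

struct toy_bfqd {
	spinlock_t lock;			/* scheduler lock */
	struct io_context *in_service_ioc;
	struct io_context *ioc_to_put;		/* used only by alternative A */
};

/* Alternative A (current patch): the function that discovers the ioc
 * to put stores it in a field of the device data; the outermost caller
 * puts it right after releasing the scheduler lock. */
static void toy_reset_in_service(struct toy_bfqd *bfqd)
{
	bfqd->ioc_to_put = bfqd->in_service_ioc;
	bfqd->in_service_ioc = NULL;
}

/* Alternative B (my understanding of your proposal): the ioc is passed
 * back to the caller, and every function in the chain, up to the one
 * that releases the scheduler lock, must propagate it. */
static struct io_context *toy_reset_in_service_ret(struct toy_bfqd *bfqd)
{
	struct io_context *ioc = bfqd->in_service_ioc;

	bfqd->in_service_ioc = NULL;
	return ioc;
}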

Thanks,
Paolo

> -- 
> Jens Axboe
> 



Re: [PATCH RFC 10/14] block, bfq: add Early Queue Merge (EQM)

2017-03-15 Thread Paolo Valente

> Il giorno 15 mar 2017, alle ore 17:30, Jens Axboe  ha 
> scritto:
> 
> On 03/15/2017 09:47 AM, Jens Axboe wrote:
>> I think you understood me correctly. Currently I think the putting of
>> the io context is somewhat of a mess. You have seemingly random places
>> where you have to use special unlock functions, to ensure that you
>> notice that some caller deeper down has set ->ioc_to_put. I took a quick
>> look at it, and by far most of the cases can return an io_context to
>> free quite easily. You can mark these functions __must_check to ensure
>> that we don't drop an io_context, inadvertently. That's already a win
>> over the random ->ioc_to_put store. And you can then get rid of
>> bfq_unlock_put_ioc and it's irq variant as well.
>> 
>> The places where you are already returning a value, like off dispatch
>> for instance, you can just pass in a pointer to an io_context pointer.
>> 
>> If you get this right, it'll be a lot less fragile and hacky than your
>> current approach.
> 
> Even just looking a little closer, you also find cases where you
> potentially twice store ->ioc_to_put. That kind of mixup can't happen if
> you return it properly.
> 
> In __bfq_dispatch_request(), for instance. You call bfq_select_queue(),
> and that in turn calls bfq_bfqq_expire(), which calls
> __bfq_bfqq_expire() which can set ->ioc_to_put. But later on,
> __bfq_dispatch_request() calls bfq_dispatch_rq_from_bfqq(), which in
> turn calls bfq_bfqq_expire() that can also set ->ioc_to_put. There's no
> "magic" bfq_unlock_and_put_ioc() in-between those. Maybe the former call
> never sets ->ioc_to_put if it returns with bfqq == NULL? Hard to tell.
> 
> Or __bfq_insert_request(), it calls bfq_add_request(), which may set
> ->ioc_to_put through bfq_bfqq_handle_idle_busy_switch() ->
> bfq_bfqq_expire(). And then from calling bfq_rq_enqueued() ->
> bfq_bfqq_expire().
> 

I have checked that.  Basically, since a queue can't be expired twice,
ioc_to_put should never be set twice before being used.  Yet, I do
agree that using a shared field and relying on side effects makes the
code very complex and fragile (maybe even buggy, if my speculative
check is wrong).  It has simply been the best solution I found to avoid
deferred work, as you asked.  In fact, I still find the alternative of
passing an ioc pointer back and forth across seven or eight nested
functions quite heavy.

> There might be more, but I think the above is plenty of evidence that
> the current ->ioc_to_put solution is a bad hack, fragile, and already
> has bugs.
> 
> How often do you expect this putting of the io_context to happen?

Unfortunately often, as it must also be done every time the in-service
queue is reset.  But, in this respect, are we sure that we really need
to grab a reference to the ioc when we set a queue in service (as done
in cfq, and copied into bfq)?  I mean, we have the exit_ioc hook for
detecting when an ioc goes away.  Am I missing something here too?

Thanks,
Paolo

> If
> it's not a very frequent occurence, maybe using a deferred workqueue to
> put it IS the right solution. As it currently stands, the code doesn't
> really work, and it's fragile. It can't be cleaned up without
> refactoring, since the call paths are all extremely intermingled.
> 
> -- 
> Jens Axboe
> 



Re: [PATCH RFC 10/14] block, bfq: add Early Queue Merge (EQM)

2017-03-15 Thread Paolo Valente

> Il giorno 15 mar 2017, alle ore 17:56, Jens Axboe  ha 
> scritto:
> 
> On 03/04/2017 09:01 AM, Paolo Valente wrote:
>> @@ -6330,7 +7012,41 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, 
>> struct bfq_queue *bfqq,
>> 
>> static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
>> {
>> -struct bfq_queue *bfqq = RQ_BFQQ(rq);
>> +struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
>> +
>> +/*
>> + * An unplug may trigger a requeue of a request from the device
>> + * driver: make sure we are in process context while trying to
>> + * merge two bfq_queues.
>> + */
>> +if (!in_interrupt()) {
> 
> What's the reason for this?

None :(

Just pre-existing, working code that I did not update, sorry.

> Don't use in_interrupt() to guide any of
> your decision making here.
> 

Of course, sorry for these silly mistakes.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



Re: races between blk-cgroup operations and I/O scheds in blk-mq (?)

2017-05-18 Thread Paolo Valente

> Il giorno 17 mag 2017, alle ore 21:12, Tejun Heo  ha scritto:
> 
> Hello,
> 
> On Mon, May 15, 2017 at 09:49:13PM +0200, Paolo Valente wrote:
>> So, unless you tell me that there are other races I haven't seen, or,
>> even worse, that I'm just talking nonsense, I have thought of a simple
>> solution to address this issue without resorting to the request_queue
>> lock: further caching, on blkg lookups, the only policy or blkg data
>> the scheduler may use, and access this data directly when needed.  By
>> doing so, the issue is reduced to the occasional use of stale data.
>> And apparently this already happens, e.g., in cfq when it uses the
>> weight of a cfq_queue associated with a process whose group has just
>> been changed (and for which a blkg_lookup has not yet been invoked).
>> The same should happen when cfq invokes cfq_log_cfqq for such a
>> cfq_queue, as this function prints the path of the group the bfq_queue
>> belongs to.
> 
> I haven't studied the code but the problem sounds correct to me.  All
> of blkcg code assumes the use of rq lock.  And, yeah, none of the hot
> paths requires strong synchronization.  All the actual management
> operations can be synchronized separately and the hot lookup path can
> be protected with rcu and maybe percpu reference counters.
> 

Great, thanks for this ack.  User reports do confirm the problem, and,
so far, the effectiveness of a solution I have implemented.  I'm
finalizing the patch for submission.
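
Just to make the idea concrete, here is a minimal sketch of the kind of
caching I have in mind (names are illustrative, not the actual patch):
on each blkg lookup, the few pieces of blkg/policy data the scheduler
may touch later are copied into scheduler-owned storage, so that later
hooks never dereference the blkg itself.

/* Per-group data owned by the scheduler, refreshed on every
 * (RCU-protected) blkg lookup. */
struct toy_group_cache {
	unsigned short weight;	/* last group weight seen */
	char path[128];		/* cached cgroup path, used only for logging */
};

static void toy_refresh_group_cache(struct toy_group_cache *cache,
				    unsigned short weight, const char *path)
{
	cache->weight = weight;
	strscpy(cache->path, path, sizeof(cache->path));
}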

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-22 Thread Paolo Valente

> Il giorno 19 mag 2017, alle ore 16:37, Jens Axboe  ha 
> scritto:
> 
> On 05/19/2017 02:39 AM, Paolo Valente wrote:
>> @@ -692,8 +725,7 @@ void bfq_pd_offline(struct blkg_policy_data *pd)
>>  /*
>>   * The idle tree may still contain bfq_queues belonging
>>   * to exited task because they never migrated to a different
>> - * cgroup from the one being destroyed now.  No one else
>> - * can access them so it's safe to act without any lock.
>> ++* cgroup from the one being destroyed now.
>>   */
>>  bfq_flush_idle_tree(st);
>> 
> 
> Looks like an extra '+' snuck into that hunk.
> 

Yes, sorry.  Before possibly submitting a fixed version, I'll wait for
a reply on my previous email in this thread, as the issue now seems
more serious to me, and seems to affect CFQ too.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-24 Thread Paolo Valente

> Il giorno 23 mag 2017, alle ore 21:42, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:
>> Consider a process or a group that is moved from a given source group
>> to a different group, or simply removed from a group (although I
>> didn't yet succeed in just removing a process from a group :) ).  The
>> pointer to the [b|c]fq_group contained in the schedulable entity
>> belonging to the source group *is not* updated, in BFQ, if the entity
>> is idle, and *is not* updated *unconditionally* in CFQ.  The update
>> will happen in bfq_get_rq_private or cfq_set_request, on the arrival
>> of a new request.  But, if the move happens right after the arrival of
>> a request, then all the scheduler functions executed until a new
>> request arrives for that entity will see a stale [b|c]fq_group.  Much
> 
> Limited staleness is fine.  Especially in this case, it isn't too
> weird to claim that the order between the two operations isn't clearly
> defined.
> 

ok

>> worse, if also a blkcg_deactivate_policy or a blkg_destroy are
>> executed right after the move, then both the policy data pointed by
>> the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
>> So, all the functions of the scheduler invoked before next request
>> arrival may use dangling references!
> 
> Hmm... but cfq_group is allocated along with blkcg and blkcg always
> ensures that there are no blkg left before freeing the pd area in
> blkcg_css_offline().
> 

Exactly, but even after all blkgs, as well as the cfq_group and pd, are
gone, the child cfq_queues of the gone cfq_group continue to point
to nonexistent objects, until new cfq_set_requests are executed for
those cfq_queues.  To try to make this statement clearer, here is the
critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
say cfqg:

1 cfq_set_request for a request rq of cfqq
2 removal of (the process associated with cfqq) from cfqg
3 destruction of the blkg that cfqg is associated with
4 destruction of the blkcg the above blkg belongs to
5 destruction of the pd pointed to by cfqg, and of cfqg itself
!!!-> from now on cfqq->cfqg is a dangling reference <-!!!
6 execution of cfq functions, different from cfq_set_request, on cfqq
. cfq_insert, cfq_dispatch, cfq_completed_rq, ...
7 execution of a new cfq_set_request for cfqq
-> now cfqq->cfqg is again a sane pointer <-

Every function executed at step 6 sees a dangling reference for
cfqq->cfqg.

My fix for caching data doesn't solve this more serious problem.

Where have I been mistaken?

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



Re: [block] question about potential null pointer dereference

2017-05-24 Thread Paolo Valente

> Il giorno 23 mag 2017, alle ore 22:52, Gustavo A. R. Silva 
>  ha scritto:
> 
> 
> Hello everybody,
> 

Hi

> While looking into Coverity ID 1408828 I ran into the following piece of code 
> at block/bfq-wf2q.c:542:
> 
> 542 static struct rb_node *bfq_find_deepest(struct rb_node *node)
> 543 {
> 544 	struct rb_node *deepest;
> 545 
> 546 	if (!node->rb_right && !node->rb_left)
> 547 		deepest = rb_parent(node);
> 548 	else if (!node->rb_right)
> 549 		deepest = node->rb_left;
> 550 	else if (!node->rb_left)
> 551 		deepest = node->rb_right;
> 552 	else {
> 553 		deepest = rb_next(node);
> 554 		if (deepest->rb_right)
> 555 			deepest = deepest->rb_right;
> 556 		else if (rb_parent(deepest) != node)
> 557 			deepest = rb_parent(deepest);
> 558 	}
> 559 
> 560 	return deepest;
> 561 }
> 
> The issue here is that there is a potential NULL pointer dereference at line 
> 554, in case function rb_next() returns NULL.
> 

Can rb_next(node) return NULL, given that node is guaranteed to have
both a left and a right child when line 554 is reached?
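
For reference, here is, from memory (please double-check against
lib/rbtree.c), the branch of rb_next() that is taken when node has a
right child.  In the else block at line 552, node is guaranteed to have
both children, so this is the branch that runs, and it returns the
leftmost node of the right subtree, which cannot be NULL:

struct rb_node *rb_next(const struct rb_node *node)
{
	if (RB_EMPTY_NODE(node))
		return NULL;

	/*
	 * If we have a right-hand child, go down and then left as far
	 * as we can; the node returned here is never NULL.
	 */
	if (node->rb_right) {
		node = node->rb_right;
		while (node->rb_left)
			node = node->rb_left;
		return (struct rb_node *)node;
	}

	/* remaining cases (no right child) omitted */
}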

Thanks,
Paolo

> Maybe a patch like the following could be applied in order to avoid any 
> chance of a NULL pointer dereference:
> 
> index 8726ede..28d8b90 100644
> --- a/block/bfq-wf2q.c
> +++ b/block/bfq-wf2q.c
> @@ -551,6 +551,8 @@ static struct rb_node *bfq_find_deepest(struct rb_node *node)
>  		deepest = node->rb_right;
>  	else {
>  		deepest = rb_next(node);
> +		if (!deepest)
> +			return NULL;
>  		if (deepest->rb_right)
>  			deepest = deepest->rb_right;
>  		else if (rb_parent(deepest) != node)
> 
> What do you think?
> 
> I'd really appreciate any comment on this.
> 
> Thank you!
> --
> Gustavo A. R. Silva
> 
> 
> 
> 



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-24 Thread Paolo Valente

> Il giorno 24 mag 2017, alle ore 12:53, Paolo Valente 
>  ha scritto:
> 
>> 
>> Il giorno 23 mag 2017, alle ore 21:42, Tejun Heo  ha 
>> scritto:
>> 
>> Hello, Paolo.
>> 
>> On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:
>>> Consider a process or a group that is moved from a given source group
>>> to a different group, or simply removed from a group (although I
>>> didn't yet succeed in just removing a process from a group :) ).  The
>>> pointer to the [b|c]fq_group contained in the schedulable entity
>>> belonging to the source group *is not* updated, in BFQ, if the entity
>>> is idle, and *is not* updated *unconditionally* in CFQ.  The update
>>> will happen in bfq_get_rq_private or cfq_set_request, on the arrival
>>> of a new request.  But, if the move happens right after the arrival of
>>> a request, then all the scheduler functions executed until a new
>>> request arrives for that entity will see a stale [b|c]fq_group.  Much
>> 
>> Limited staleness is fine.  Especially in this case, it isn't too
>> weird to claim that the order between the two operations isn't clearly
>> defined.
>> 
> 
> ok
> 
>>> worse, if also a blkcg_deactivate_policy or a blkg_destroy are
>>> executed right after the move, then both the policy data pointed by
>>> the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
>>> So, all the functions of the scheduler invoked before next request
>>> arrival may use dangling references!
>> 
>> Hmm... but cfq_group is allocated along with blkcg and blkcg always
>> ensures that there are no blkg left before freeing the pd area in
>> blkcg_css_offline().
>> 
> 
> Exact, but even after all blkgs, as well as the cfq_group and pd, are
> gone, the children cfq_queues of the gone cfq_group continue to point
> to unexisting objects, until new cfq_set_requests are executed for
> those cfq_queues.  To try to make this statement clearer, here is the
> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
> say cfqg:
> 
> 1 cfq_set_request for a request rq of cfqq

Sorry, this first event is irrelevant for the problem to occur.  What
matters is just that some scheduler hooks are invoked *after* the
deallocation of a cfq_group, and *before* a new cfq_set_request.

Paolo

> 2 removal of (the process associated with cfqq) from cfqg
> 3 destruction of the blkg that cfqg is associated with
> 4 destruction of the blkcg the above blkg belongs to
> 5 destruction of the pd pointed to by cfqg, and of cfqg itself
> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!
> 6 execution of cfq functions, different from cfq_set_request, on cfqq
>   . cfq_insert, cfq_dispatch, cfq_completed_rq, ...
> 7 execution of a new cfq_set_request for cfqq
> -> now cfqq->cfqg is again a sane pointer <-
> 
> Every function executed at step 6 sees a dangling reference for
> cfqq->cfqg.
> 
> My fix for caching data doesn't solve this more serious problem.
> 
> Where have I been mistaken?
> 
> Thanks,
> Paolo
> 
>> Thanks.
>> 
>> -- 
>> tejun



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-24 Thread Paolo Valente

> Il giorno 24 mag 2017, alle ore 15:50, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Wed, May 24, 2017 at 12:53:26PM +0100, Paolo Valente wrote:
>> Exact, but even after all blkgs, as well as the cfq_group and pd, are
>> gone, the children cfq_queues of the gone cfq_group continue to point
>> to unexisting objects, until new cfq_set_requests are executed for
>> those cfq_queues.  To try to make this statement clearer, here is the
>> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
>> say cfqg:
>> 
>> 1 cfq_set_request for a request rq of cfqq
>> 2 removal of (the process associated with cfqq) from cfqg
>> 3 destruction of the blkg that cfqg is associated with
>> 4 destruction of the blkcg the above blkg belongs to
>> 5 destruction of the pd pointed to by cfqg, and of cfqg itself
>> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!
>> 6 execution of cfq functions, different from cfq_set_request, on cfqq
>>  . cfq_insert, cfq_dispatch, cfq_completed_rq, ...
>> 7 execution of a new cfq_set_request for cfqq
>> -> now cfqq->cfqg is again a sane pointer <-
>> 
>> Every function executed at step 6 sees a dangling reference for
>> cfqq->cfqg.
>> 
>> My fix for caching data doesn't solve this more serious problem.
>> 
>> Where have I been mistaken?
> 
> Hmmm... cfq_set_request() invokes cfqg_get() which increases refcnt on
> the blkg, which should pin everything down till the request is done,

Yes, I missed that step, sorry. Still ...

> so none of the above objects can be destroyed before the request is
> done.
> 

... the issue seems just to move to a subtler place: cfq is ok,
because it protects itself with the rq lock, but blk-mq schedulers
don't.  So, the race that leads to the (real) crashes reported by
people may actually be:
1 blkg_lookup is executed on a blkg being destroyed: the scheduler gets
a copy of the content of the blkg, but the rcu mechanism doesn't
prevent the destruction from going on
2 blkg_get is executed on the copy of the original blkg
3 subsequent scheduler operations involving that stale blkg lead to
the dangling-pointer accesses we have already discussed

Could you patiently tell me whether I'm still wrong?

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



Re: [kbuild-all] [PATCH V2 16/16] block, bfq: split bfq-iosched.c into multiple source files

2017-04-12 Thread Paolo Valente

> Il giorno 12 apr 2017, alle ore 11:24, Paolo Valente 
>  ha scritto:
> 
>> 
>> Il giorno 12 apr 2017, alle ore 10:39, Ye Xiaolong  
>> ha scritto:
>> 
>> On 04/11, Paolo Valente wrote:
>>> 
>>>> Il giorno 02 apr 2017, alle ore 12:02, kbuild test robot  
>>>> ha scritto:
>>>> 
>>>> Hi Paolo,
>>>> 
>>>> [auto build test ERROR on block/for-next]
>>>> [also build test ERROR on v4.11-rc4 next-20170331]
>>>> [if your patch is applied to the wrong git tree, please drop us a note to 
>>>> help improve the system]
>>>> 
>>> 
>>> Hi,
>>> this seems to be a false positive.  Build is correct with the tested
>>> tree and the .config.
>>> 
>> 
>> Hmm, this error is reproducible on the 0day side, and your patches were
>> applied on top of 803e16d "Merge branch 'for-4.12/block' into for-next";
>> is it the same as yours?
>> 
> 
> I have downloaded the offending tree directly from the github page.
> 
> Here are my steps in super detail.
> 
> I followed the url:
> https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
> and downloaded the tree ("Browse the repository at this point in
> history" link on the top commit, then "Download ZIP"), plus the
> .config.gz attached to the email.
> 
> Then I built with no error.
> 
> To try to help locate the mistake: the compilation of those files of
> course fails because each of the offending files does not contain the
> definition of the reported functions.  But each such definition is
> contained in one of the other files of the same module, i.e., one of
> the files listed in the following rule in block/Makefile:
> obj-$(CONFIG_IOSCHED_BFQ)   += bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
> 
> Maybe I'm making some mistake in the Makefile, or I forgot to modify
> some other configuration file?
> 
> Help! :)
> 

Ok, fortunately I've reproduced it on a different PC.  block/Makefile
was flawed, but, for unknown (to me) reasons, my system was perfectly
happy with the flaw.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> Thanks,
>> Xiaolong
>> 
>>> Thanks,
>>> Paolo
>>> 
>>>> url:
>>>> https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
>>>> base:   
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
>>>> for-next
>>>> config: i386-allmodconfig (attached as .config)
>>>> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
>>>> reproduce:
>>>>  # save the attached .config to linux build tree
>>>>  make ARCH=i386 
>>>> 
>>>> All errors (new ones prefixed by >>):
>>>> 
>>>>>> ERROR: "bfq_mark_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfqg_stats_update_dequeue" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_clear_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_clear_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] 
>>>>>> undefined!
>>>>>> ERROR: "bfq_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_clear_bfqq_wait_request" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_timeout" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfqg_stats_set_start_empty_time" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_weights_tree_add" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_put_queue" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_bfqq_sync" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfqg_to_blkg" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfqq_group" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_weights_tree_remove" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_bic_update_cgroup" [block/bfq-iosched.ko] undefined!
>>>>>> ERROR: "bfqg_stats_set_start_idle_time" [block/bfq-iosched.ko] undefined!
>>>>>> ERROR: "bfqg_stats_update_completion" [block/bfq-iosched.ko] undefined!
>>>>>> ERROR: "bfq_bfqq_move" [block/bfq-iosched.ko] undefined!
>>>>>> ERROR: "bfqg_put" [block/bfq-iosched.ko] undefined!
>>>>>> ERROR: "next_queue_may_preempt" [block/bfq-iosched.ko] undefined!
>>>> 
>>>> ---
>>>> 0-DAY kernel test infrastructure            Open Source Technology Center
>>>> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
>>>> <.config.gz>
>>> 
>>> ___
>>> kbuild-all mailing list
>>> kbuild-...@lists.01.org
>>> https://lists.01.org/mailman/listinfo/kbuild-all



Re: [PATCH V3 00/16] Introduce the BFQ I/O scheduler

2017-04-12 Thread Paolo Valente

> Il giorno 12 apr 2017, alle ore 17:30, Bart Van Assche 
>  ha scritto:
> 
> On Wed, 2017-04-12 at 08:01 +0200, Paolo Valente wrote:
>> Where is my mistake?
> 
> I think in the Makefile. How about the patch below? Please note that I'm no
> Kbuild expert.
> 

Thank you very much for finding and fixing the bug.  I was working
exactly on that, and had got to the same solution (which I guess is
the only correct one).  I'll apply these changes and resubmit.

Thanks,
Paolo

> diff --git a/block/Makefile b/block/Makefile
> index 546066ee7fa6..b3711af6b637 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -20,7 +20,8 @@ obj-$(CONFIG_IOSCHED_NOOP)+= noop-iosched.o
>  obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
>  obj-$(CONFIG_IOSCHED_CFQ)  += cfq-iosched.o
>  obj-$(CONFIG_MQ_IOSCHED_DEADLINE)  += mq-deadline.o
> -obj-$(CONFIG_IOSCHED_BFQ)  += bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
> +bfq-y  := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
> +obj-$(CONFIG_IOSCHED_BFQ)  += bfq.o
>obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
>  obj-$(CONFIG_BLK_CMDLINE_PARSER)   += cmdline-parser.o
> 



[PATCH V4 00/16] Introduce the BFQ I/O scheduler

2017-04-12 Thread Paolo Valente
Hi,
new patch series, addressing (both) issues raised by Bart [1], and
with block/Makefile fixed as suggested by Bart [2].

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/3/31/393
[2] https://lkml.org/lkml/2017/4/12/502

Arianna Avanzini (4):
  block, bfq: add full hierarchical scheduling and cgroups support
  block, bfq: add Early Queue Merge (EQM)
  block, bfq: reduce idling only in symmetric scenarios
  block, bfq: handle bursts of queue activations

Paolo Valente (12):
  block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
  block, bfq: improve throughput boosting
  block, bfq: modify the peak-rate estimator
  block, bfq: add more fairness with writes and slow processes
  block, bfq: improve responsiveness
  block, bfq: reduce I/O latency for soft real-time applications
  block, bfq: preserve a low latency also with NCQ-capable drives
  block, bfq: reduce latency during request-pool saturation
  block, bfq: boost the throughput on NCQ-capable flash-based devices
  block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
  block, bfq: remove all get and put of I/O contexts
  block, bfq: split bfq-iosched.c into multiple source files

 Documentation/block/00-INDEX|2 +
 Documentation/block/bfq-iosched.txt |  531 
 block/Kconfig.iosched   |   21 +
 block/Makefile  |2 +
 block/bfq-cgroup.c  | 1139 
 block/bfq-iosched.c | 5047 +++
 block/bfq-iosched.h |  942 +++
 block/bfq-wf2q.c| 1616 +++
 include/linux/blkdev.h  |2 +-
 9 files changed, 9301 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/block/bfq-iosched.txt
 create mode 100644 block/bfq-cgroup.c
 create mode 100644 block/bfq-iosched.c
 create mode 100644 block/bfq-iosched.h
 create mode 100644 block/bfq-wf2q.c

--
2.10.0


[PATCH V4 03/16] block, bfq: improve throughput boosting

2017-04-12 Thread Paolo Valente
The feedback-loop algorithm used by BFQ to compute queue (process)
budgets is basically a set of three update rules, one for each of the
main reasons why a queue may be expired. If many processes suddenly
switch from sporadic I/O to greedy and sequential I/O, then these
rules are quite slow to assign large budgets to these processes, and
hence to achieve a high throughput. On the opposite side, BFQ assigns
the maximum possible budget B_max to a just-created queue. This allows
a high throughput to be achieved immediately if the associated process
is I/O-bound and performs sequential I/O from the beginning. But it
also increases the worst-case latency experienced by the first
requests issued by the process, because the larger the budget of a
queue waiting for service is, the later the queue will be served by
B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or
soft real-time application.

To tackle these throughput and latency problems, on one hand this
patch changes the initial budget value to B_max/2. On the other hand,
it re-tunes the three rules, adopting a more aggressive,
multiplicative increase/linear decrease scheme. This scheme trades
latency for throughput more than before, and tends to assign large
budgets quickly to processes that are or become I/O-bound. For two of
the expiration reasons, the new version of the rules also contains
a few further small improvements, briefly described below.

*No more backlog.* In this case, the budget was larger than the number
of sectors actually read/written by the process before it stopped
doing I/O. Hence, to reduce latency for the possible future I/O
requests of the process, the old rule simply set the next budget to
the number of sectors actually consumed by the process. However, if
there are still outstanding requests, then the process may not have
issued its next request yet, simply because it is still waiting for the
completion of some of those outstanding requests. If this sub-case
holds true, then the new rule, instead of decreasing the budget,
doubles it, proactively, in the hope that: 1) a larger budget will fit
the actual needs of the process, and 2) the process is sequential and
hence a higher throughput will be achieved by serving the process
longer after granting it access to the device.

*Budget timeout*. The original rule set the new budget to the maximum
value B_max, to maximize throughput and let all processes experiencing
budget timeouts receive the same share of the device time. In our
experiments we verified that this sudden jump to B_max did not provide
sensible benefits; rather it increased the latency of processes
performing sporadic and short I/O. The new rule only doubles the
budget.
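
For the record, the resulting update rules can be summarized with a
sketch like the following (purely illustrative: names, factors and the
set of handled cases differ in the actual code):

enum toy_expiration_reason {
	TOY_NO_MORE_BACKLOG,	/* queue emptied before exhausting its budget */
	TOY_BUDGET_TIMEOUT,	/* queue held the device for too long */
	TOY_BUDGET_EXHAUSTED,	/* queue consumed all of its budget */
};

/* Multiplicative-increase/linear-decrease feedback: compute the next
 * budget of a queue, given the reason why it is being expired. */
static unsigned int toy_next_budget(unsigned int cur_budget,
				    unsigned int sectors_served,
				    bool has_outstanding_reqs,
				    enum toy_expiration_reason reason,
				    unsigned int max_budget)
{
	unsigned int budget = cur_budget;

	switch (reason) {
	case TOY_NO_MORE_BACKLOG:
		if (has_outstanding_reqs)
			budget = 2 * cur_budget; /* proactive doubling */
		else
			budget = sectors_served; /* shrink to actual needs */
		break;
	case TOY_BUDGET_TIMEOUT:
	case TOY_BUDGET_EXHAUSTED:
		budget = 2 * cur_budget; /* grow quickly, no jump to max */
		break;
	}

	return budget < max_budget ? budget : max_budget;
}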

[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 87 +
 1 file changed, 41 insertions(+), 46 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index af1740a..1edac72 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -752,9 +752,6 @@ static struct kmem_cache *bfq_pool;
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 32/8)
 
-/* Budget feedback step. */
-#define BFQ_BUDGET_STEP 128
-
 /* Min samples used for peak rate estimation (for autotuning). */
 #define BFQ_PEAK_RATE_SAMPLES  32
 
@@ -4074,40 +4071,6 @@ static struct bfq_queue *bfq_set_in_service_queue(struct 
bfq_data *bfqd)
return bfqq;
 }
 
-/*
- * bfq_default_budget - return the default budget for @bfqq on @bfqd.
- * @bfqd: the device descriptor.
- * @bfqq: the queue to consider.
- *
- * We use 3/4 of the @bfqd maximum budget as the default value
- * for the max_budget field of the queues.  This lets the feedback
- * mechanism to start from some middle ground, then the behavior
- * of the process will drive the heuristics towards high values, if
- * it behaves as a greedy sequential reader, or towards small values
- * if it shows a more intermittent behavior.
- */
-static unsigned long bfq_default_budget(struct bfq_data *bfqd,
-   struct bfq_queue *bfqq)
-{
-   unsigned long budget;
-
-   /*
-* When we need an estimate of the peak rate we need to avoid
-* to give budgets that are too short due to previous
-* measurements.  So, in the first 10 assignments use a
-* ``safe'' budget value. For such first assignment the value
-* of bfqd->budgets_assigned happens to be lower than 

[PATCH V4 08/16] block, bfq: preserve a low latency also with NCQ-capable drives

2017-04-12 Thread Paolo Valente
I/O schedulers typically allow NCQ-capable drives to prefetch I/O
requests, as NCQ boosts the throughput exactly by prefetching and
internally reordering requests.

Unfortunately, as discussed in detail and shown experimentally in [1],
this may cause fairness and latency guarantees to be violated. The
main problem is that the internal scheduler of an NCQ-capable drive
may postpone the service of some unlucky (prefetched) requests as long
as it deems serving other requests more appropriate to boost the
throughput.

This patch addresses this issue by not disabling device idling for
weight-raised queues, even if the device supports NCQ. This allows BFQ
to start serving a new queue, and therefore allows the drive to
prefetch new requests, only after the idling timeout expires. At that
time, all the outstanding requests of the expired queue have been most
certainly served.

[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7f94ad3..574a5f6 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6233,7 +6233,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 
	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
bfqd->bfq_slice_idle == 0 ||
-   (bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+   (bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
+   bfqq->wr_coeff == 1))
enable_idle = 0;
else if (bfq_sample_valid(bfqq->ttime.ttime_samples)) {
if (bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle &&
-- 
2.10.0



[PATCH V4 15/16] block, bfq: remove all get and put of I/O contexts

2017-04-12 Thread Paolo Valente
When a bfq queue is set in service and when it is merged, a reference
to the I/O context associated with the queue is taken. This reference
is then released when the queue is deselected from service or
split. More precisely, the release of the reference is postponed to
when the scheduler lock is released, to avoid nesting between the
scheduler and the I/O-context lock. In fact, such nesting would lead
to deadlocks, because of other code paths that take the same locks in
the opposite order. This postponing of I/O-context releases does
complicate code.

This commit addresses these issue by modifying involved operations in
such a way to not need to get the above I/O-context references any
more. Then it also removes any get and release of these references.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 143 +---
 1 file changed, 23 insertions(+), 120 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b7e3c86..30bb8f9 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -538,8 +538,6 @@ struct bfq_data {
 
/* bfq_queue in service */
struct bfq_queue *in_service_queue;
-   /* bfq_io_cq (bic) associated with the @in_service_queue */
-   struct bfq_io_cq *in_service_bic;
 
/* on-disk position of the last served request */
sector_t last_position;
@@ -704,15 +702,6 @@ struct bfq_data {
struct bfq_io_cq *bio_bic;
/* bfqq associated with the task issuing current bio for merging */
struct bfq_queue *bio_bfqq;
-
-   /*
-* io context to put right after bfqd->lock is released. This
-* filed is used to perform put_io_context, when needed, to
-* after the scheduler lock has been released, and thus
-* prevent an ioc->lock from being possibly taken while the
-* scheduler lock is being held.
-*/
-   struct io_context *ioc_to_put;
 };
 
 enum bfqq_state_flags {
@@ -1148,34 +1137,6 @@ static void bfq_schedule_dispatch(struct bfq_data *bfqd)
}
 }
 
-/*
- * Next two functions release bfqd->lock and put the io context
- * pointed by bfqd->ioc_to_put. This delayed put is used to not risk
- * to take an ioc->lock while the scheduler lock is being held.
- */
-static void bfq_unlock_put_ioc(struct bfq_data *bfqd)
-{
-   struct io_context *ioc_to_put = bfqd->ioc_to_put;
-
-   bfqd->ioc_to_put = NULL;
-   spin_unlock_irq(&bfqd->lock);
-
-   if (ioc_to_put)
-   put_io_context(ioc_to_put);
-}
-
-static void bfq_unlock_put_ioc_restore(struct bfq_data *bfqd,
-  unsigned long flags)
-{
-   struct io_context *ioc_to_put = bfqd->ioc_to_put;
-
-   bfqd->ioc_to_put = NULL;
-   spin_unlock_irqrestore(&bfqd->lock, flags);
-
-   if (ioc_to_put)
-   put_io_context(ioc_to_put);
-}
-
 /**
  * bfq_gt - compare two timestamps.
  * @a: first ts.
@@ -2684,18 +2645,6 @@ static void __bfq_bfqd_reset_in_service(struct bfq_data 
*bfqd)
	struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
struct bfq_entity *entity = in_serv_entity;
 
-   if (bfqd->in_service_bic) {
-   /*
-* Schedule the release of a reference to
-* bfqd->in_service_bic->icq.ioc to right after the
-* scheduler lock is released. This ioc is not
-* released immediately, to not risk to possibly take
-* an ioc->lock while holding the scheduler lock.
-*/
-   bfqd->ioc_to_put = bfqd->in_service_bic->icq.ioc;
-   bfqd->in_service_bic = NULL;
-   }
-
bfq_clear_bfqq_wait_request(in_serv_bfqq);
	hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
bfqd->in_service_queue = NULL;
@@ -3495,7 +3444,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
__bfq_deactivate_entity(entity, false);
bfq_put_async_queues(bfqd, bfqg);
 
-   bfq_unlock_put_ioc_restore(bfqd, flags);
+   spin_unlock_irqrestore(&bfqd->lock, flags);
/*
 * @blkg is going offline and will be ignored by
 * blkg_[rw]stat_recursive_sum().  Transfer stats to the parent so
@@ -5472,20 +5421,18 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct 
bfq_queue *new_bfqq)
 * first time that the requests of some process are redirected to
 * it.
 *
-* We redirect bfqq to new_bfqq and not the opposite, because we
-* are in the context of the process owning bfqq, hence we have
-* the io_cq of this process. So we can immediately configure this
-* io_cq to redirect the requests of the process to new_bfqq.
+* We redirect bfqq to new_bfqq and not the opposite, because
+* we are in the context of the process owning bfqq, thus we
+* have the io_cq of this process. So we can immediately
+* co

[PATCH V4 13/16] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs

2017-04-12 Thread Paolo Valente
This patch is basically the counterpart, for NCQ-capable rotational
devices, of the previous patch. Exactly as the previous patch does on
flash-based devices and for any workload, this patch disables device
idling on rotational devices, but only for random I/O. In fact, only
with these queues disabling idling boosts the throughput on
NCQ-capable rotational devices. To not break service guarantees,
idling is disabled for NCQ-enabled rotational devices only when the
same symmetry conditions considered in the previous patches hold.

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2081784..549f030 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6439,20 +6439,15 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 * The next variable takes into account the cases where idling
 * boosts the throughput.
 *
-* The value of the variable is computed considering that
-* idling is usually beneficial for the throughput if:
+* The value of the variable is computed considering, first, that
+* idling is virtually always beneficial for the throughput if:
 * (a) the device is not NCQ-capable, or
 * (b) regardless of the presence of NCQ, the device is rotational
-* and the request pattern for bfqq is I/O-bound (possible
-* throughput losses caused by granting idling to seeky queues
-* are mitigated by the fact that, in all scenarios where
-* boosting throughput is the best thing to do, i.e., in all
-* symmetric scenarios, only a minimal idle time is allowed to
-* seeky queues).
+* and the request pattern for bfqq is I/O-bound and sequential.
 *
 * Secondly, and in contrast to the above item (b), idling an
 * NCQ-capable flash-based device would not boost the
-* throughput even with intense I/O; rather it would lower
+* throughput even with sequential I/O; rather it would lower
 * the throughput in proportion to how fast the device
 * is. Accordingly, the next variable is true if any of the
 * above conditions (a) and (b) is true, and, in particular,
@@ -6460,7 +6455,8 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 * device.
 */
idling_boosts_thr = !bfqd->hw_tag ||
-   (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq));
+   (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) &&
+bfq_bfqq_idle_window(bfqq));
 
/*
 * The value of the next variable,
-- 
2.10.0



[PATCH V4 06/16] block, bfq: improve responsiveness

2017-04-12 Thread Paolo Valente
This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this purpose, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment how
BFQ decides whether the associated application is interactive),
receive the following two special treatments:

1) The weight of the queue is raised.

2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).

For brevity, we call just weight-raising the combination of these
two preferential treatments. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large-size application on that device, with cold caches and with no
additional workload.

Finally, as for guaranteeing a fast execution to interactive,
I/O-related tasks (such as opening a file), consider that any
interactive application blocks and waits for user input both after
starting up and after executing some task. After a while, the user may
trigger new operations, after which the application stops again, and
so on. Accordingly, the low-latency heuristic weight-raises again a
queue in case it becomes backlogged after being idle for a
sufficiently long (configurable) time. The weight-raising then lasts
for the same time as for a just-created queue.
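
The decision described above can be summarized with a toy snippet like
the following (illustrative names and logic only, not the actual code):

/* Decide whether a queue that has just become backlogged should be
 * (re)weight-raised for interactivity. */
static bool toy_should_weight_raise(bool low_latency_enabled,
				    bool queue_is_new,
				    unsigned long idle_time_jiffies,
				    unsigned long min_idle_for_wr)
{
	if (!low_latency_enabled)
		return false;
	/* a just-created queue is always weight-raised */
	if (queue_is_new)
		return true;
	/* an existing queue is raised again only if it has been idle
	 * long enough to look like an interactive application waiting
	 * for user input */
	return idle_time_jiffies >= min_idle_for_wr;
}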

According to our experiments, the combination of this low-latency
heuristic and of the improvements described in the previous patch
allows BFQ to guarantee a high application responsiveness.

[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 Documentation/block/bfq-iosched.txt |   9 +
 block/bfq-iosched.c | 740 
 2 files changed, 675 insertions(+), 74 deletions(-)

diff --git a/Documentation/block/bfq-iosched.txt 
b/Documentation/block/bfq-iosched.txt
index 461b27f..1b87df6 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -375,6 +375,11 @@ default, low latency mode is enabled. If enabled, 
interactive and soft
 real-time applications are privileged and experience a lower latency,
 as explained in more detail in the description of how BFQ works.
 
+DO NOT enable this mode if you need full control on bandwidth
+distribution. In fact, if it is enabled, then BFQ automatically
+increases the bandwidth share of privileged applications, as the main
+means to guarantee a lower latency to them.
+
 timeout_sync
 
 
@@ -507,6 +512,10 @@ linear mapping between ioprio and weights, described at 
the beginning
 of the tunable section, is still valid, but all weights higher than
 IOPRIO_BE_NR*10 are mapped to ioprio 0.
 
+Recall that, if low-latency is set, then BFQ automatically raises the
+weight of the queues associated with interactive and soft real-time
+applications. Unset this tunable if you need/want to control weights.
+
 
 [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
 Scheduler", Proceedings of the First Workshop on Mobile System
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index dce273b..1a32c83 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -339,6 +339,17 @@ struct bfq_queue {
 
/* pid of the process owning the queue, used for logging purposes */
pid_t pid;
+
+   /* current maximum weight-raising time for this queue */
+   unsigned long wr_cur_max_time;
+   /*
+* Start time of the current weight-raising period if
+* the @bfq-queue is being weight-raised, otherwise
+* finish time of the last weight-raising period.
+*/
+   unsigned long last_wr_start_finish;
+   /* factor by which the weight of this queue is multiplied */
+   unsigned int wr_coeff;
 };
 
 /**
@@ -356,6 +367,11 @@ struct bfq_io_cq {
 #endif
 };
 
+enum bfq_device_speed {
+   BFQ_BFQD_FAST,
+   BFQ_BFQD_SLOW,
+};
+
 /**
  * struct bfq_data - per-device data structure.
  *
@@ -487,6 +503,34 @@ struct bfq_data {
 */
bool strict_guarantees;
 
+   /* if set to true, low-latency heuristics are enabled */
+   bool low_latency;
+   /*
+* Maximum factor by which the weight of a weight-raised queue
+* is multiplied.
+*/
+   unsigned int bfq_wr_coeff;
+   /* maximum duration of a weight-raising period (jiffies) */

[PATCH V4 12/16] block, bfq: boost the throughput on NCQ-capable flash-based devices

2017-04-12 Thread Paolo Valente
This patch boosts the throughput on NCQ-capable flash-based devices,
while still preserving latency guarantees for interactive and soft
real-time applications. The throughput is boosted by just not idling
the device when the in-service queue remains empty, even if the queue
is sync and has a non-null idle window. This helps to keep the drive's
internal queue full, which is necessary to achieve maximum
performance. This solution to boost the throughput is a port of
commits a68bbdd and f7d7b7a for CFQ.

As already highlighted in a previous patch, allowing the device to
prefetch and internally reorder requests trivially causes loss of
control on the request service order, and hence on service guarantees.
Fortunately, as discussed in detail in the comments on the function
bfq_bfqq_may_idle(), if every process has to receive the same
fraction of the throughput, then the service order enforced by the
internal scheduler of a flash-based device is relatively close to that
enforced by BFQ. In particular, it is close enough to let service
guarantees be substantially preserved.

Things change in an asymmetric scenario, i.e., if not every process
has to receive the same fraction of the throughput. In this case, to
guarantee the desired throughput distribution, the device must be
prevented from prefetching requests. This is exactly what this patch
does in asymmetric scenarios.

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 154 
 1 file changed, 106 insertions(+), 48 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b97801f..2081784 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6442,15 +6442,25 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 * The value of the variable is computed considering that
 * idling is usually beneficial for the throughput if:
 * (a) the device is not NCQ-capable, or
-* (b) regardless of the presence of NCQ, the request pattern
-* for bfqq is I/O-bound (possible throughput losses
-* caused by granting idling to seeky queues are mitigated
-* by the fact that, in all scenarios where boosting
-* throughput is the best thing to do, i.e., in all
-* symmetric scenarios, only a minimal idle time is
-* allowed to seeky queues).
+* (b) regardless of the presence of NCQ, the device is rotational
+* and the request pattern for bfqq is I/O-bound (possible
+* throughput losses caused by granting idling to seeky queues
+* are mitigated by the fact that, in all scenarios where
+* boosting throughput is the best thing to do, i.e., in all
+* symmetric scenarios, only a minimal idle time is allowed to
+* seeky queues).
+*
+* Secondly, and in contrast to the above item (b), idling an
+* NCQ-capable flash-based device would not boost the
+* throughput even with intense I/O; rather it would lower
+* the throughput in proportion to how fast the device
+* is. Accordingly, the next variable is true if any of the
+* above conditions (a) and (b) is true, and, in particular,
+* happens to be false if bfqd is an NCQ-capable flash-based
+* device.
 */
-   idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
+   idling_boosts_thr = !bfqd->hw_tag ||
+   (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq));
 
/*
 * The value of the next variable,
@@ -6491,14 +6501,16 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
bfqd->wr_busy_queues == 0;
 
/*
-* There is then a case where idling must be performed not for
-* throughput concerns, but to preserve service guarantees. To
-* introduce it, we can note that allowing the drive to
-* enqueue more than one request at a time, and hence
+* There is then a case where idling must be performed not
+* for throughput concerns, but to preserve service
+* guarantees.
+*
+* To introduce this case, we can note that allowing the drive
+* to enqueue more than one request at a time, and hence
 * delegating de facto final scheduling decisions to the
-* drive's internal scheduler, causes loss of control on the
+* drive's internal scheduler, entails loss of control on the
 * actual request service order. In particular, the critical
-* situation is when requests from different processes happens
+* situation is when requests from different processes happen
 * to be present, at the same time, in the internal queue(s)
 * of the drive. In such a situation, the drive, by deciding
 * the service order of the internally-queued requests, does
@@ -6509,5

[PATCH V4 07/16] block, bfq: reduce I/O latency for soft real-time applications

2017-04-12 Thread Paolo Valente
To guarantee a low latency also to the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in the previous patch)
also the queues associated with applications deemed as soft real-time.

To be deemed as soft real-time, an application must meet two
requirements.  First, the application must not require an average
bandwidth higher than the approximate bandwidth required to playback
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.

As for the second requirement, it is critical to require also that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This prevents also greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following.  First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement.  Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.

To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* their requests as quickly as they can,
whereas soft real-time applications spend some time processing data
after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time, thereby giving to the application the opportunity to be
deemed as such, only when both the following two conditions happen to
hold: 1) the queue associated with the application has expired and is
empty, 2) there is no outstanding request of the application.

Suppose that both conditions hold at time, say, t_c and that the
application issues its next request at time, say, t_i. At time t_c the
heuristic computes the next time instant, called soft_rt_next_start in
the code, such that, only if t_i >= soft_rt_next_start, then both the
next conditions will hold when the application issues its next
request: 1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy applications).
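
In code, the computation performed at time t_c looks roughly like the
following (illustrative sketch; names and the exact formula differ in
the real patch):

/* Earliest time instant at which the arrival of a new request still
 * allows the queue to be deemed soft real-time:
 * - the first term enforces the bandwidth requirement (service
 *   received since the queue became backlogged, divided by the maximum
 *   rate tolerated for soft real-time applications);
 * - the second term enforces the minimum pause Delta after t_c. */
static unsigned long toy_softrt_next_start(unsigned long t_c_jiffies,
					   unsigned long last_bklogged_jiffies,
					   unsigned long service_sectors,
					   unsigned long max_softrt_sectors_per_sec,
					   unsigned long delta_jiffies)
{
	unsigned long bw_term = last_bklogged_jiffies +
		HZ * service_sectors / max_softrt_sectors_per_sec;

	return max(bw_term, t_c_jiffies + delta_jiffies);
}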

The current value of Delta is a little bit higher than the value that
we have found, experimentally, to be adequate on a real,
general-purpose machine. In particular we had to increase Delta to
make the filter quite precise also in slower, embedded systems, and in
KVM/QEMU virtual machines (details in the comments on the code).

If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 342 +---
 1 file changed, 323 insertions(+), 19 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1a32c83..7f94ad3 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -119,6 +119,13 @@
 #define BFQ_DEFAULT_GRP_IOPRIO 0
 #define BFQ_DEFAULT_GRP_CLASS  IOPRIO_CLASS_BE
 
+/*
+ * Soft real-time applications are extremely more latency sensitive
+ * than interactive ones. Over-raise the weight of the former to
+ * privilege them against the latter.
+ */
+#define BFQ_SOFTRT_WEIGHT_FACTOR   100
+
 struct bfq_entity;
 
 /**
@@ -343,6 +350,14 @@ struct bfq_queue {
/* current maximum weight-raising time for this queue */
unsigned long wr_cur_max_time;
/*
+* Minimum time instant such that, only if a new request is
+* enqueued after this time instant in an idle @bfq_queue with
+* no outstanding requests, then the task associated with the
+* queue it is deemed as soft real-time (see the comme

[PATCH V4 11/16] block, bfq: reduce idling only in symmetric scenarios

2017-04-12 Thread Paolo Valente
From: Arianna Avanzini 

A seeky queue (i.e., a queue containing random requests) is assigned a
very small device-idling slice, for throughput issues. Unfortunately,
given the process associated with a seeky queue, this behavior causes
the following problem: if the process, say P, performs sync I/O and
has a higher weight than some other processes doing I/O and associated
with non-seeky queues, then BFQ may fail to guarantee to P its
reserved share of the throughput. The reason is that idling is key
for providing service guarantees to processes doing sync I/O [1].

This commit addresses this issue by allowing the device-idling slice
to be reduced for a seeky queue only if the scenario happens to be
symmetric, i.e., if all the queues are to receive the same share of
the throughput.

[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
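
For illustration, the symmetry check enabled by the weight counters
introduced below could be sketched as follows. The function names are
made up for this sketch (the actual patch also accounts for groups with
more than one active entity), while the two rbtree fields are the ones
added to bfq_data by this patch.

/*
 * Sketch: the scenario is symmetric if each weights tree contains at
 * most one distinct weight, i.e., its root node has no children.
 */
static bool weights_differentiated(struct rb_root *weights_tree)
{
	struct rb_node *root = weights_tree->rb_node;

	/* two or more distinct weights => at least two nodes in the tree */
	return root && (root->rb_left || root->rb_right);
}

static bool scenario_symmetric(struct bfq_data *bfqd)
{
	return !weights_differentiated(&bfqd->queue_weights_tree) &&
	       !weights_differentiated(&bfqd->group_weights_tree);
}

The device-idling slice is then reduced for a seeky queue only when the
scenario is detected as symmetric.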

Signed-off-by: Arianna Avanzini 
Signed-off-by: Riccardo Pizzetti 
Signed-off-by: Samuele Zecchini 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 287 ++--
 1 file changed, 280 insertions(+), 7 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6e7388a..b97801f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -183,6 +183,20 @@ struct bfq_sched_data {
 };
 
 /**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ * with a given weight.
+ */
+struct bfq_weight_counter {
+   unsigned int weight; /* weight of the entities this counter refers to */
+   unsigned int num_active; /* nr of active entities with this weight */
+   /*
+* Weights tree member (see bfq_data's @queue_weights_tree and
+* @group_weights_tree)
+*/
+   struct rb_node weights_node;
+};
+
+/**
  * struct bfq_entity - schedulable entity.
  *
  * A bfq_entity is used to represent either a bfq_queue (leaf node in the
@@ -212,6 +226,8 @@ struct bfq_sched_data {
 struct bfq_entity {
/* service_tree member */
struct rb_node rb_node;
+   /* pointer to the weight counter associated with this entity */
+   struct bfq_weight_counter *weight_counter;
 
/*
 * Flag, true if the entity is on a tree (either the active or
@@ -456,6 +472,25 @@ struct bfq_data {
struct bfq_group *root_group;
 
/*
+* rbtree of weight counters of @bfq_queues, sorted by
+* weight. Used to keep track of whether all @bfq_queues have
+* the same weight. The tree contains one counter for each
+* distinct weight associated to some active and not
+* weight-raised @bfq_queue (see the comments to the functions
+* bfq_weights_tree_[add|remove] for further details).
+*/
+   struct rb_root queue_weights_tree;
+   /*
+* rbtree of non-queue @bfq_entity weight counters, sorted by
+* weight. Used to keep track of whether all @bfq_groups have
+* the same weight. The tree contains one counter for each
+* distinct weight associated to some active @bfq_group (see
+* the comments to the functions bfq_weights_tree_[add|remove]
+* for further details).
+*/
+   struct rb_root group_weights_tree;
+
+   /*
 * Number of bfq_queues containing requests (including the
 * queue in service, even if it is idling).
 */
@@ -791,6 +826,11 @@ struct bfq_group_data {
  * to avoid too many special cases during group creation/
  * migration.
  * @stats: stats for this bfqg.
+ * @active_entities: number of active entities belonging to the group;
+ *   unused for the root group. Used to know whether there
+ *   are groups with more than one active @bfq_entity
+ *   (see the comments to the function
+ *   bfq_bfqq_may_idle()).
  * @rq_pos_tree: rbtree sorted by next_request position, used when
  *   determining if two or more queues have interleaving
  *   requests (see bfq_find_close_cooperator()).
@@ -818,6 +858,8 @@ struct bfq_group {
 
struct bfq_entity *my_entity;
 
+   int active_entities;
+
struct rb_root rq_pos_tree;
 
struct bfqg_stats stats;
@@ -1254,12 +1296,27 @@ static bool bfq_update_parent_budget(struct bfq_entity 
*next_in_service)
  * a candidate for next service (i.e., a candidate entity to serve
  * after the in-service entity is expired). The function then returns
  * true.
+ *
+ * In contrast, the entity could still be a candidate for next service
+ * if it is not a queue, and has more than one child. In fact, even if
+ * one of its children is about to be set in service, other children
+ * may still be the next to serve. As a consequence, a non-queue

[PATCH V4 14/16] block, bfq: handle bursts of queue activations

2017-04-12 Thread Paolo Valente
From: Arianna Avanzini 

Many popular I/O-intensive services or applications spawn or
reactivate many parallel threads/processes during short time
intervals. Examples are systemd during boot or git grep.  These
services or applications benefit mostly from a high throughput: the
quicker the I/O generated by their processes is cumulatively served,
the sooner the target job of these services or applications gets
completed. As a consequence, it is almost always counterproductive to
weight-raise any of the queues associated with the processes of these
services or applications: in most cases it would just lower the
throughput, mainly because weight-raising also implies device idling.

To address this issue, an I/O scheduler needs, first, to detect which
queues are associated with these services or applications. In this
respect, we have that, from the I/O-scheduler standpoint, these
services or applications cause bursts of activations, i.e.,
activations of different queues occurring shortly after each
other. However, a shorter burst of activations may also be caused by
the start of an application that does not consist of many parallel
I/O-bound threads (see the comments on the function bfq_handle_burst
for details).

In view of these facts, this commit introduces:
1) a heuristic to detect (only) bursts of queue activations caused by
   services or applications consisting of many parallel I/O-bound
   threads (a simplified sketch follows below);
2) the prevention of device idling and weight-raising for the queues
   belonging to these bursts.
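
The following is a simplified sketch of the burst bookkeeping, only
meant to illustrate the heuristic. The fields are those added to
bfq_data by this patch, whereas the burst-list handling and the
flag-setting helpers are abridged; the real bfq_handle_burst() also
checks that the activated queues share a common parent entity.

/*
 * Sketch, invoked when bfqq is activated (becomes busy).
 */
static void handle_activation_burst(struct bfq_data *bfqd,
				    struct bfq_queue *bfqq)
{
	if (time_is_before_jiffies(bfqd->last_ins_in_burst +
				   bfqd->bfq_burst_interval)) {
		/* too long since the last activation: start a new burst */
		bfqd->burst_size = 1;
		bfqd->large_burst = false;
	} else if (++bfqd->burst_size > bfqd->bfq_large_burst_thresh) {
		/*
		 * Many queues activated shortly after each other: deem
		 * the burst large, and disable weight-raising and idling
		 * for its queues (flag-setting helper abridged here).
		 */
		bfqd->large_burst = true;
	}

	bfqd->last_ins_in_burst = jiffies;
}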

Signed-off-by: Arianna Avanzini 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 404 ++--
 1 file changed, 389 insertions(+), 15 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 549f030..b7e3c86 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -360,6 +360,10 @@ struct bfq_queue {
 
/* bit vector: a 1 for each seeky requests in history */
u32 seek_history;
+
+   /* node for the device's burst list */
+   struct hlist_node burst_list_node;
+
/* position of the last request enqueued */
sector_t last_request_pos;
 
@@ -443,6 +447,17 @@ struct bfq_io_cq {
bool saved_IO_bound;
 
/*
+* Same purpose as the previous fields for the value of the
+* field keeping the queue's belonging to a large burst
+*/
+   bool saved_in_large_burst;
+   /*
+* True if the queue belonged to a burst list before its merge
+* with another cooperating queue.
+*/
+   bool was_in_burst_list;
+
+   /*
 * Similar to previous fields: save wr information.
 */
unsigned long saved_wr_coeff;
@@ -609,6 +624,36 @@ struct bfq_data {
 */
bool strict_guarantees;
 
+   /*
+* Last time at which a queue entered the current burst of
+* queues being activated shortly after each other; for more
+* details about this and the following parameters related to
+* a burst of activations, see the comments on the function
+* bfq_handle_burst.
+*/
+   unsigned long last_ins_in_burst;
+   /*
+* Reference time interval used to decide whether a queue has
+* been activated shortly after @last_ins_in_burst.
+*/
+   unsigned long bfq_burst_interval;
+   /* number of queues in the current burst of queue activations */
+   int burst_size;
+
+   /* common parent entity for the queues in the burst */
+   struct bfq_entity *burst_parent_entity;
+   /* Maximum burst size above which the current queue-activation
+* burst is deemed as 'large'.
+*/
+   unsigned long bfq_large_burst_thresh;
+   /* true if a large queue-activation burst is in progress */
+   bool large_burst;
+   /*
+* Head of the burst list (as for the above fields, more
+* details in the comments on the function bfq_handle_burst).
+*/
+   struct hlist_head burst_list;
+
/* if set to true, low-latency heuristics are enabled */
bool low_latency;
/*
@@ -671,7 +716,8 @@ struct bfq_data {
 };
 
 enum bfqq_state_flags {
-   BFQQF_busy = 0, /* has requests or is in service */
+   BFQQF_just_created = 0, /* queue just allocated */
+   BFQQF_busy, /* has requests or is in service */
BFQQF_wait_request, /* waiting for a request */
BFQQF_non_blocking_wait_rq, /*
 * waiting for a request
@@ -685,6 +731,10 @@ enum bfqq_state_flags {
 * having consumed at most 2/10 of
 * its budget
 */
+   BFQQF_in_large_burst,   /*
+* bfqq activated in a large burst,
+* see comments to bfq_handle_burst

[PATCH V4 10/16] block, bfq: add Early Queue Merge (EQM)

2017-04-12 Thread Paolo Valente
From: Arianna Avanzini 

A set of processes may happen to perform interleaved reads, i.e.,
read requests whose union would give rise to a sequential read pattern.
There are two typical cases: first, processes reading fixed-size chunks
of data at a fixed distance from each other; second, processes reading
variable-size chunks at variable distances. The latter case occurs for
example with QEMU, which splits the I/O generated by a guest into
multiple chunks, and lets these chunks be served by a pool of I/O
threads, iteratively assigning the next chunk of I/O to the first
available thread. CFQ denotes as 'cooperating' a set of processes that
are doing interleaved I/O, and when it detects cooperating processes,
it merges their queues to obtain a sequential I/O pattern from the union
of their I/O requests, and hence boost the throughput.

Unfortunately, in the following frequent case, the mechanism
implemented in CFQ for detecting cooperating processes and merging
their queues is not responsive enough to handle also the fluctuating
I/O pattern of the second type of processes. Suppose that one process
of the second type issues a request close to the next request to serve
of another process of the same type. At that time the two processes
would be considered as cooperating. But, if the request issued by the
first process is to be merged with some other already-queued request,
then, from the moment at which this request arrives, to the moment
when CFQ checks whether the two processes are cooperating, the two
processes are likely to be already doing I/O in distant zones of the
disk surface or device memory.

CFQ, however, uses preemption to get a sequential read pattern out of
the read requests performed by the second type of processes too.  As a
consequence, CFQ uses two different mechanisms to achieve the same
goal: boosting the throughput with interleaved I/O.

This patch introduces Early Queue Merge (EQM), a unified mechanism to
get a sequential read pattern with both types of processes. The main
idea is to immediately check whether a newly-arrived request lets some
pair of processes become cooperating, both in the case of actual
request insertion and, to be responsive with the second type of
processes, in the case of request merge. Both types of processes are
then handled by just merging their queues.
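
As an illustration of the closeness test on which the merge decision
rests, here is a minimal sketch; the helper name is made up, while the
8*1024-sector window matches the BFQQ_CLOSE_THR threshold used
elsewhere in this series.

/*
 * Sketch: two queues are candidates for merging when the position of
 * the newly-arrived request and the head-request position of the other
 * queue fall within a small window (EQM then sets bfqq->new_bfqq and
 * redirects the I/O of both processes to the shared queue).
 */
static bool rqs_are_close(sector_t new_rq_pos, sector_t other_head_pos)
{
	sector_t dist = new_rq_pos > other_head_pos ?
			new_rq_pos - other_head_pos :
			other_head_pos - new_rq_pos;

	return dist <= (sector_t)(8 * 1024); /* ~4 MiB with 512B sectors */
}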

Signed-off-by: Arianna Avanzini 
Signed-off-by: Mauro Andreolini 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 881 +---
 1 file changed, 840 insertions(+), 41 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index deb1f21c..6e7388a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -281,11 +281,12 @@ struct bfq_ttime {
  * struct bfq_queue - leaf schedulable entity.
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async. @cgroup holds a reference to
- * the cgroup, to be sure that it does not disappear while a bfqq
- * still references it (mostly to avoid races between request issuing
- * and task migration followed by cgroup destruction).  All the fields
- * are protected by the queue lock of the containing bfqd.
+ * io_context or more, if it is async or shared between cooperating
+ * processes. @cgroup holds a reference to the cgroup, to be sure that it
+ * does not disappear while a bfqq still references it (mostly to avoid
+ * races between request issuing and task migration followed by cgroup
+ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
/* reference counter */
@@ -298,6 +299,16 @@ struct bfq_queue {
/* next ioprio and ioprio class if a change is in progress */
unsigned short new_ioprio, new_ioprio_class;
 
+   /*
+* Shared bfq_queue if queue is cooperating with one or more
+* other queues.
+*/
+   struct bfq_queue *new_bfqq;
+   /* request-position tree member (see bfq_group's @rq_pos_tree) */
+   struct rb_node pos_node;
+   /* request-position tree root (see bfq_group's @rq_pos_tree) */
+   struct rb_root *pos_root;
+
/* sorted list of pending requests */
struct rb_root sort_list;
/* if fifo isn't expired, next request to serve */
@@ -347,6 +358,12 @@ struct bfq_queue {
/* pid of the process owning the queue, used for logging purposes */
pid_t pid;
 
+   /*
+* Pointer to the bfq_io_cq owning the bfq_queue, set to %NULL
+* if the queue is shared.
+*/
+   struct bfq_io_cq *bic;
+
/* current maximum weight-raising time for this queue */
unsigned long wr_cur_max_time;
/*
@@ -375,10 +392,13 @@ struct bfq_queue {
 * last transition from idle to backlogged.
 */
unsigned long service_from_backlogged;
+
/*
 * Value of wr start time when switching to soft rt
 */
unsigned

[PATCH V4 09/16] block, bfq: reduce latency during request-pool saturation

2017-04-12 Thread Paolo Valente
This patch introduces a heuristic that reduces latency when the
I/O-request pool is saturated. This goal is achieved by disabling
device idling, for non-weight-raised queues, when there are weight-
raised queues with pending or in-flight requests. In fact, as
explained in more detail in the comment on the function
bfq_bfqq_may_idle(), this reduces the rate at which processes
associated with non-weight-raised queues grab requests from the pool,
thereby increasing the probability that processes associated with
weight-raised queues get a request immediately (or at least soon) when
they need one. Along the same line, if there are weight-raised queues,
then this patch halves the service rate of async (write) requests for
non-weight-raised queues.
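
For illustration, the new condition can be sketched as below; the
names are illustrative, and the real bfq_bfqq_may_idle() (see the hunk
further down) combines this with several other criteria.

/*
 * Sketch: decide whether idling still boosts throughput without
 * hurting weight-raised queues.
 */
static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
					     struct bfq_queue *bfqq,
					     bool idling_boosts_thr)
{
	/*
	 * If some weight-raised queue is busy, do not idle on behalf of
	 * non-weight-raised queues: their processes then grab requests
	 * from the pool at a lower rate, leaving requests available for
	 * the latency-sensitive, weight-raised queues.
	 */
	if (bfqd->wr_busy_queues > 0 && bfqq->wr_coeff == 1)
		return false;

	return idling_boosts_thr;
}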

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 66 ++---
 1 file changed, 63 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 574a5f6..deb1f21c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -420,6 +420,8 @@ struct bfq_data {
 * queue in service, even if it is idling).
 */
int busy_queues;
+   /* number of weight-raised busy @bfq_queues */
+   int wr_busy_queues;
/* number of queued requests */
int queued;
/* number of requests dispatched and waiting for completion */
@@ -2490,6 +2492,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, 
struct bfq_queue *bfqq,
 
bfqd->busy_queues--;
 
+   if (bfqq->wr_coeff > 1)
+   bfqd->wr_busy_queues--;
+
bfqg_stats_update_dequeue(bfqq_group(bfqq));
 
bfq_deactivate_bfqq(bfqd, bfqq, true, expiration);
@@ -2506,6 +2511,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
 
bfq_mark_bfqq_busy(bfqq);
bfqd->busy_queues++;
+
+   if (bfqq->wr_coeff > 1)
+   bfqd->wr_busy_queues++;
 }
 
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
@@ -3779,7 +3787,16 @@ static unsigned long bfq_serv_to_charge(struct request 
*rq,
if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
return blk_rq_sectors(rq);
 
-   return blk_rq_sectors(rq) * bfq_async_charge_factor;
+   /*
+* If there are no weight-raised queues, then amplify service
+* by just the async charge factor; otherwise amplify service
+* by twice the async charge factor, to further reduce latency
+* for weight-raised queues.
+*/
+   if (bfqq->bfqd->wr_busy_queues == 0)
+   return blk_rq_sectors(rq) * bfq_async_charge_factor;
+
+   return blk_rq_sectors(rq) * 2 * bfq_async_charge_factor;
 }
 
 /**
@@ -4234,6 +4251,7 @@ static void bfq_add_request(struct request *rq)
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
 
+   bfqd->wr_busy_queues++;
bfqq->entity.prio_changed = 1;
}
if (prev != bfqq->next_rq)
@@ -4474,6 +4492,8 @@ static void bfq_requests_merged(struct request_queue *q, 
struct request *rq,
 /* Must be called with bfqq != NULL */
 static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
 {
+   if (bfq_bfqq_busy(bfqq))
+   bfqq->bfqd->wr_busy_queues--;
bfqq->wr_coeff = 1;
bfqq->wr_cur_max_time = 0;
bfqq->last_wr_start_finish = jiffies;
@@ -5497,7 +5517,8 @@ static bool bfq_may_expire_for_budg_timeout(struct 
bfq_queue *bfqq)
 static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 {
struct bfq_data *bfqd = bfqq->bfqd;
-   bool idling_boosts_thr, asymmetric_scenario;
+   bool idling_boosts_thr, idling_boosts_thr_without_issues,
+   asymmetric_scenario;
 
if (bfqd->strict_guarantees)
return true;
@@ -5520,6 +5541,44 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
 
/*
+* The value of the next variable,
+* idling_boosts_thr_without_issues, is equal to that of
+* idling_boosts_thr, unless a special case holds. In this
+* special case, described below, idling may cause problems to
+* weight-raised queues.
+*
+* When the request pool is saturated (e.g., in the presence
+* of write hogs), if the processes associated with
+* non-weight-raised queues ask for requests at a lower rate,
+* then processes associated with weight-raised queues have a
+* higher probability to get a request from the pool
+* immediately (or at least soon) when they need one. Thus
+* they have a higher probability to actually get a fraction
+* of the device throughput proportional to their high
+* weight. This is espe

[PATCH V4 02/16] block, bfq: add full hierarchical scheduling and cgroups support

2017-04-12 Thread Paolo Valente
From: Arianna Avanzini 

Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.

Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
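
For illustration, the top-down descent described above can be sketched
as follows. The identifiers mirror BFQ's entity abstraction
(sched_data, next_in_service, bfq_entity_to_bfqq), but the function is
only an illustration, not the exact code of the patch.

/*
 * Sketch: walk from the root group down to the leaf entity (bfq_queue)
 * to be served next. bfq_entity_to_bfqq() returns NULL for group
 * entities and non-NULL for leaf entities.
 */
static struct bfq_queue *next_queue_to_serve(struct bfq_data *bfqd)
{
	struct bfq_sched_data *sd = &bfqd->root_group->sched_data;
	struct bfq_queue *bfqq = NULL;

	while (sd) {
		struct bfq_entity *entity = sd->next_in_service;

		if (!entity)		/* nothing to serve in this subtree */
			return NULL;

		bfqq = bfq_entity_to_bfqq(entity);
		if (bfqq)		/* leaf: this is the queue to serve */
			break;

		sd = entity->my_sched_data; /* group: descend one level */
	}

	return bfqq;
}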

Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.

Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.
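
As a concrete example of the relation mentioned above (also reported in
the documentation hunk below), the ioprio-to-weight mapping boils down
to the following; the helper name is illustrative.

/*
 * weight = (IOPRIO_BE_NR - ioprio) * 10
 *
 * With the 8 best-effort priority levels, ioprio 0 maps to weight 80
 * and ioprio 7 to weight 10; with cgroups, instead, any weight can be
 * assigned to a group directly.
 */
static int ioprio_to_weight(int ioprio)
{
	return (IOPRIO_BE_NR - ioprio) * 10;
}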

Signed-off-by: Fabio Checconi 
Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 Documentation/block/bfq-iosched.txt |   17 +-
 block/Kconfig.iosched   |   10 +
 block/bfq-iosched.c | 2568 ++-
 include/linux/blkdev.h  |2 +-
 4 files changed, 2213 insertions(+), 384 deletions(-)

diff --git a/Documentation/block/bfq-iosched.txt 
b/Documentation/block/bfq-iosched.txt
index cbf85f6f..461b27f 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -253,9 +253,14 @@ of slice_idle are copied from CFQ too.
 per-process ioprio and weight
 -
 
-Unless the cgroups interface is used, weights can be assigned to
-processes only indirectly, through I/O priorities, and according to
-the relation: weight = (IOPRIO_BE_NR - ioprio) * 10.
+Unless the cgroups interface is used (see "4. BFQ group scheduling"),
+weights can be assigned to processes only indirectly, through I/O
+priorities, and according to the relation:
+weight = (IOPRIO_BE_NR - ioprio) * 10.
+
+Beware that, if low-latency is set, then BFQ automatically raises the
+weight of the queues associated with interactive and soft real-time
+applications. Unset this tunable if you need/want to control weights.
 
 slice_idle
 --
@@ -450,9 +455,9 @@ may be reactivated for an already busy async queue (in ms).
 4. Group scheduling with BFQ
 
 
-BFQ supports both cgroup-v1 and cgroup-v2 io controllers, namely blkio
-and io. In particular, BFQ supports weight-based proportional
-share.
+BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
+blkio and io. In particular, BFQ supports weight-based proportional
+share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
 
 4-1 Service guarantees provided
 ---
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 562e30e..a37cd03 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -40,6 +40,7 @@ config CFQ_GROUP_IOSCHED
  Enable group IO scheduling in CFQ.
 
 choice
+
prompt "Default I/O scheduler"
default DEFAULT_CFQ
help
@@ -80,6 +81,15 @@ config IOSCHED_BFQ
real-time applications.  Details in
Documentation/block/bfq-iosched.txt
 
+config BFQ_GROUP_IOSCHED
+   bool "BFQ hierarchical scheduling support"
+   depends on IOSCHED_BFQ && BLK_CGROUP
+   default n
+   ---help---
+
+   Enable hierarchical scheduling in BFQ, using the blkio
+   (cgroups-v1) or io (cgroups-v2) controller.
+
 endmenu
 
 endif
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c4e7d8d..af1740a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -90,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -114,7 +115,7 @@
 
 #d

[PATCH V4 05/16] block, bfq: add more fairness with writes and slow processes

2017-04-12 Thread Paolo Valente
This patch deals with two sources of unfairness, which can also cause
high latencies and throughput loss. The first source is related to
write requests. Write requests tend to starve read requests, basically
because, on the one hand, writes are slower than reads, whereas, on the
other hand, storage devices confuse schedulers by deceptively
signaling the completion of write requests immediately after receiving
them. This patch addresses this issue by just throttling writes. In
particular, after a write request is dispatched for a queue, the
budget of the queue is decremented by the number of sectors to write,
multiplied by an (over)charge coefficient. The value of the
coefficient is the result of our tuning with different devices.

The second source of unfairness has to do with slowness detection:
when the in-service queue is expired, BFQ also checks whether the
queue has been "too slow", i.e., has consumed its last-assigned budget
at such a low rate that it would have been impossible to consume all
of this budget within the maximum time slice T_max (Subsec. 3.5 in
[1]). In this case, the queue is always (over)charged the whole
budget, to reduce its utilization of the device. Both this overcharge
and the slowness-detection criterion may cause unfairness.

First, always charging a full budget to a slow queue is too coarse. It
is much more accurate to charge an amount of service 'equivalent' to
the amount of time during which the queue has been in service, and this
patch lets BFQ do exactly that. As explained in more detail in the comments
on the code, this enables BFQ to provide time fairness among slow
queues.

Secondly, because of ZBR (zoned bit recording), a queue may be deemed as slow when its
associated process is performing I/O on the slowest zones of a
disk. However, unless the process is truly too slow, not reducing the
disk utilization of the queue is more profitable in terms of disk
throughput than the opposite. A similar problem is caused by logical
block mapping on non-rotational devices. For this reason, this patch
lets a queue be charged time, and not budget, only if the queue has
consumed less than 2/3 of its assigned budget. As an additional,
important benefit, this tolerance allows BFQ to preserve enough
elasticity to still perform bandwidth, and not time, distribution with
little unlucky or quasi-sequential processes.

Finally, for the same reasons as above, this patch makes slowness
detection itself much less harsh: a queue is deemed slow only if it
has consumed its budget at less than half of the peak rate.

[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
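
For illustration, the time-based charge can be sketched as follows.
This is a simplified sketch with illustrative names, not the actual
bfq_bfqq_charge_time() introduced by the patch, which operates directly
on the entity's service and budget fields.

/*
 * Sketch: charge the fraction of the (maximum) budget proportional to
 * the time the queue actually spent in service, never exceeding the
 * budget itself. timeout_ms plays the role of T_max.
 */
static int time_equivalent_charge(int max_budget, unsigned int time_ms,
				  unsigned int timeout_ms)
{
	long long charge = (long long)max_budget * time_ms / timeout_ms;

	return charge < max_budget ? (int)charge : max_budget;
}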

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 120 +---
 1 file changed, 85 insertions(+), 35 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 61d880b..dce273b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -753,6 +753,13 @@ static const int bfq_stats_min_budgets = 194;
 /* Default maximum budget values, in sectors and number of requests. */
 static const int bfq_default_max_budget = 16 * 1024;
 
+/*
+ * Async to sync throughput distribution is controlled as follows:
+ * when an async request is served, the entity is charged the number
+ * of sectors of the request, multiplied by the factor below
+ */
+static const int bfq_async_charge_factor = 10;
+
 /* Default timeout values, in jiffies, approximating CFQ defaults. */
 static const int bfq_timeout = HZ / 8;
 
@@ -1571,22 +1578,52 @@ static void bfq_bfqq_served(struct bfq_queue *bfqq, int 
served)
 }
 
 /**
- * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * bfq_bfqq_charge_time - charge an amount of service equivalent to the length
+ *   of the time interval during which bfqq has been in
+ *   service.
+ * @bfqd: the device
  * @bfqq: the queue that needs a service update.
+ * @time_ms: the amount of time during which the queue has received service
  *
- * When it's not possible to be fair in the service domain, because
- * a queue is not consuming its budget fast enough (the meaning of
- * fast depends on the timeout parameter), we charge it a full
- * budget.  In this way we should obtain a sort of time-domain
- * fairness among all the seeky/slow queues.
+ * If a queue does not consume its budget fast enough, then providing
+ * the queue with service fairness may impair throughput, more or less
+ * severely. For this reason, queues that consume their budget slowly
+ * are provided with time fairness instead of service fairness. This
+ * goal is achieved through the B

[PATCH V4 04/16] block, bfq: modify the peak-rate estimator

2017-04-12 Thread Paolo Valente
Unless the maximum budget B_max that BFQ can assign to a queue is set
explicitly by the user, BFQ automatically updates B_max. In
particular, BFQ dynamically sets B_max to the number of sectors that
can be read, at the current estimated peak rate, during the maximum
time, T_max, allowed before a budget timeout occurs. In formulas, if
we denote as R_est the estimated peak rate, then B_max = T_max *
R_est. Hence, the higher R_est is with respect to the actual device
peak rate, the higher the probability that processes unjustly incur
budget timeouts. Besides, a too high value of B_max unnecessarily
increases the deviation from an ideal, smooth service.

Unfortunately, it is not trivial to estimate the peak rate correctly:
because of the presence of sw and hw queues between the scheduler and
the device components that finally serve I/O requests, it is hard to
say exactly when a given dispatched request is served inside the
device, and for how long. As a consequence, it is hard to know
precisely at what rate a given set of requests is actually served by
the device.

At the opposite end, the dispatch time of any request is trivially
available, and, from this piece of information, the "dispatch rate"
of requests can be immediately computed. So, the idea in the next
function is to use what is known, namely request dispatch times
(plus, when useful, request completion times), to estimate what is
unknown, namely in-device request service rate.

The main issue is that, because of the above facts, the rate at
which a certain set of requests is dispatched over a certain time
interval can vary greatly with respect to the rate at which the
same requests are then served. But, since the size of any
intermediate queue is limited, and the service scheme is lossless
(no request is silently dropped), the following obvious convergence
property holds: the number of requests dispatched MUST become
closer and closer to the number of requests completed as the
observation interval grows. This is the key property used in
this new version of the peak-rate estimator.
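
For illustration, the update performed at the end of an observation
interval could be sketched as follows. The fields are the ones added by
this patch; the 7/8 low-pass weighting and the final auto-tuning step
are illustrative simplifications (the real code also weights each
sample by the amount of service it covers).

/*
 * Sketch: derive the dispatch rate of the just-ended observation
 * interval and blend it into the running peak-rate estimate, then
 * auto-tune B_max = T_max * R_est.
 */
static void update_peak_rate(struct bfq_data *bfqd, u64 timeout_us)
{
	u64 rate;

	if (bfqd->peak_rate_samples < BFQ_RATE_MIN_SAMPLES ||
	    bfqd->delta_from_first == 0)
		return;	/* too few samples to trust this interval */

	/* observed rate over the interval, in shifted sectors/usec */
	rate = div64_u64(bfqd->tot_sectors_dispatched << BFQ_RATE_SHIFT,
			 bfqd->delta_from_first);

	/* low-pass filter: blend the new sample into the estimate */
	bfqd->peak_rate = (7 * (u64)bfqd->peak_rate + rate) / 8;

	/* auto-tune the maximum budget, back in plain sectors */
	bfqd->bfq_max_budget = (bfqd->peak_rate * timeout_us) >> BFQ_RATE_SHIFT;
}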

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 497 +++-
 1 file changed, 372 insertions(+), 125 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1edac72..61d880b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -407,19 +407,37 @@ struct bfq_data {
/* on-disk position of the last served request */
sector_t last_position;
 
+   /* time of last request completion (ns) */
+   u64 last_completion;
+
+   /* time of first rq dispatch in current observation interval (ns) */
+   u64 first_dispatch;
+   /* time of last rq dispatch in current observation interval (ns) */
+   u64 last_dispatch;
+
/* beginning of the last budget */
ktime_t last_budget_start;
/* beginning of the last idle slice */
ktime_t last_idling_start;
-   /* number of samples used to calculate @peak_rate */
+
+   /* number of samples in current observation interval */
int peak_rate_samples;
+   /* num of samples of seq dispatches in current observation interval */
+   u32 sequential_samples;
+   /* total num of sectors transferred in current observation interval */
+   u64 tot_sectors_dispatched;
+   /* max rq size seen during current observation interval (sectors) */
+   u32 last_rq_max_size;
+   /* time elapsed from first dispatch in current observ. interval (us) */
+   u64 delta_from_first;
/*
-* Peak read/write rate, observed during the service of a
-* budget [BFQ_RATE_SHIFT * sectors/usec]. The value is
-* left-shifted by BFQ_RATE_SHIFT to increase precision in
+* Current estimate of the device peak rate, measured in
+* [BFQ_RATE_SHIFT * sectors/usec]. The left-shift by
+* BFQ_RATE_SHIFT is performed to increase precision in
 * fixed-point calculations.
 */
-   u64 peak_rate;
+   u32 peak_rate;
+
/* maximum budget allotted to a bfq_queue before rescheduling */
int bfq_max_budget;
 
@@ -740,7 +758,7 @@ static const int bfq_timeout = HZ / 8;
 
 static struct kmem_cache *bfq_pool;
 
-/* Below this threshold (in ms), we consider thinktime immediate. */
+/* Below this threshold (in ns), we consider thinktime immediate. */
 #define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
 
 /* hw_tag detection: parallel requests threshold and min samples needed. */
@@ -752,8 +770,12 @@ static struct kmem_cache *bfq_pool;
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 32/8)
 
-/* Min samples used for peak rate estimation (for autotuning). */
-#define BFQ_PEAK_RATE_SAMPLES  32
+/* Min number of samples required to perform peak-rate update */
+#define BFQ_RATE_MIN_SAMPLES   32
+/* Min observati

Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-25 Thread Paolo Valente

> Il giorno 24 mag 2017, alle ore 17:47, Tejun Heo  ha scritto:
> 
> Hello,
> 
> On Wed, May 24, 2017 at 05:43:18PM +0100, Paolo Valente wrote:
>>> so none of the above objects can be destroyed before the request is
>>> done.
>> 
>> ... the issue seems just to move to a more subtle position: cfq is ok,
>> because it protects itself with rq lock, but blk-mq schedulers don't.
>> So, the race that leads to the (real) crashes reported by people may
>> actually be:
> 
> Oh, I was just thinking about !mq paths the whole time.
> 
>> 1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a
>> copy of the content of the blkg, but the rcu mechanism doesn't prevent
>> destruction from going on
>> 2 blkg_get gets executed on the copy of the original blkg
> 
> So, we can't do that.  We should look up and bump the ref and use the
> original copy.  We probably should switch blkgs to use percpu-refs.
> 

Ok.  So, just to better understand: as of now, i.e., before you make
the changes you are proposing, the address returned by blkg_lookup can
be used safely only if one both invokes blkg_lookup and gets a
reference, while holding the rq lock all the time.  But then, before
the changes you propose, what is the remaining role of rcu protection
here?  Are there places where the value returned by blkg_lookup is
actually used safely without getting a reference to the returned blkg?

Anyway, I'm willing to help with your proposal, if you think I can be
of any help at some point.  In this respect, consider that I'm not an
expert of percpu-refs either.

Finally, I guess that the general fix you have in mind may not be
ready shortly.  So, I'll proceed with my temporary fix for the moment.
In particular, I will
1) fix the typo reported by Jens;
2) add a note stating that this is a temporary fix;
3) if needed, modify commit log and comments in the diffs, to better
describe the general problem, and the actual critical race.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



[PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-19 Thread Paolo Valente
Operations on blkg objects in blk-cgroup are protected with the
request_queue lock, which is no longer the lock that protects
I/O-scheduler operations in blk-mq. The latter are now protected with
finer-grained per-scheduler-instance locks. As a consequence, if blkg
and blkg-related objects are accessed in a blk-mq I/O scheduler, it is
possible to have races, unless proper care is taken for these
accesses. BFQ does access these objects, and does incur these races.

This commit addresses this issue without introducing further locks, by
exploiting the following facts.  Destroy operations on a blkg invoke,
as a first step, hooks of the scheduler associated with the blkg. And
these hooks are executed with bfqd->lock held for BFQ. As a
consequence, for any blkg associated with the request queue an
instance of BFQ is attached to, we are guaranteed that such a blkg is
not destroyed and that all the pointers it contains are consistent,
(only) while that instance is holding its bfqd->lock. A blkg_lookup
performed with bfqd->lock held then returns a fully consistent blkg,
which remains consistent as long as this lock is held.

In view of these facts, this commit caches any needed blkg data (only)
when it (safely) detects a parent-blkg change for an internal entity,
and, to cache these data safely, it gets the new blkg, through a
blkg_lookup, and copies data while keeping the bfqd->lock held. As of
now, BFQ needs to cache only the path of the blkg, which is used in
the bfq_log_* functions.

This commit also removes or updates some stale comments on locking
issues related to blk-cgroup operations.
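
For illustration, the caching step boils down to something like the
following, executed with bfqd->lock held in bfq_bic_update_cgroup().
The blkg_path[] array in bfq_group is the field added by this patch
(not visible in the hunks below), and the helper name is illustrative.

/*
 * Sketch: with bfqd->lock held, the blkg associated with bfqg cannot be
 * destroyed, so its path can be safely copied for later use by the
 * bfq_log_* functions, which may run after the blkg is gone.
 */
static void bfqg_cache_blkg_path(struct bfq_group *bfqg)
{
	blkg_path(bfqg_to_blkg(bfqg), bfqg->blkg_path,
		  sizeof(bfqg->blkg_path));
}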

Reported-by: Tomas Konir 
Reported-by: Lee Tibbert 
Reported-by: Marco Piazza 
Signed-off-by: Paolo Valente 
Tested-by: Tomas Konir 
Tested-by: Lee Tibbert 
Tested-by: Marco Piazza 
---
 block/bfq-cgroup.c  | 56 +
 block/bfq-iosched.h | 18 +++--
 2 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c8a32fb..06195ff 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -52,7 +52,7 @@ BFQG_FLAG_FNS(idling)
 BFQG_FLAG_FNS(empty)
 #undef BFQG_FLAG_FNS
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
 {
unsigned long long now;
@@ -67,7 +67,7 @@ static void bfqg_stats_update_group_wait_time(struct 
bfqg_stats *stats)
bfqg_stats_clear_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
 struct bfq_group *curr_bfqg)
 {
@@ -81,7 +81,7 @@ static void bfqg_stats_set_start_group_wait_time(struct 
bfq_group *bfqg,
bfqg_stats_mark_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
 {
unsigned long long now;
@@ -496,9 +496,10 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
  * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
  * it on the new one.  Avoid putting the entity on the old group idle tree.
  *
- * Must be called under the queue lock; the cgroup owning @bfqg must
- * not disappear (by now this just means that we are called under
- * rcu_read_lock()).
+ * Must be called under the scheduler lock, to make sure that the blkg
+ * owning @bfqg does not disappear (see comments in
+ * bfq_bic_update_cgroup on guaranteeing the consistency of blkg
+ * objects).
  */
 void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
   struct bfq_group *bfqg)
@@ -545,8 +546,9 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue 
*bfqq,
  * @bic: the bic to move.
  * @blkcg: the blk-cgroup to move to.
  *
- * Move bic to blkcg, assuming that bfqd->queue is locked; the caller
- * has to make sure that the reference to cgroup is valid across the call.
+ * Move bic to blkcg, assuming that bfqd->lock is held; which makes
+ * sure that the reference to cgroup is valid across the call (see
+ * comments in bfq_bic_update_cgroup on this issue)
  *
  * NOTE: an alternative approach might have been to store the current
  * cgroup in bfqq and getting a reference to it, reducing the lookup
@@ -604,6 +606,39 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct 
bio *bio)
goto out;
 
bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
+   /*
+* Update blkg_path for bfq_log_* functions. We cache this
+* path, and update it here, for the following
+* reasons. Operations on blkg objects in blk-cgroup are
+* protected with the request_queue lock, and not with t

Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-20 Thread Paolo Valente

> Il giorno 19 mag 2017, alle ore 16:54, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Fri, May 19, 2017 at 10:39:08AM +0200, Paolo Valente wrote:
>> Operations on blkg objects in blk-cgroup are protected with the
>> request_queue lock, which is no more the lock that protects
>> I/O-scheduler operations in blk-mq. The latter are now protected with
>> finer-grained per-scheduler-instance locks. As a consequence, if blkg
>> and blkg-related objects are accessed in a blk-mq I/O scheduler, it is
>> possible to have races, unless proper care is taken for these
>> accesses. BFQ does access these objects, and does incur these races.
>> 
>> This commit addresses this issue without introducing further locks, by
>> exploiting the following facts.  Destroy operations on a blkg invoke,
>> as a first step, hooks of the scheduler associated with the blkg. And
>> these hooks are executed with bfqd->lock held for BFQ. As a
>> consequence, for any blkg associated with the request queue an
>> instance of BFQ is attached to, we are guaranteed that such a blkg is
>> not destroyed and that all the pointers it contains are consistent,
>> (only) while that instance is holding its bfqd->lock. A blkg_lookup
>> performed with bfqd->lock held then returns a fully consistent blkg,
>> which remains consistent until this lock is held.
>> 
>> In view of these facts, this commit caches any needed blkg data (only)
>> when it (safely) detects a parent-blkg change for an internal entity,
>> and, to cache these data safely, it gets the new blkg, through a
>> blkg_lookup, and copies data while keeping the bfqd->lock held. As of
>> now, BFQ needs to cache only the path of the blkg, which is used in
>> the bfq_log_* functions.
>> 
>> This commit also removes or updates some stale comments on locking
>> issues related to blk-cgroup operations.
> 
> For a quick fix, this is fine but I think it'd be much better to
> update blkcg core so that we protect lookups with rcu and refcnt the
> blkg with percpu refs so that we can use blkcg correctly for all
> purposes with blk-mq.  There's no reason to hold up the immediate fix
> for that but it'd be nice to at least note what we should be doing in
> the longer term in a comment.
> 

Ok.

I have started thinking of a blk-cgroup-wide solution, but, Tejun and
Jens, the more I think about it, the more I see a more structural bug
:( The bug seems to affect CFQ too, even if CFQ still uses the
request_queue lock.  Hoping that the bug is only in my mind, here is,
first, my understanding of how the data structures related to the bug
are handled, and, second, why handling them this way apparently
leads to the bug.

For a given instance of [B|C]FQ (i.e., of BFQ or CFQ), a [b|c]fq_group
(descriptor of a group in [B|C]FQ) is created on the creation of each
blkg associated with the same request queue that instance of [B|C]FQ
is attached to.  The schedulable entities that belong to this blkg
(only queues in CFQ, or both queues and generic entities in BFQ), are
then associated with this [b|c]fq_group on the arrival on new I/O
requests for them: these entities contain a pointer to a
[b|c]fq_group, and the pointer is assigned the address of this new
[b|c]fq_group.  The functions where the association occurs are
bfq_get_rq_private for BFQ and cfq_set_request for CFQ.  Both hooks
are executed before the hook for actually enqueueing the request.  Any
access to group information is performed through this [b|c]fq_group
field.  The associated blkg is accessed through the policy_data
pointer in the bfq_group (the policy data in its turn contains a
pointer to the blkg)

Consider a process or a group that is moved from a given source group
to a different group, or simply removed from a group (although I
didn't yet succeed in just removing a process from a group :) ).  The
pointer to the [b|c]fq_group contained in the schedulable entity
belonging to the source group *is not* updated, in BFQ, if the entity
is idle, and *is not* updated *unconditionally* in CFQ.  The update
will happen in bfq_get_rq_private or cfq_set_request, on the arrival
of a new request.  But, if the move happens right after the arrival of
a request, then all the scheduler functions executed until a new
request arrives for that entity will see a stale [b|c]fq_group.  Much
worse, if also a blkcg_deactivate_policy or a blkg_destroy are
executed right after the move, then both the policy data pointed by
the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
So, all the functions of the scheduler invoked before next request
arrival may use dangling references!

The symptom reported by BFQ users has been actually the dereference of
dangling bfq_group or policy data pointers in a request_insert

What do you think, have I been mistaken in some step?

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



races between blk-cgroup operations and I/O scheds in blk-mq (?)

2017-05-15 Thread Paolo Valente
Hi Tejun, Jens, and anyone else possibly interested in this issue,
I have realized that, while blk-cgroup operations are of course
protected by the usual request_queue lock, I/O scheduler operations
aren't any longer protected by this same lock in blk-mq.  They are
protected by a finer-grained scheduler lock instead.  If I'm not
missing anything, this exposes to obvious races any I/O scheduler
supporting cgroups, as bfq.  So I have tried to check bfq code,
against blk-cgroup, as carefully as I could.

The only dangerous operations I found in blk-cgroup, for bfq, are blkg
destroy ones. But the scheduler hook related to these operations
(pd_offline) seems to be always invoked before any other, possibly
dangerous, step.  It seems then enough that this hook is executed with
the scheduler lock held, to serialize cgroup and in-scheduler
blkg-lookup operations.

As for in-scheduler operations, the only danger I found so far is the
dereference of the blkg_policy_data pointer field cached in the
descriptor of a group.  Given the parent group of some process in the
scheduler, that pointer may have become a dangling reference if the
policy data it pointed to has been destroyed, but the parent-group
pointer for that process has not yet been updated (the parent pointer
itself is then a dangling reference).  In this respect, these updates
happen (only) after the arrival of new I/O requests after the
destruction of a parent group.

So, unless you tell me that there are other races I haven't seen, or,
even worse, that I'm just talking nonsense, I have thought of a simple
solution to address this issue without resorting to the request_queue
lock: further caching, on blkg lookups, the only policy or blkg data
the scheduler may use, and access this data directly when needed.  By
doing so, the issue is reduced to the occasional use of stale data.
And apparently this already happens, e.g., in cfq when it uses the
weight of a cfq_queue associated with a process whose group has just
been changed (and for which a blkg_lookup has not yet been invoked).
The same should happen when cfq invokes cfq_log_cfqq for such a
cfq_queue, as this function prints the path of the group the cfq_queue
belongs to.

Thanks,
Paolo



Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-14 Thread Paolo Valente

> Il giorno 14 ott 2016, alle ore 18:40, Tejun Heo  ha scritto:
> 
> Hello, Kyle.
> 
> On Sat, Oct 08, 2016 at 06:15:14PM -0700, Kyle Sanderson wrote:
>> How is this even a discussion when hard numbers, and trying any
>> reproduction case easily reproduce the issues that CFQ causes. Reading
>> this thread, and many others only grows not only my disappointment,
>> but whenever someone launches kterm or scrot and their machine
>> freezes, leaves a selective few individuals completely responsible for
>> this. Help those users, help yourself, help Linux.
> 
> So, just to be clear.  I wasn't arguing against bfq replacing cfq (or
> anything along that line) but that proportional control, as
> implemented, would be too costly for many use cases and thus we need
> something along the line of what Shaohua is proposing.
> 

Sorry for dropping in all the times, but the vision that you and some
other guys propose seems to miss some important piece (unless, now or
then, you will patiently prove me wrong, or I will finally understand
on my own why I'm wrong).

You are of course right: bfq, as a component of blk, and above all, as
a sort of derivative of CFQ (and of its overhead), currently has too
high an overhead to handle more than 10-20K IOPS.

That said, your 'thus' seems a little too strong: "bfq does not yet
handle fast SSDs, thus we need something else".  What about the
millions of devices (and people) still within 10-20 K IOPS, and
experiencing awful latencies and lack of bandwidth guarantees?

For certain systems or applications, it isn't even just a "buy a fast
SSD" matter, but a technological constraint.

> FWIW, it looks like the only way we can implement proportional control
> on highspeed ssds with acceptable overhead

Maybe not: as I wrote to Viveck in a previous reply containing
pointers to documentation, we have already achieved twenty million
decisions per second with a prototype driving existing
proportional-share packet schedulers (essentially without
modifications).

> is somehow finding a way to
> calculate the cost of each IO and throttle IOs according to that while
> controlling for latency as necessary.  Slice scheduling with idling
> seems too expensive with highspeed devices with high io depth.
> 

Yes, that's absolutely true.  I'm already thinking about an idleless
solution.  As I already wrote, I'm willing to help with scheduling in
blk-mq.  I hope there will be the opportunity to find some way to go
at KS.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-16 Thread Paolo Valente

> Il giorno 14 ott 2016, alle ore 20:35, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Fri, Oct 14, 2016 at 07:13:41PM +0200, Paolo Valente wrote:
>> That said, your 'thus' seems a little too strong: "bfq does not yet
>> handle fast SSDs, thus we need something else".  What about the
>> millions of devices (and people) still within 10-20 K IOPS, and
>> experiencing awful latencies and lack of bandwidth guarantees?
> 
> I'm not objecting to any of that.

Ok, sorry for misunderstanding.  I'm just more and more confused about
why a readily available solution, not yet proven wrong, has not
been accepted, if everybody apparently acknowledges the problem.

>  My point just is that bfq, at least
> as currently implemented, is unfit for certain classes of use cases.
> 

Absolutely correct.

>>> FWIW, it looks like the only way we can implement proportional control
>>> on highspeed ssds with acceptable overhead
>> 
>> Maybe not: as I wrote to Viveck in a previous reply, containing
>> pointers to documentation, we have already achieved twenty millions
>> of decisions per second with a prototype driving existing
>> proportional-share packet schedulers (essentially without
>> modifications).
> 
> And that doesn't require idling and thus doesn't severely impact
> utilization?
> 

Nope.  Packets are commonly assumed to be sent asynchronously.
I guess that discussing the validity of this assumption is out of the
scope of this thread.

Thanks,
Paolo

>>> is somehow finding a way to
>>> calculate the cost of each IO and throttle IOs according to that while
>>> controlling for latency as necessary.  Slice scheduling with idling
>>> seems too expensive with highspeed devices with high io depth.
>> 
>> Yes, that's absolutely true.  I'm already thinking about an idleless
>> solution.  As I already wrote, I'm willing to help with scheduling in
>> blk-mq.  I hope there will be the opportunity to find some way to go
>> at KS.
> 
> It'd be great to have a proportional control mechanism whose overhead
> is acceptable.  Unfortunately, we don't have one now and nothing seems
> right around the corner.  (Mostly) work-conserving throttling would be
> fiddlier to use but is something which is useful regardless of such
> proportional control mechanism and can be obtained relatively easily.
> 
> I don't see why the two approaches would be mutually exclusive.
> 
> Thanks.
> 
> -- 
> tejun
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH 0/8] Throttled background buffered writeback v7

2016-09-09 Thread Paolo Valente
sync issue was disregarded, if the stat
>  sample weren't valid.
> - Speed up the division in rwb_arm_timer().
> - Add logic to scale back up for 'unknown' latency events.
> - Don't track sync issue timestamp of wbt is disabled.
> - Drop the dirty/writeback page inc/dec patch. We don't need it, and
>  it was racy.
> - Move block/blk-wb.c to lib/wbt.c
> 
> Changes since v3
> 
> - Re-do the mm/ writheback parts. Add REQ_BG for background writes,
>  and don't overload the wbc 'reason' for writeback decisions.
> - Add tracking for when apps are sleeping waiting for a page to complete.
> - Change wbc_to_write() to wbc_to_write_cmd().
> - Use atomic_t for the balance_dirty_pages() sleep count.
> - Add a basic scalable block stats tracking framework.
> - Rewrite blk-wb core as described above, to dynamically adapt. This is
>  a big change, see the last patch for a full description of it.
> - Add tracing to blk-wb, instead of using debug printk's.
> - Rebased to 4.6-rc3 (ish)
> 
> Changes since v2
> 
> - Switch from wb_depth to wb_percent, as that's an easier tunable.
> - Add the patch to track device depth on the block layer side.
> - Cleanup the limiting code.
> - Don't use a fixed limit in the wb wait, since it can change
>  between wakeups.
> - Minor tweaks, fixups, cleanups.
> 
> Changes since v1
> 
> - Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
> - wb_start_writeback() fills in background/reclaim/sync info in
>  the writeback work, based on writeback reason.
> - Use WRITE_SYNC for reclaim/sync IO
> - Split balance_dirty_pages() sleep change into separate patch
> - Drop get_request() u64 flag change, set the bit on the request
>  directly after-the-fact.
> - Fix wrong sysfs return value
> - Various small cleanups
> 
> 
> Documentation/block/queue-sysfs.txt |   13 
> block/Kconfig   |1 
> block/Makefile  |2 
> block/blk-core.c|   22 +
> block/blk-mq-sysfs.c|   47 ++
> block/blk-mq.c  |   42 ++
> block/blk-mq.h  |3 
> block/blk-settings.c|   15 
> block/blk-stat.c|  221 +++
> block/blk-stat.h|   18 
> block/blk-sysfs.c   |  151 
> block/cfq-iosched.c |   12 
> drivers/scsi/scsi.c |3 
> fs/buffer.c |2 
> fs/f2fs/data.c  |2 
> fs/f2fs/node.c  |2 
> fs/gfs2/meta_io.c   |3 
> fs/mpage.c  |2 
> fs/xfs/xfs_aops.c   |7 
> include/linux/backing-dev-defs.h|2 
> include/linux/blk_types.h   |   16 
> include/linux/blkdev.h  |   19 +
> include/linux/fs.h  |3 
> include/linux/wbt.h |  120 ++
> include/linux/writeback.h   |   10 
> include/trace/events/wbt.h  |  153 
> lib/Kconfig |3 
> lib/Makefile|1 
> lib/wbt.c       |  679 
> 
> mm/backing-dev.c|1 
> mm/page-writeback.c |1 
> 31 files changed, 1560 insertions(+), 16 deletions(-)
> 
> -- 
> Jens Axboe
> 
> 


--
Paolo Valente 
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy  
homepage:  http://algogroup.unimore.it/people/paolo/



Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-29 Thread Paolo Valente

> Il giorno 29 ott 2016, alle ore 16:12, Jens Axboe  ha 
> scritto:
> 
> On 10/28/2016 11:38 PM, Paolo Valente wrote:
>> 
>>> Il giorno 26 ott 2016, alle ore 18:12, Jens Axboe  ha 
>>> scritto:
>>> 
>>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>>>> 
>>>>> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe  ha 
>>>>> scritto:
>>>>> 
>>>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>>>> The question to ask first is whether to actually have pluggable
>>>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>>>> do the right thing in every case (and possibly can be bypassed
>>>>>>> completely).
>>>>>> 
>>>>>> That would be my preference.  Have a BFQ-variant for blk-mq as an
>>>>>> option (default to off unless opted in by the driver or user), and
>>>>>> not other scheduler for blk-mq.  Don't bother with bfq for non
>>>>>> blk-mq.  It's not like there is any advantage in the legacy-request
>>>>>> device even for slow devices, except for the option of having I/O
>>>>>> scheduling.
>>>>> 
>>>>> It's the only right way forward. blk-mq might not offer any substantial
>>>>> advantages to rotating storage, but with scheduling, it won't offer a
>>>>> downside either. And it'll take us towards the real goal, which is to
>>>>> have just one IO path.
>>>> 
>>>> ok
>>>> 
>>>>> Adding a new scheduler for the legacy IO path
>>>>> makes no sense.
>>>> 
>>>> I would fully agree if effective and stable I/O scheduling would be
>>>> available in blk-mq in one or two months.  But I guess that it will
>>>> take at least one year optimistically, given the current status of the
>>>> needed infrastructure, and given the great difficulties of doing
>>>> effective scheduling at the high parallelism and extreme target speeds
>>>> of blk-mq.  Of course, this holds true unless little clever scheduling
>>>> is performed.
>>>> 
>>>> So, what's the point in forcing a lot of users wait another year or
>>>> more, for a solution that has yet to be even defined, while they could
>>>> enjoy a much better system, and then switch an even better system when
>>>> scheduling is ready in blk-mq too?
>>> 
>>> That same argument could have been made 2 years ago. Saying no to a new
>>> scheduler for the legacy framework goes back roughly that long. We could
>>> have had BFQ for mq NOW, if we didn't keep coming back to this very
>>> point.
>>> 
>>> I'm hesitant to add a new scheduler because it's very easy to add, very
>>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>>> it'll take us years and years to get rid of it again. We should be
>>> moving towards LESS moving parts in the legacy path, not more.
>>> 
>>> We can keep having this discussion every few years, but I think we'd
>>> both prefer to make some actual progress here.
>> 
>> ok Jens, I give up
>> 
>>> It's perfectly fine to
>>> add an interface for a single queue interface for an IO scheduler for
>>> blk-mq, since we don't care too much about scalability there. And that
>>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>>> that should not be hard either. That can co-exist with a real multiqueue
>>> scheduler as well, something that's geared towards some fairness for
>>> faster devices.
>>> 
>> 
>> AFAICT this solution is good, for many practical reasons.  I don't
>> have the expertise to make such an infrastructure well on my own.  At
>> least not in an acceptable amount of time, because working on this
>> nice stuff is unfortunately not my job (although Linaro is now
>> supporting me for BFQ).
>> 
>> Then, assuming that this solution may be of general interest, and that
>> BFQ benefits convinced you a little bit too, may I get significant
>> collaboration/help on implementing this infrastructure?
> 
> Of course, I already offered to help with this.
> 

Yep, I just did not want to take this important point for granted.

>> If so, Jens
>> and all possibly interested parties, could we have a sort of short
>> kick-off te

[PATCH BUGFIX] block, bfq: use pointer entity->sched_data only if set

2017-05-09 Thread Paolo Valente
In the function __bfq_deactivate_entity, the pointer
entity->sched_data could happen to be used before being properly
initialized. This led to a NULL pointer dereference. This commit fixes
this bug by just using this pointer only where it is safe to do so.

Reported-by: Tom Harrison 
Tested-by: Tom Harrison 
Signed-off-by: Paolo Valente 
---
 block/bfq-wf2q.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index b4fc3e4..8726ede 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1114,12 +1114,21 @@ static void bfq_activate_requeue_entity(struct 
bfq_entity *entity,
 bool __bfq_deactivate_entity(struct bfq_entity *entity, bool 
ins_into_idle_tree)
 {
struct bfq_sched_data *sd = entity->sched_data;
-   struct bfq_service_tree *st = bfq_entity_service_tree(entity);
-   int is_in_service = entity == sd->in_service_entity;
+   struct bfq_service_tree *st;
+   bool is_in_service;
 
if (!entity->on_st) /* entity never activated, or already inactive */
return false;
 
+   /*
+* If we get here, then entity is active, which implies that
+* bfq_group_set_parent has already been invoked for the group
+* represented by entity. Therefore, the field
+* entity->sched_data has been set, and we can safely use it.
+*/
+   st = bfq_entity_service_tree(entity);
+   is_in_service = entity == sd->in_service_entity;
+
if (is_in_service)
bfq_calc_finish(entity, entity->service);
 
-- 
2.10.0



[PATCH BUGFIX] block, bfq: stress that low_latency must be off to get max throughput

2017-05-09 Thread Paolo Valente
The introduction of the BFQ and Kyber I/O schedulers has triggered a
new wave of I/O benchmarks. Unfortunately, comments and discussions on
these benchmarks confirm that there is still little awareness that it
is very hard to achieve, at the same time, a low latency and a high
throughput. In particular, virtually all benchmarks measure
throughput, or throughput-related figures of merit, but, for BFQ, they
use the scheduler in its default configuration. This configuration is
geared, instead, toward a low latency. This is evidently a sign that
BFQ documentation is still too unclear on this important aspect. This
commit addresses this issue by stressing how BFQ configuration must be
(easily) changed if the only goal is maximum throughput.

Signed-off-by: Paolo Valente 
---
 Documentation/block/bfq-iosched.txt | 17 -
 block/bfq-iosched.c |  5 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/bfq-iosched.txt 
b/Documentation/block/bfq-iosched.txt
index 1b87df6..05e2822 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -11,6 +11,13 @@ controllers), BFQ's main features are:
   groups (switching back to time distribution when needed to keep
   throughput high).
 
+In its default configuration, BFQ privileges latency over
+throughput. So, when needed for achieving a lower latency, BFQ builds
+schedules that may lead to a lower throughput. If your main or only
+goal, for a given device, is to achieve the maximum-possible
+throughput at all times, then do switch off all low-latency heuristics
+for that device, by setting low_latency to 0. Full details in Section 3.
+
 On average CPUs, the current version of BFQ can handle devices
 performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
 reference, 30-50 KIOPS correspond to very high bandwidths with
@@ -375,11 +382,19 @@ default, low latency mode is enabled. If enabled, 
interactive and soft
 real-time applications are privileged and experience a lower latency,
 as explained in more detail in the description of how BFQ works.
 
-DO NOT enable this mode if you need full control on bandwidth
+DISABLE this mode if you need full control on bandwidth
 distribution. In fact, if it is enabled, then BFQ automatically
 increases the bandwidth share of privileged applications, as the main
 means to guarantee a lower latency to them.
 
+In addition, as already highlighted at the beginning of this document,
+DISABLE this mode if your only goal is to achieve a high throughput.
+In fact, privileging the I/O of some application over the rest may
+entail a lower throughput. To achieve the highest-possible throughput
+on a non-rotational device, setting slice_idle to 0 may be needed too
+(at the cost of giving up any strong guarantee on fairness and low
+latency).
+
 timeout_sync
 
 
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bd8499e..08ce450 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -56,6 +56,11 @@
  * rotational or flash-based devices, and to get the job done quickly
  * for applications consisting in many I/O-bound processes.
  *
+ * NOTE: if the main or only goal, with a given device, is to achieve
+ * the maximum-possible throughput at all times, then do switch off
+ * all low-latency heuristics for that device, by setting low_latency
+ * to 0.
+ *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
-- 
2.10.0
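
(As a concrete illustration of the tuning described above, switching a device
to a pure-throughput configuration boils down to something like the following
userspace sketch. The device name and the exact sysfs paths are assumptions,
and they apply only once BFQ is the active scheduler for that device.)

/*
 * Sketch: put the BFQ instance serving /dev/sda into a pure-throughput
 * configuration, by clearing low_latency and, for non-rotational
 * devices, slice_idle as well.
 */
#include <stdio.h>
#include <stdlib.h>

static int write_attr(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Give up all low-latency heuristics for sda. */
	int ret = write_attr("/sys/block/sda/queue/iosched/low_latency", "0");

	/* On non-rotational devices, disabling idling may help further. */
	ret |= write_attr("/sys/block/sda/queue/iosched/slice_idle", "0");

	return ret ? EXIT_FAILURE : EXIT_SUCCESS;
}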



Re: [PATCH V3 02/16] block, bfq: add full hierarchical scheduling and cgroups support

2017-04-18 Thread Paolo Valente

> Il giorno 18 apr 2017, alle ore 09:04, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Wed, Apr 12, 2017 at 07:22:03AM +0200, Paolo Valente wrote:
>> could you elaborate a bit more on this?  I mean, cgroups support has
>> been in BFQ (and CFQ) for almost ten years, perfectly working as far
>> as I know.  Of course it is perfectly working in terms of I/O and not
>> of CPU bandwidth distribution; and, for the moment, it is effective
>> only for devices below 30-50KIOPS.  What's the point in throwing
>> (momentarily?) away such a fundamental feature?  What am I missing?
> 
> I've been trying to track down latency issues with the CPU controller
> which basically takes the same approach and I'm not sure nesting
> scheduler timelines is a good approach.  It intuitively feels elegant
> but seems to have some fundamental issues.  IIUC, bfq isn't quite the
> same in that it doesn't need load balancer across multiple queues and
> it could be that bfq is close enough to the basic model that the
> nested behavior maps to the correct scheduling behavior.
> 
> However, for example, in the CPU controller, the nested timelines
> break sleeper boost.  The boost is implemented by considering the
> thread to have woken up upto some duration prior to the current time;
> however, it only affects the timeline inside the cgroup and there's no
> good way to propagate it upwards.  The final result is two threads in
> a cgroup with the double weight can behave significantly worse in
> terms of latency compared to two threads with the weight of 1 in the
> root.
> 

Hi Tejun,
I don't know in detail the specific multiple-queue issues you report,
but bfq implements the upward propagation you mention: if a process in
a group is to be privileged, i.e., if the process has basically to be
provided with a higher weight (in addition to other important forms of
help), then this weight boost is propagated upward through the path
from the process to the root node in the group hierarchy.

> Given that the nested scheduling ends up pretty expensive, I'm not
> sure how good a model this nesting approach is.  Especially if there
> can be multiple queues, the weight distribution across cgroup
> instances across multiple queues has to be coordinated globally
> anyway,

To get perfect global service guarantees, yes.  But you can settle
with tradeoffs that, according to my experience with storage and
packet I/O, are so good as to be probably indistinguishable from an
ideal, but too costly solution.  I mean, with a well-done approximated
scheduling solution, the deviation with respect to an ideal service
can be of the same order as the noise caused by unavoidable latencies
of other sw and hw components than the scheduler.

> so the weight / cost adjustment part can't happen
> automatically anyway as in single queue case.  If we're going there,
> we might as well implement cgroup support by actively modulating the
> combined weights, which will make individual scheduling operations
> cheaper and it easier to think about and guarantee latency behaviors.
> 

Yes.  Anyway, I didn't quite understand what is or could be the
alternative, w.r.t. hierarchical scheduling, for guaranteeing
bandwidth distribution of shared resources in a complex setting.  If
you think I could be of any help on this, just put me somehow in the
loop.

> If you think that bfq will stay single queue and won't need timeline
> modifying heuristics (for responsiveness or whatever), the current
> approach could be fine, but I'm a bit wary about committing to the
> current approach if we're gonna encounter the same problems.
> 

As of now, bfq is targeted at not too fast devices (< 30-50KIOPS),
which happen to be single queue.  In particular, bfq is currently
agnostic w.r.t. the number of downstream queues.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



Re: bfq-mq performance comparison to cfq

2017-04-19 Thread Paolo Valente

> Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche 
>  ha scritto:
> 
> On 04/11/17 00:29, Paolo Valente wrote:
>> 
>>> Il giorno 10 apr 2017, alle ore 17:15, Bart Van Assche 
>>>  ha scritto:
>>> 
>>> On Mon, 2017-04-10 at 11:55 +0200, Paolo Valente wrote:
>>>> That said, if you do always want maximum throughput, even at the
>>>> expense of latency, then just switch off low-latency heuristics, i.e.,
>>>> set low_latency to 0.  Depending on the device, setting slice_idle to
>>>> 0 may help a lot too (as well as with CFQ).  If the throughput is
>>>> still low also after forcing BFQ to an only-throughput mode, then you
>>>> hit some bug, and I'll have a little more work to do ...
>>> 
>>> Has it been considered to make applications tell the I/O scheduler
>>> whether to optimize for latency or for throughput? It shouldn't be that
>>> hard for window managers and shells to figure out whether or not a new
>>> application that is being started is interactive or not. This would
>>> require a mechanism that allows applications to provide such information
>>> to the I/O scheduler. Wouldn't that be a better approach than the I/O
>>> scheduler trying to guess whether or not an application is an interactive
>>> application?
>> 
>> IMO that would be an (or maybe the) optimal solution, in terms of both
>> throughput and latency.  We have even developed a prototype doing what
>> you propose, for Android.  Unfortunately, I have not yet succeeded in
>> getting support, to turn it into candidate production code, or to make
>> a similar solution for lsb-compliant systems.
> 
> Hello Paolo,
> 
> What API was used by the Android application to tell the I/O scheduler 
> to optimize for latency? Do you think that it would be sufficient if the 
> application uses the ioprio_set() system call to set the I/O priority to 
> IOPRIO_CLASS_RT?
> 

That's exactly the hack we are using in our prototype.  However, it
can only be a temporary hack, because it mixes two slightly different
concepts: 1) the activation of weight raising and other mechanisms for
reducing latency for the target app, 2) the assignment of a different
priority class, which (cleanly) means just that processes in a lower
priority class will be served only when the processes of the target
app have no pending I/O request.  Finding a clean boosting API would
be one of the main steps to turn our prototype into a usable solution.
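
To make the above concrete, the hack amounts to something like the sketch
below, run by the launcher on the pid of the target application (the macros
are assumptions that simply mirror the kernel's ioprio encoding, since glibc
provides no wrapper for the ioprio_set system call):

/*
 * Sketch: mark the I/O of the target process as RT-class, so that its
 * requests are served before any best-effort I/O.
 */
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_RT		1
#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_PRIO_VALUE(class, data) \
	(((class) << IOPRIO_CLASS_SHIFT) | (data))

static int boost_io_of(pid_t pid)
{
	/* RT class, highest priority level (0) within the class. */
	return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid,
		       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
}

int main(int argc, char **argv)
{
	pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();

	return boost_io_of(pid) ? 1 : 0;
}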

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [PATCH V3 02/16] block, bfq: add full hierarchical scheduling and cgroups support

2017-04-19 Thread Paolo Valente

> Il giorno 19 apr 2017, alle ore 07:33, Paolo Valente 
>  ha scritto:
> 
>> 
>> Il giorno 18 apr 2017, alle ore 09:04, Tejun Heo  ha 
>> scritto:
>> 
>> Hello, Paolo.
>> 
>> On Wed, Apr 12, 2017 at 07:22:03AM +0200, Paolo Valente wrote:
>>> could you elaborate a bit more on this?  I mean, cgroups support has
>>> been in BFQ (and CFQ) for almost ten years, perfectly working as far
>>> as I know.  Of course it is perfectly working in terms of I/O and not
>>> of CPU bandwidth distribution; and, for the moment, it is effective
>>> only for devices below 30-50KIOPS.  What's the point in throwing
>>> (momentarily?) away such a fundamental feature?  What am I missing?
>> 
>> I've been trying to track down latency issues with the CPU controller
>> which basically takes the same approach and I'm not sure nesting
>> scheduler timelines is a good approach.  It intuitively feels elegant
>> but seems to have some fundamental issues.  IIUC, bfq isn't quite the
>> same in that it doesn't need load balancer across multiple queues and
>> it could be that bfq is close enough to the basic model that the
>> nested behavior maps to the correct scheduling behavior.
>> 
>> However, for example, in the CPU controller, the nested timelines
>> break sleeper boost.  The boost is implemented by considering the
>> thread to have woken up upto some duration prior to the current time;
>> however, it only affects the timeline inside the cgroup and there's no
>> good way to propagate it upwards.  The final result is two threads in
>> a cgroup with the double weight can behave significantly worse in
>> terms of latency compared to two threads with the weight of 1 in the
>> root.
>> 
> 
> Hi Tejun,
> I don't know in detail the specific multiple-queue issues you report,
> but bfq implements the upward propagation you mention: if a process in
> a group is to be privileged, i.e., if the process has basically to be
> provided with a higher weight (in addition to other important forms of
> help), then this weight boost is propagated upward through the path
> from the process to the root node in the group hierarchy.
> 

ERRATA CORRIGE: actually, this propagation is implemented in a simple
variant of bfq that I made for a virtualization company (to truly
guarantee a low latency to the processes in a guest OS, regardless of
the load in the host).  The base version of bfq in these patches
contains all the mechanisms needed to get this propagation, but
doesn't modify group weights autonomously.

Paolo

>> Given that the nested scheduling ends up pretty expensive, I'm not
>> sure how good a model this nesting approach is.  Especially if there
>> can be multiple queues, the weight distribution across cgroup
>> instances across multiple queues has to be coordinated globally
>> anyway,
> 
> To get perfect global service guarantees, yes.  But you can settle
> with tradeoffs that, according to my experience with storage and
> packet I/O, are so good as to be probably indistinguishable from an
> ideal, but too costly solution.  I mean, with a well-done approximated
> scheduling solution, the deviation with respect to an ideal service
> can be of the same order as the noise caused by unavoidable latencies
> of other sw and hw components than the scheduler.
> 
>> so the weight / cost adjustment part can't happen
>> automatically anyway as in single queue case.  If we're going there,
>> we might as well implement cgroup support by actively modulating the
>> combined weights, which will make individual scheduling operations
>> cheaper and it easier to think about and guarantee latency behaviors.
>> 
> 
> Yes.  Anyway, I didn't quite understand what is or could be the
> alternative, w.r.t. hierarchical scheduling, for guaranteeing
> bandwidth distribution of shared resources in a complex setting.  If
> you think I could be of any help on this, just put me somehow in the
> loop.
> 
>> If you think that bfq will stay single queue and won't need timeline
>> modifying heuristics (for responsiveness or whatever), the current
>> approach could be fine, but I'm a bit wary about committing to the
>> current approach if we're gonna encounter the same problems.
>> 
> 
> As of now, bfq is targeted at not too fast devices (< 30-50KIOPS),
> which happen to be single queue.  In particular, bfq is currently
> agnostic w.r.t. the number of downstream queues.
> 
> Thanks,
> Paolo
> 
>> Thanks.
>> 
>> -- 
>> tejun



Re: [PATCH V4 00/16] Introduce the BFQ I/O scheduler

2017-04-19 Thread Paolo Valente

> Il giorno 12 apr 2017, alle ore 18:23, Paolo Valente 
>  ha scritto:
> 
> Hi,
> new patch series, addressing (both) issues raised by Bart [1], and
> with block/Makefile fixed as suggested by Bart [2].
> 

Hi Jens,
apparently no complaint of any sort on this last series.  Do you think
we could make it for 4.12, or shall we aim at 4.13?

Thanks,
Paolo

> Thanks,
> Paolo
> 
> [1] https://lkml.org/lkml/2017/3/31/393
> [2] https://lkml.org/lkml/2017/4/12/502
> 
> Arianna Avanzini (4):
>  block, bfq: add full hierarchical scheduling and cgroups support
>  block, bfq: add Early Queue Merge (EQM)
>  block, bfq: reduce idling only in symmetric scenarios
>  block, bfq: handle bursts of queue activations
> 
> Paolo Valente (12):
>  block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
>  block, bfq: improve throughput boosting
>  block, bfq: modify the peak-rate estimator
>  block, bfq: add more fairness with writes and slow processes
>  block, bfq: improve responsiveness
>  block, bfq: reduce I/O latency for soft real-time applications
>  block, bfq: preserve a low latency also with NCQ-capable drives
>  block, bfq: reduce latency during request-pool saturation
>  block, bfq: boost the throughput on NCQ-capable flash-based devices
>  block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
>  block, bfq: remove all get and put of I/O contexts
>  block, bfq: split bfq-iosched.c into multiple source files
> 
> Documentation/block/00-INDEX|2 +
> Documentation/block/bfq-iosched.txt |  531 
> block/Kconfig.iosched   |   21 +
> block/Makefile  |2 +
> block/bfq-cgroup.c  | 1139 
> block/bfq-iosched.c | 5047 +++
> block/bfq-iosched.h |  942 +++
> block/bfq-wf2q.c| 1616 +++
> include/linux/blkdev.h  |2 +-
> 9 files changed, 9301 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/block/bfq-iosched.txt
> create mode 100644 block/bfq-cgroup.c
> create mode 100644 block/bfq-iosched.c
> create mode 100644 block/bfq-iosched.h
> create mode 100644 block/bfq-wf2q.c
> 
> --
> 2.10.0



Re: [PATCH V4 00/16] Introduce the BFQ I/O scheduler

2017-04-19 Thread Paolo Valente

> Il giorno 19 apr 2017, alle ore 16:33, Jens Axboe  ha 
> scritto:
> 
> On 04/19/2017 03:23 AM, Paolo Valente wrote:
>> 
>>> Il giorno 12 apr 2017, alle ore 18:23, Paolo Valente 
>>>  ha scritto:
>>> 
>>> Hi,
>>> new patch series, addressing (both) issues raised by Bart [1], and
>>> with block/Makefile fixed as suggested by Bart [2].
>>> 
>> 
>> Hi Jens,
>> apparently no complaint of any sort on this last series.  Do you think
>> we could make it for 4.12, or shall we aim at 4.13?
> 
> We may as well queue it up now, I don't think there's much point in
> deferring an extra cycle.
> 

A little scary, but (of course) fine for me.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



[PATCH BUGFIX] block, bfq: update wr_busy_queues if needed on a queue split

2017-06-18 Thread Paolo Valente
This commit fixes a bug triggered by a non-trivial sequence of
events. These events are briefly described in the next two
paragraphs. The impatient, or those who are familiar with queue
merging and splitting, can jump directly to the last paragraph.

On each I/O-request arrival for a shared bfq_queue, i.e., for a
bfq_queue that is the result of the merge of two or more bfq_queues,
BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
many random I/O requests have arrived for the bfq_queue; if the device
is non-rotational, then random requests must also be small for the
bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
detected as seeky, then a split occurs: the bfq I/O context of the
process that has issued the request is redirected from the shared
bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
shared bfq_queue actually happens to be shared only by one process
(because of previous splits), then no new bfq_queue is created: the
state of the shared bfq_queue is just changed from shared to non
shared.

Regardless of whether a brand new non-shared bfq_queue is created, or
the pre-existing shared bfq_queue is just turned into a non-shared
bfq_queue, several parameters of the non-shared bfq_queue are set
(restored) to the original values they had when the bfq_queue
associated with the bfq I/O context of the process (that has just
issued an I/O request) was merged with the shared bfq_queue. One of
these parameters is the weight-raising state.

If, on the split of a shared bfq_queue,
1) a pre-existing shared bfq_queue is turned into a non-shared
bfq_queue;
2) the previously shared bfq_queue happens to be busy;
3) the weight-raising state of the previously shared bfq_queue happens
to change;
the number of weight-raised busy queues changes. The field
wr_busy_queues must then be updated accordingly, but such an update
was missing. This commit adds the missing update.

Reported-by: Luca Miccio 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ed93da2..4731cfb 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -725,8 +725,12 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 }
 
 static void
-bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
+ struct bfq_io_cq *bic, bool bfq_already_existing)
 {
+   unsigned int old_wr_coeff;
+   bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
+
if (bic->saved_idle_window)
bfq_mark_bfqq_idle_window(bfqq);
else
@@ -737,6 +741,9 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_io_cq *bic)
else
bfq_clear_bfqq_IO_bound(bfqq);
 
+   if (unlikely(busy))
+   old_wr_coeff = bfqq->wr_coeff;
+
bfqq->ttime = bic->saved_ttime;
bfqq->wr_coeff = bic->saved_wr_coeff;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
@@ -754,6 +761,14 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_io_cq *bic)
 
/* make sure weight will be updated, however we got here */
bfqq->entity.prio_changed = 1;
+
+   if (likely(!busy))
+   return;
+
+   if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
+   bfqd->wr_busy_queues++;
+   else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
+   bfqd->wr_busy_queues--;
 }
 
 static int bfqq_process_refs(struct bfq_queue *bfqq)
@@ -4402,7 +4417,7 @@ static int bfq_get_rq_private(struct request_queue *q, 
struct request *rq,
const int is_sync = rq_is_sync(rq);
struct bfq_queue *bfqq;
bool new_queue = false;
-   bool split = false;
+   bool bfqq_already_existing = false, split = false;
 
spin_lock_irq(&bfqd->lock);
 
@@ -4432,6 +4447,8 @@ static int bfq_get_rq_private(struct request_queue *q, 
struct request *rq,
bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
 true, is_sync,
 NULL);
+   else
+   bfqq_already_existing = true;
}
}
 
@@ -4457,7 +4474,8 @@ static int bfq_get_rq_private(struct request_queue *q, 
struct request *rq,
 * queue: restore the idle window and the
 * possible weight raising period.
 */
-   bfq_bfqq_resume_state(bfqq, bic);
+   bfq_bfqq_resume_state(bfqq, bfqd, bic,
+ bfqq_already_existing);
}
}
 
-- 
2.10.0



Re: [PATCH BUGFIX] block, bfq: update wr_busy_queues if needed on a queue split

2017-06-19 Thread Paolo Valente

> Il giorno 19 giu 2017, alle ore 09:38, kbuild test robot  ha 
> scritto:
> 
> Hi Paolo,
> 
> [auto build test WARNING on v4.12-rc5]
> [also build test WARNING on next-20170616]
> [cannot apply to block/for-next]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-update-wr_busy_queues-if-needed-on-a-queue-split/20170619-145003
> config: i386-randconfig-x000-201725 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
># save the attached .config to linux build tree
>make ARCH=i386 
> 
> Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
> http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings
> 
> All warnings (new ones prefixed by >>):
> 
>   block/bfq-iosched.c: In function 'bfq_get_rq_private':
>>> block/bfq-iosched.c:770:10: warning: 'old_wr_coeff' may be used 
>>> uninitialized in this function [-Wmaybe-uninitialized]
> else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
> ^
>   block/bfq-iosched.c:731:15: note: 'old_wr_coeff' was declared here
> unsigned int old_wr_coeff;
>  ^~~~
> 

I'm sending a V2, (probably imperceptibly) slower on average, but not
confusing the compiler.

Thanks,
Paolo

> vim +/old_wr_coeff +770 block/bfq-iosched.c
> 
>   754 time_is_before_jiffies(bfqq->last_wr_start_finish +
>   755bfqq->wr_cur_max_time))) {
>   756 bfq_log_bfqq(bfqq->bfqd, bfqq,
>   757 "resume state: switching off wr");
>   758 
>   759 bfqq->wr_coeff = 1;
>   760 }
>   761 
>   762 /* make sure weight will be updated, however we got here */
>   763 bfqq->entity.prio_changed = 1;
>   764 
>   765 if (likely(!busy))
>   766 return;
>   767 
>   768 if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
>   769 bfqd->wr_busy_queues++;
>> 770  else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
>   771 bfqd->wr_busy_queues--;
>   772 }
>   773 
>   774 static int bfqq_process_refs(struct bfq_queue *bfqq)
>   775 {
>   776 return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
>   777 }
>   778 
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> <.config.gz>



[PATCH BUGFIX V2] block, bfq: update wr_busy_queues if needed on a queue split

2017-06-19 Thread Paolo Valente
This commit fixes a bug triggered by a non-trivial sequence of
events. These events are briefly described in the next two
paragraphs. The impatient, or those who are familiar with queue
merging and splitting, can jump directly to the last paragraph.

On each I/O-request arrival for a shared bfq_queue, i.e., for a
bfq_queue that is the result of the merge of two or more bfq_queues,
BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
many random I/O requests have arrived for the bfq_queue; if the device
is non-rotational, then random requests must also be small for the
bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
detected as seeky, then a split occurs: the bfq I/O context of the
process that has issued the request is redirected from the shared
bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
shared bfq_queue actually happens to be shared only by one process
(because of previous splits), then no new bfq_queue is created: the
state of the shared bfq_queue is just changed from shared to non
shared.

Regardless of whether a brand new non-shared bfq_queue is created, or
the pre-existing shared bfq_queue is just turned into a non-shared
bfq_queue, several parameters of the non-shared bfq_queue are set
(restored) to the original values they had when the bfq_queue
associated with the bfq I/O context of the process (that has just
issued an I/O request) was merged with the shared bfq_queue. One of
these parameters is the weight-raising state.

If, on the split of a shared bfq_queue,
1) a pre-existing shared bfq_queue is turned into a non-shared
bfq_queue;
2) the previously shared bfq_queue happens to be busy;
3) the weight-raising state of the previously shared bfq_queue happens
to change;
the number of weight-raised busy queues changes. The field
wr_busy_queues must then be updated accordingly, but such an update
was missing. This commit adds the missing update.

Reported-by: Luca Miccio 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ed93da2..bbeaf52 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -725,8 +725,12 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 }
 
 static void
-bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
+ struct bfq_io_cq *bic, bool bfq_already_existing)
 {
+   unsigned int old_wr_coeff = bfqq->wr_coeff;
+   bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
+
if (bic->saved_idle_window)
bfq_mark_bfqq_idle_window(bfqq);
else
@@ -754,6 +758,14 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct 
bfq_io_cq *bic)
 
/* make sure weight will be updated, however we got here */
bfqq->entity.prio_changed = 1;
+
+   if (likely(!busy))
+   return;
+
+   if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
+   bfqd->wr_busy_queues++;
+   else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
+   bfqd->wr_busy_queues--;
 }
 
 static int bfqq_process_refs(struct bfq_queue *bfqq)
@@ -4402,7 +4414,7 @@ static int bfq_get_rq_private(struct request_queue *q, 
struct request *rq,
const int is_sync = rq_is_sync(rq);
struct bfq_queue *bfqq;
bool new_queue = false;
-   bool split = false;
+   bool bfqq_already_existing = false, split = false;
 
spin_lock_irq(&bfqd->lock);
 
@@ -4432,6 +4444,8 @@ static int bfq_get_rq_private(struct request_queue *q, 
struct request *rq,
bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
 true, is_sync,
 NULL);
+   else
+   bfqq_already_existing = true;
}
}
 
@@ -4457,7 +4471,8 @@ static int bfq_get_rq_private(struct request_queue *q, 
struct request *rq,
 * queue: restore the idle window and the
 * possible weight raising period.
 */
-   bfq_bfqq_resume_state(bfqq, bic);
+   bfq_bfqq_resume_state(bfqq, bfqd, bic,
+ bfqq_already_existing);
}
}
 
-- 
2.10.0



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-06-08 Thread Paolo Valente

> Il giorno 05 giu 2017, alle ore 10:11, Paolo Valente 
>  ha scritto:
> 
> In blk-cgroup, operations on blkg objects are protected with the
> request_queue lock. This is no longer the lock that protects
> I/O-scheduler operations in blk-mq. In fact, the latter are now
> protected with a finer-grained per-scheduler-instance lock. As a
> consequence, although blkg lookups are also rcu-protected, blk-mq I/O
> schedulers may see inconsistent data when they access blkg and
> blkg-related objects. BFQ does access these objects, and does incur
> this problem, in the following case.
> 
> The blkg_lookup performed in bfq_get_queue, being protected (only)
> through rcu, may happen to return the address of a copy of the
> original blkg. If this is the case, then the blkg_get performed in
> bfq_get_queue, to pin down the blkg, is useless: it does not prevent
> blk-cgroup code from destroying both the original blkg and all objects
> directly or indirectly referred by the copy of the blkg. BFQ accesses
> these objects, which typically causes a crash for NULL-pointer
> dereference or memory-protection violation.
> 
> Some additional protection mechanism should be added to blk-cgroup to
> address this issue. In the meantime, this commit provides a quick
> temporary fix for BFQ: cache (when safe) blkg data that might
> disappear right after a blkg_lookup.
> 
> In particular, this commit exploits the following facts to achieve its
> goal without introducing further locks.  Destroy operations on a blkg
> invoke, as a first step, hooks of the scheduler associated with the
> blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
> consequence, for any blkg associated with the request queue an
> instance of BFQ is attached to, we are guaranteed that such a blkg is
> not destroyed, and that all the pointers it contains are consistent,
> while that instance is holding its bfqd->lock. A blkg_lookup performed
> with bfqd->lock held then returns a fully consistent blkg, which
> remains consistent as long as this lock is held. In more detail, this holds
> even if the returned blkg is a copy of the original one.
> 
> Finally, also the object describing a group inside BFQ needs to be
> protected from destruction on the blkg_free of the original blkg
> (which invokes bfq_pd_free). This commit adds private refcounting for
> this object, to let it disappear only after no bfq_queue refers to it
> any longer.
> 
> This commit also removes or updates some stale comments on locking
> issues related to blk-cgroup operations.
> 
> Reported-by: Tomas Konir 
> Reported-by: Lee Tibbert 
> Reported-by: Marco Piazza 
> Signed-off-by: Paolo Valente 
> Tested-by: Tomas Konir 
> Tested-by: Lee Tibbert 
> Tested-by: Marco Piazza 

Hi Jens,
are you waiting for some further review/ack on this, or is it just in
your queue of patches to check?  Sorry for bothering you, but this bug
is causing problems to users.

Thanks,
Paolo

> ---
> block/bfq-cgroup.c  | 116 +---
> block/bfq-iosched.c |   2 +-
> block/bfq-iosched.h |  23 +--
> 3 files changed, 105 insertions(+), 36 deletions(-)
> 
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index c8a32fb..78b2e0d 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -52,7 +52,7 @@ BFQG_FLAG_FNS(idling)
> BFQG_FLAG_FNS(empty)
> #undef BFQG_FLAG_FNS
> 
> -/* This should be called with the queue_lock held. */
> +/* This should be called with the scheduler lock held. */
> static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
> {
>   unsigned long long now;
> @@ -67,7 +67,7 @@ static void bfqg_stats_update_group_wait_time(struct 
> bfqg_stats *stats)
>   bfqg_stats_clear_waiting(stats);
> }
> 
> -/* This should be called with the queue_lock held. */
> +/* This should be called with the scheduler lock held. */
> static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
>struct bfq_group *curr_bfqg)
> {
> @@ -81,7 +81,7 @@ static void bfqg_stats_set_start_group_wait_time(struct 
> bfq_group *bfqg,
>   bfqg_stats_mark_waiting(stats);
> }
> 
> -/* This should be called with the queue_lock held. */
> +/* This should be called with the scheduler lock held. */
> static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
> {
>   unsigned long long now;
> @@ -203,12 +203,30 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
> 
> static void bfqg_get(struct bfq_group *bfqg)
> {
> - return blkg_get(bfqg_to_blkg(bfqg));
> + bfqg->ref++;
> }
> 
> void bfqg_put(struct bfq_group *bfqg)
> {
> - retur

[PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-06-05 Thread Paolo Valente
In blk-cgroup, operations on blkg objects are protected with the
request_queue lock. This is no longer the lock that protects
I/O-scheduler operations in blk-mq. In fact, the latter are now
protected with a finer-grained per-scheduler-instance lock. As a
consequence, although blkg lookups are also rcu-protected, blk-mq I/O
schedulers may see inconsistent data when they access blkg and
blkg-related objects. BFQ does access these objects, and does incur
this problem, in the following case.

The blkg_lookup performed in bfq_get_queue, being protected (only)
through rcu, may happen to return the address of a copy of the
original blkg. If this is the case, then the blkg_get performed in
bfq_get_queue, to pin down the blkg, is useless: it does not prevent
blk-cgroup code from destroying both the original blkg and all objects
directly or indirectly referred by the copy of the blkg. BFQ accesses
these objects, which typically causes a crash for NULL-pointer
dereference or memory-protection violation.

Some additional protection mechanism should be added to blk-cgroup to
address this issue. In the meantime, this commit provides a quick
temporary fix for BFQ: cache (when safe) blkg data that might
disappear right after a blkg_lookup.

In particular, this commit exploits the following facts to achieve its
goal without introducing further locks.  Destroy operations on a blkg
invoke, as a first step, hooks of the scheduler associated with the
blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
consequence, for any blkg associated with the request queue an
instance of BFQ is attached to, we are guaranteed that such a blkg is
not destroyed, and that all the pointers it contains are consistent,
while that instance is holding its bfqd->lock. A blkg_lookup performed
with bfqd->lock held then returns a fully consistent blkg, which
remains consistent as long as this lock is held. In more detail, this holds
even if the returned blkg is a copy of the original one.

Finally, also the object describing a group inside BFQ needs to be
protected from destruction on the blkg_free of the original blkg
(which invokes bfq_pd_free). This commit adds private refcounting for
this object, to let it disappear only after no bfq_queue refers to it
any longer.

This commit also removes or updates some stale comments on locking
issues related to blk-cgroup operations.

Reported-by: Tomas Konir 
Reported-by: Lee Tibbert 
Reported-by: Marco Piazza 
Signed-off-by: Paolo Valente 
Tested-by: Tomas Konir 
Tested-by: Lee Tibbert 
Tested-by: Marco Piazza 
---
 block/bfq-cgroup.c  | 116 +---
 block/bfq-iosched.c |   2 +-
 block/bfq-iosched.h |  23 +--
 3 files changed, 105 insertions(+), 36 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c8a32fb..78b2e0d 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -52,7 +52,7 @@ BFQG_FLAG_FNS(idling)
 BFQG_FLAG_FNS(empty)
 #undef BFQG_FLAG_FNS
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
 {
unsigned long long now;
@@ -67,7 +67,7 @@ static void bfqg_stats_update_group_wait_time(struct 
bfqg_stats *stats)
bfqg_stats_clear_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
 struct bfq_group *curr_bfqg)
 {
@@ -81,7 +81,7 @@ static void bfqg_stats_set_start_group_wait_time(struct 
bfq_group *bfqg,
bfqg_stats_mark_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
 {
unsigned long long now;
@@ -203,12 +203,30 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
 
 static void bfqg_get(struct bfq_group *bfqg)
 {
-   return blkg_get(bfqg_to_blkg(bfqg));
+   bfqg->ref++;
 }
 
 void bfqg_put(struct bfq_group *bfqg)
 {
-   return blkg_put(bfqg_to_blkg(bfqg));
+   bfqg->ref--;
+
+   if (bfqg->ref == 0)
+   kfree(bfqg);
+}
+
+static void bfqg_and_blkg_get(struct bfq_group *bfqg)
+{
+   /* see comments in bfq_bic_update_cgroup for why refcounting bfqg */
+   bfqg_get(bfqg);
+
+   blkg_get(bfqg_to_blkg(bfqg));
+}
+
+void bfqg_and_blkg_put(struct bfq_group *bfqg)
+{
+   bfqg_put(bfqg);
+
+   blkg_put(bfqg_to_blkg(bfqg));
 }
 
 void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq,
@@ -312,7 +330,11 @@ void bfq_init_entity(struct bfq_entity *entity, struct 
bfq_group *bfqg)
if (bfqq) {
bfqq->ioprio = bfqq->new_ioprio;
bfqq->ioprio_class = bfqq->new_ioprio_class;

Re: [WIP BRANCH] cgroups support in bfq-mq WIP branch

2017-03-02 Thread Paolo Valente

> Il giorno 25 feb 2017, alle ore 19:52, Jens Axboe  ha 
> scritto:
> 
> On 02/25/2017 10:44 AM, Paolo Valente wrote:
>> Hi,
>> I've just completed cgroups support, and I'd like to highlight the
>> main blk-mq issue that I have found along the way.  I have pushed the
>> commit that completes the support for cgroups to the usual WIP branch
>> [1].  Before moving to this issue, I have preliminary question about
>> the scheduler name, since I'm about to start preparing the patch
>> series for submission.  So far, I have used bfq-mq as a temporary
>> name.  Are we fine with it, or should I change it, for example, to
>> just bfq?  Jens?
> 
> Just call it 'bfq', that doesn't conflict with anything that's
> in the kernel already.
> 

ok

>> I've found a sort of circular dependency in blk-mq, related to
>> scheduler initialization.  To describe both the issue and how I've
>> addressed it, I'm pasting the message of the new commit.
> 
> Rebase your patches on top of Linus current master, some of them
> will need to change and some can be dropped.
> 

Done, but the last deadlock issue shows up again :( To help you get
context, I'm going to reply to the email in which your sent the patch that
solved it.


> And disentangle it completely from the old bfq, I don't want to see
> nasty stuff like includes of .c files with prior defines modifying
> behavior of functions.
> 

Of course.

> When that's done, get it posted for review asap. I would imagine
> we will go through a few postings and review cycles, and if we're
> targeting 4.12 with this, then we should get the ball rolling
> on that side.
> 

I was about to to submit, but bumped into the above regression.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



Re: [PATCHSET v6] blk-mq scheduling framework

2017-01-15 Thread Paolo Valente

> Il giorno 11 gen 2017, alle ore 22:39, Jens Axboe  ha scritto:
> 
> Another year, another posting of this patchset. The previous posting
> was here:
> 
> https://www.spinics.net/lists/kernel/msg2406106.html
> 
> (yes, I've skipped v5, it was fixes on top of v4, not the rework).
> 
> I've reworked bits of this to get rid of the shadow requests, thanks
> to Bart for the inspiration. The missing piece, for me, was the fact
> that we have the tags->rqs[] indirection array already. I've done this
> somewhat differently, though, by having the internal scheduler tag
> map be allocated/torn down when an IO scheduler is attached or
> detached. This also means that when we run without a scheduler, we
> don't have to do double tag allocations, it'll work like before.
> 
> The patchset applies on top of 4.10-rc3, or can be pulled here:
> 
> git://git.kernel.dk/linux-block blk-mq-sched.6
> 
> 

Hi Jens,
I have checked this new version to find solutions to the apparent
errors, mistakes or just unclear parts (to me) that I have pointed out
before Christmas last year.  But I have found no changes related to
these problems.

As I have already written, I'm willing to try to fix those errors
myself, if they really are errors, but I would first need at least
some minimal initial feedback and guidance.  If needed, tell me how I
can help you get in sync again with these issues (sending my reports
again, sending a digest of them, ...).

Thanks,
Paolo

> block/Kconfig.iosched|   50 
> block/Makefile   |3 
> block/blk-core.c |   19 -
> block/blk-exec.c |3 
> block/blk-flush.c|   15 -
> block/blk-ioc.c  |   12 
> block/blk-merge.c|4 
> block/blk-mq-sched.c |  354 +
> block/blk-mq-sched.h |  157 
> block/blk-mq-sysfs.c |   13 +
> block/blk-mq-tag.c   |   58 ++--
> block/blk-mq-tag.h   |4 
> block/blk-mq.c   |  413 +++---
> block/blk-mq.h   |   40 +++
> block/blk-tag.c  |1 
> block/blk.h  |   26 +-
> block/cfq-iosched.c  |2 
> block/deadline-iosched.c |2 
> block/elevator.c |  247 +++-
> block/mq-deadline.c  |  569 
> +++
> block/noop-iosched.c |2 
> drivers/nvme/host/pci.c  |1 
> include/linux/blk-mq.h   |9 
> include/linux/blkdev.h   |6 
> include/linux/elevator.h |   36 ++
> 25 files changed, 1732 insertions(+), 314 deletions(-)
> 
> -- 
> Jens Axboe
> 




Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2017-02-01 Thread Paolo Valente

> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe  ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 
> Signed-off-by: Jens Axboe 
> ---
> block/Kconfig.iosched |   6 +
> block/Makefile|   1 +
> block/mq-deadline.c   | 649 ++
> 3 files changed, 656 insertions(+)
> create mode 100644 block/mq-deadline.c
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 421bef9c4c48..490ef2850fae 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -32,6 +32,12 @@ config IOSCHED_CFQ
> 
> This is the default I/O scheduler.
> 
> +config MQ_IOSCHED_DEADLINE
> + tristate "MQ deadline I/O scheduler"
> + default y
> + ---help---
> +   MQ version of the deadline IO scheduler.
> +
> config CFQ_GROUP_IOSCHED
>   bool "CFQ Group Scheduling support"
>   depends on IOSCHED_CFQ && BLK_CGROUP
> diff --git a/block/Makefile b/block/Makefile
> index 2eee9e1bb6db..3ee0abd7205a 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)+= blk-throttle.o
> obj-$(CONFIG_IOSCHED_NOOP)+= noop-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE)+= deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
> +obj-$(CONFIG_MQ_IOSCHED_DEADLINE)+= mq-deadline.o
> 
> obj-$(CONFIG_BLOCK_COMPAT)+= compat_ioctl.o
> obj-$(CONFIG_BLK_CMDLINE_PARSER)  += cmdline-parser.o
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index ..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> @@ -0,0 +1,649 @@
> +/*
> + *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
> + *  for the blk-mq scheduling framework
> + *
> + *  Copyright (C) 2016 Jens Axboe 
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "blk.h"
> +#include "blk-mq.h"
> +#include "blk-mq-tag.h"
> +#include "blk-mq-sched.h"
> +
> +static unsigned int queue_depth = 256;
> +module_param(queue_depth, uint, 0644);
> +MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
> +
> +/*
> + * See Documentation/block/deadline-iosched.txt
> + */
> +static const int read_expire = HZ / 2;  /* max time before a read is 
> submitted. */
> +static const int write_expire = 5 * HZ; /* ditto for writes, these limits 
> are SOFT! */
> +static const int writes_starved = 2;/* max times reads can starve a 
> write */
> +static const int fifo_batch = 16;   /* # of sequential requests treated 
> as one
> +  by the above parameters. For throughput. */
> +
> +struct deadline_data {
> + /*
> +  * run time data
> +  */
> +
> + /*
> +  * requests (deadline_rq s) are present on both sort_list and fifo_list
> +  */
> + struct rb_root sort_list[2];
> + struct list_head fifo_list[2];
> +
> + /*
> +  * next in sort order. read, write or both are NULL
> +  */
> + struct request *next_rq[2];
> + unsigned int batching;  /* number of sequential requests made */
> + unsigned int starved;   /* times reads have starved writes */
> +
> + /*
> +  * settings that change how the i/o scheduler behaves
> +  */
> + int fifo_expire[2];
> + int fifo_batch;
> + int writes_starved;
> + int front_merges;
> +
> + spinlock_t lock;
> + struct list_head dispatch;
> + struct blk_mq_tags *tags;
> + atomic_t wait_index;
> +};
> +
> +static inline struct rb_root *
> +deadline_rb_root(struct deadline_data *dd, struct request *rq)
> +{
> + return &dd->sort_list[rq_data_dir(rq)];
> +}
> +
> +/*
> + * get the request after `rq' in sector-sorted order
> + */
> +static inline struct request *
> +deadline_latter_request(struct request *rq)
> +{
> + struct rb_node *node = rb_next(&rq->rb_node);
> +
> + if (node)
> + return rb_entry_rq(node);
> +
> + return NULL;
> +}
> +
> +static void
> +deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> + struct rb_root *root = deadline_rb_root(dd, rq);
> +
> + elv_rb_add(root, rq);
> +}
> +
> +static inline void
> +deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> + const int data_dir = rq_data_dir(rq);
> +
> + if (dd->next_rq[data_dir] == rq)
> + dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> + elv_rb_del(deadline_rb_root(dd, rq), rq);
> +}
> +
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request 
> *rq)
> +{
> + struct deadline_data *dd = q->elevator->elevator_data;
> +
> + list_del_init(&rq->queuelist);
> +
> + /*
> +  * We might not be on the rbtree, if we are 

Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2017-02-01 Thread Paolo Valente

> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe  ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 
> Signed-off-by: Jens Axboe 
> ---
> block/Kconfig.iosched |   6 +
> block/Makefile|   1 +
> block/mq-deadline.c   | 649 ++
> 3 files changed, 656 insertions(+)
> create mode 100644 block/mq-deadline.c
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 421bef9c4c48..490ef2850fae 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -32,6 +32,12 @@ config IOSCHED_CFQ
> 
> This is the default I/O scheduler.
> 
> +config MQ_IOSCHED_DEADLINE
> + tristate "MQ deadline I/O scheduler"
> + default y
> + ---help---
> +   MQ version of the deadline IO scheduler.
> +
> config CFQ_GROUP_IOSCHED
>   bool "CFQ Group Scheduling support"
>   depends on IOSCHED_CFQ && BLK_CGROUP
> diff --git a/block/Makefile b/block/Makefile
> index 2eee9e1bb6db..3ee0abd7205a 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)+= blk-throttle.o
> obj-$(CONFIG_IOSCHED_NOOP)+= noop-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE)+= deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
> +obj-$(CONFIG_MQ_IOSCHED_DEADLINE)+= mq-deadline.o
> 
> obj-$(CONFIG_BLOCK_COMPAT)+= compat_ioctl.o
> obj-$(CONFIG_BLK_CMDLINE_PARSER)  += cmdline-parser.o
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index ..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> @@ -0,0 +1,649 @@
> +/*
> + *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
> + *  for the blk-mq scheduling framework
> + *
> + *  Copyright (C) 2016 Jens Axboe 
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "blk.h"
> +#include "blk-mq.h"
> +#include "blk-mq-tag.h"
> +#include "blk-mq-sched.h"
> +
> +static unsigned int queue_depth = 256;
> +module_param(queue_depth, uint, 0644);
> +MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
> +
> +/*
> + * See Documentation/block/deadline-iosched.txt
> + */
> +static const int read_expire = HZ / 2;  /* max time before a read is 
> submitted. */
> +static const int write_expire = 5 * HZ; /* ditto for writes, these limits 
> are SOFT! */
> +static const int writes_starved = 2;/* max times reads can starve a 
> write */
> +static const int fifo_batch = 16;   /* # of sequential requests treated 
> as one
> +  by the above parameters. For throughput. */
> +
> +struct deadline_data {
> + /*
> +  * run time data
> +  */
> +
> + /*
> +  * requests (deadline_rq s) are present on both sort_list and fifo_list
> +  */
> + struct rb_root sort_list[2];
> + struct list_head fifo_list[2];
> +
> + /*
> +  * next in sort order. read, write or both are NULL
> +  */
> + struct request *next_rq[2];
> + unsigned int batching;  /* number of sequential requests made */
> + unsigned int starved;   /* times reads have starved writes */
> +
> + /*
> +  * settings that change how the i/o scheduler behaves
> +  */
> + int fifo_expire[2];
> + int fifo_batch;
> + int writes_starved;
> + int front_merges;
> +
> + spinlock_t lock;
> + struct list_head dispatch;
> + struct blk_mq_tags *tags;
> + atomic_t wait_index;
> +};
> +
> +static inline struct rb_root *
> +deadline_rb_root(struct deadline_data *dd, struct request *rq)
> +{
> + return &dd->sort_list[rq_data_dir(rq)];
> +}
> +
> +/*
> + * get the request after `rq' in sector-sorted order
> + */
> +static inline struct request *
> +deadline_latter_request(struct request *rq)
> +{
> + struct rb_node *node = rb_next(&rq->rb_node);
> +
> + if (node)
> + return rb_entry_rq(node);
> +
> + return NULL;
> +}
> +
> +static void
> +deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> + struct rb_root *root = deadline_rb_root(dd, rq);
> +
> + elv_rb_add(root, rq);
> +}
> +
> +static inline void
> +deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> + const int data_dir = rq_data_dir(rq);
> +
> + if (dd->next_rq[data_dir] == rq)
> + dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> + elv_rb_del(deadline_rb_root(dd, rq), rq);
> +}
> +
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
> +{
> + struct deadline_data *dd = q->elevator->elevator_data;
> +
> + list_del_init(&rq->queuelist);
> +
> + /*
> +  * We might not be on the rbtree, if we are 

Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2017-02-02 Thread Paolo Valente

> Il giorno 02 feb 2017, alle ore 06:19, Jens Axboe  ha scritto:
> 
> On 02/01/2017 04:11 AM, Paolo Valente wrote:
>>> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
>>> +{
>>> +   struct request_queue *q = hctx->queue;
>>> +   struct deadline_data *dd = q->elevator->elevator_data;
>>> +   int ret;
>>> +
>>> +   spin_lock(&dd->lock);
>>> +   ret = blk_mq_sched_try_merge(q, bio);
>>> +   spin_unlock(&dd->lock);
>>> +
>> 
>> Hi Jens,
>> first, good news, bfq is passing my first sanity checks.  Still, I
>> need a little more help for the following issue.  There is a case that
>> would be impossible to handle without modifying code outside bfq.  But
>> so far such a case never occurred, and I hope that it can never occur.
>> I'll try to briefly list all relevant details on this concern of mine,
>> so that you can quickly confirm my hope, or highlight where or what I
>> am missing.
> 
> Remember my earlier advice - it's not a problem to change anything in
> the core, in fact I would be surprised if you did not need to. My
> foresight isn't THAT good! It's much better to fix up an inconsistency
> there, rather than work around it in the consumer of that API.
> 
>> First, as done above for mq-deadline, invoking blk_mq_sched_try_merge
>> with the scheduler lock held is of course necessary (for example, to
>> protect q->last_merge).  This may lead to put_rq_private invoked
>> with the lock held, in case of successful merge.
> 
> Right, or some other lock with the same scope, as per my other email.
> 
>> As a consequence, put_rq_private may be invoked:
>> (1) in IRQ context, no scheduler lock held, because of a completion:
>> can be handled by deferring work and lock grabbing, because the
>> completed request is not queued in the scheduler any more;
>> (2) in process context, scheduler lock held, because of the above
>> successful merge: must be handled immediately, for consistency,
>> because the request is still queued in the scheduler;
>> (3) in process context, no scheduler lock held, for some other reason:
>> some path apparently may lead to this case, although I've never seen
>> it to happen.  Immediate handling, and hence locking, may be needed,
>> depending on whether the request is still queued in the scheduler.
>> 
>> So, my main question is: is case (3) actually impossible?  Should it
>> be possible, I guess we would have a problem, because of the
>> different lock state with respect to (2).
> 
> I agree, there's some inconsistency there, if you potentially need to
> grab the lock in your put_rq_private handler. The problem case is #2,
> when we have the merge. I would probably suggest that the best way to
> handle that is to pass back the dropped request so we can put it outside
> of holding the lock.
> 
> Let me see if I can come up with a good solution for this. We have to be
> consistent in how we invoke the scheduler functions, we can't have hooks
> that are called in unknown lock states. I also don't want you to have to
> add defer work handling in that kind of path, that will impact your
> performance and overhead.
> 

I'll try to learn from your solution, because, as of now, I don't see
how to avoid deferred work for the case where put_rq_private is
invoked in interrupt context.  In fact, for this case, we cannot grab
the lock, unless we turn all spin_lock into spin_lock_irq*.
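
Just to make the deferred-work idea concrete, here is a minimal sketch of
what I mean (all names are made up for illustration; this is not the actual
bfq-mq code, and a real implementation would also have to handle the
allocation-failure case properly):

#include <linux/workqueue.h>
#include <linux/slab.h>

/*
 * Sketch: when put_rq_private() fires in IRQ context, defer the part of
 * the work that needs the scheduler lock to process context.
 */
struct bfq_deferred_put {
	struct work_struct work;
	struct bfq_data *bfqd;
	struct bfq_queue *bfqq;
};

static void bfq_put_rq_workfn(struct work_struct *work)
{
	struct bfq_deferred_put *dp =
		container_of(work, struct bfq_deferred_put, work);

	spin_lock(&dp->bfqd->lock);	/* process context: safe to grab it */
	/* ... do the actual bookkeeping for dp->bfqq here ... */
	spin_unlock(&dp->bfqd->lock);

	kfree(dp);
}

static void bfq_put_rq_deferred(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
	struct bfq_deferred_put *dp = kmalloc(sizeof(*dp), GFP_ATOMIC);

	if (!dp)
		return;	/* see the caveat above about allocation failure */

	dp->bfqd = bfqd;
	dp->bfqq = bfqq;
	INIT_WORK(&dp->work, bfq_put_rq_workfn);
	schedule_work(&dp->work);
}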

>> Finally, I hope that it is certainly impossible to have a case (4): in
>> IRQ context, no lock held, but with the request in the scheduler.
> 
> That should not be possible.
> 
> Edit: since I'm on a flight and email won't send, I had a few minutes to
> hack this up. Totally untested, but something like the below should do
> it. Not super pretty... I'll play with this a bit more tomorrow.
> 
> 

The scheme is clear.  One comment, in case it could make sense and
avoid more complexity: since put_rq_priv is invoked in two different
contexts, process or interrupt, I did not find it too confusing that
the lock is simply not held when put_rq_priv is invoked in the context
where it cannot be held (unless one is willing to pay with IRQ
disabling all the time), while, when put_rq_priv is invoked in the
context where the lock can be held, the lock is actually held, or must
be taken.

Thanks,
Paolo

> diff --git a/block/blk-core.c b/block/blk-core.c
> index c142de090c41..530a9a3f60c9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1609,7 +1609,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> {
>   struct blk_plug *plug;
>   int el_ret, where = E

Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2017-02-02 Thread Paolo Valente

> Il giorno 02 feb 2017, alle ore 16:30, Jens Axboe  ha scritto:
> 
> On 02/02/2017 02:19 AM, Paolo Valente wrote:
>> The scheme is clear.  One comment, in case it could make sense and
>> avoid more complexity: since put_rq_priv is invoked in two different
>> contexts, process or interrupt, I didn't feel so confusing that, when
>> put_rq_priv is invoked in the context where the lock cannot be held
>> (unless one is willing to pay with irq disabling all the times), the
>> lock is not held, while, when invoked in the context where the lock
>> can be held, the lock is actually held, or must be taken.
> 
> If you grab the same lock from put_rq_priv, yes, you must make it IRQ
> disabling in all contexts, and use _irqsave() from put_rq_priv. If it's
> just freeing resources, you could potentially wait and do that when
> someone else needs them, since that part will come from proces context.
> That would need two locks, though.
> 
> As I said above, I would not worry about the IRQ disabling lock.
> 

I'm sorry, I focused only on the IRQ-disabling consequence of grabbing
the scheduler lock also in IRQ context.  I thought it was a serious
enough issue to avoid this option.  Yet there is also a deadlock
problem related to this option.  In fact, the IRQ handler may preempt
some process-context code that already holds some other locks.  If one
of these locks is also needed by another process, which is executing
on another CPU and which then tries to take the scheduler lock, or
which happens to be preempted by an IRQ handler trying to grab the
scheduler lock, then a deadlock occurs.  This is not just speculation,
but a problem that did occur before I moved to a deferred-work
solution, and that can be readily reproduced.  Before moving to
deferred work, I tried various code manipulations to avoid these
deadlocks without resorting to it, but to no avail.
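
To make the scenario more concrete, one interleaving of this kind is
(schematic only; X stands for any lock, not IRQ-safe, that the
completion path may also need):

  CPU0 (process context)                 CPU1 (process context)
  ----------------------                 ----------------------
  spin_lock(&X);                         spin_lock(&sched_lock);
      <completion IRQ arrives>           spin_lock(&X);   /* spins: X held by CPU0 */
  IRQ handler: spin_lock(&sched_lock);
      /* spins: sched_lock held by CPU1,
         which waits for X, which CPU0
         cannot release until its IRQ
         handler returns: deadlock */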

At any rate, bfq seems to work now, so I can finally move from just
asking questions endlessly to proposing actual code to discuss.
I'm about to: port this version of bfq to your improved/fixed
blk-mq-sched version in for-4.11 (port postponed, to avoid introducing
further changes in code that did not yet work), run more extensive
tests, polish commits a little bit, and finally share a branch.

Thanks,
Paolo

> -- 
> Jens Axboe
> 
> --



Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-17 Thread Paolo Valente

> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe  ha scritto:
> 
> On 12/22/2016 02:59 AM, Paolo Valente wrote:
>> 
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe  ha scritto:
>>> 
>>> This adds a set of hooks that intercepts the blk-mq path of
>>> allocating/inserting/issuing/completing requests, allowing
>>> us to develop a scheduler within that framework.
>>> 
>>> We reuse the existing elevator scheduler API on the registration
>>> side, but augment that with the scheduler flagging support for
>>> the blk-mq interfce, and with a separate set of ops hooks for MQ
>>> devices.
>>> 
>>> Schedulers can opt in to using shadow requests. Shadow requests
>>> are internal requests that the scheduler uses for for the allocate
>>> and insert part, which are then mapped to a real driver request
>>> at dispatch time. This is needed to separate the device queue depth
>>> from the pool of requests that the scheduler has to work with.
>>> 
>>> Signed-off-by: Jens Axboe 
>>> 
>> ...
>> 
>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>> new file mode 100644
>>> index ..b7e1839d4785
>>> --- /dev/null
>>> +++ b/block/blk-mq-sched.c
>> 
>>> ...
>>> +static inline bool
>>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>>> +struct bio *bio)
>>> +{
>>> +   struct elevator_queue *e = q->elevator;
>>> +
>>> +   if (e && e->type->ops.mq.allow_merge)
>>> +   return e->type->ops.mq.allow_merge(q, rq, bio);
>>> +
>>> +   return true;
>>> +}
>>> +
>> 
>> Something does not seem to add up here:
>> e->type->ops.mq.allow_merge may be called only in
>> blk_mq_sched_allow_merge, which, in its turn, may be called only in
>> blk_mq_attempt_merge, which, finally, may be called only in
>> blk_mq_merge_queue_io.  Yet the latter may be called only if there is
>> no elevator (line 1399 and 1507 in blk-mq.c).
>> 
>> Therefore, e->type->ops.mq.allow_merge can never be called, both if
>> there is and if there is not an elevator.  Be patient if I'm missing
>> something huge, but I thought it was worth reporting this.
> 
> I went through the current branch, and it seems mostly fine. There was
> a double call to allow_merge() that I killed in the plug path, and one
> set missing in blk_mq_sched_try_merge(). The rest looks OK.
> 

Yes, I missed a path, sorry.  I'm happy that at least your check has
not been a waste of time for other reasons.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-17 Thread Paolo Valente

> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe  ha scritto:
> 
> On 12/22/2016 04:13 AM, Paolo Valente wrote:
>> 
>>> Il giorno 22 dic 2016, alle ore 10:59, Paolo Valente 
>>>  ha scritto:
>>> 
>>>> 
>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe  ha 
>>>> scritto:
>>>> 
>>>> This adds a set of hooks that intercepts the blk-mq path of
>>>> allocating/inserting/issuing/completing requests, allowing
>>>> us to develop a scheduler within that framework.
>>>> 
>>>> We reuse the existing elevator scheduler API on the registration
>>>> side, but augment that with the scheduler flagging support for
>>>> the blk-mq interfce, and with a separate set of ops hooks for MQ
>>>> devices.
>>>> 
>>>> Schedulers can opt in to using shadow requests. Shadow requests
>>>> are internal requests that the scheduler uses for for the allocate
>>>> and insert part, which are then mapped to a real driver request
>>>> at dispatch time. This is needed to separate the device queue depth
>>>> from the pool of requests that the scheduler has to work with.
>>>> 
>>>> Signed-off-by: Jens Axboe 
>>>> 
>>> ...
>>> 
>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>>> new file mode 100644
>>>> index ..b7e1839d4785
>>>> --- /dev/null
>>>> +++ b/block/blk-mq-sched.c
>>> 
>>>> ...
>>>> +static inline bool
>>>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>>>> +   struct bio *bio)
>>>> +{
>>>> +  struct elevator_queue *e = q->elevator;
>>>> +
>>>> +  if (e && e->type->ops.mq.allow_merge)
>>>> +  return e->type->ops.mq.allow_merge(q, rq, bio);
>>>> +
>>>> +  return true;
>>>> +}
>>>> +
>>> 
>>> Something does not seem to add up here:
>>> e->type->ops.mq.allow_merge may be called only in
>>> blk_mq_sched_allow_merge, which, in its turn, may be called only in
>>> blk_mq_attempt_merge, which, finally, may be called only in
>>> blk_mq_merge_queue_io.  Yet the latter may be called only if there is
>>> no elevator (line 1399 and 1507 in blk-mq.c).
>>> 
>>> Therefore, e->type->ops.mq.allow_merge can never be called, both if
>>> there is and if there is not an elevator.  Be patient if I'm missing
>>> something huge, but I thought it was worth reporting this.
>>> 
>> 
>> Just another detail: if e->type->ops.mq.allow_merge does get invoked
>> from the above path, then it is invoked of course without the
>> scheduler lock held.  In contrast, if this function gets invoked
>> from dd_bio_merge, then the scheduler lock is held.
> 
> But the scheduler controls that itself. So it'd be perfectly fine to
> have a locked and unlocked variant. The way that's typically done is to
> have function() grabbing the lock, and __function() is invoked with the
> lock held.
> 
>> To handle this opposite alternatives, I don't know whether checking if
>> the lock is held (and possibly taking it) from inside
>> e->type->ops.mq.allow_merge is a good solution.  In any case, before
>> possibly trying it, I will wait for some feedback on the main problem,
>> i.e., on the fact that e->type->ops.mq.allow_merge
>> seems unreachable in the above path.
> 
> Checking if a lock is held is NEVER a good idea, as it leads to both bad
> and incorrect code. If you just check if a lock is held when being
> called, you don't necessarily know if it was the caller that grabbed it
> or it just happens to be held by someone else for unrelated reasons.
> 
> 

Thanks a lot for this and the above explanations.  Unfortunately, I
still see the problem.  To hopefully make you waste less time, I have
reported the problematic paths explicitly, so that you can quickly
point me to my mistake.

The problem is caused by the existence of at least the following two
alternative paths to e->type->ops.mq.allow_merge.

1.  In mq-deadline.c (line 374): spin_lock(>lock);
blk_mq_sched_try_merge -> elv_merge -> elv_bio_merge_ok ->
elv_iosched_allow_bio_merge -> e->type->ops.mq.allow_merge

2. In blk-core.c (line 1660): spin_lock_irq(q->queue_lock);
elv_merge -> elv_bio_merge_ok ->
elv_iosched_allow_bio_merge -> e->type->ops.mq.allow_merge

In the first path, the scheduler lock is held, while in the second
path, it is not.  This does not cause problems with mq-deadline,
because the latter just has no allow_merge function.  Yet it does
cause problems with the allow_merge implementation of bfq.  There was
no issue in blk, as only the global queue lock was used.

Where am I wrong?
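
In any case, if useful, the locked/unlocked-variant convention you
describe above would look roughly like this in bfq terms (illustrative
names only, not actual code):

/* __variant: assumes the scheduler lock is already held by the caller. */
static bool __bfq_allow_bio_merge(struct bfq_data *bfqd, struct request *rq,
				  struct bio *bio)
{
	/* ... actual merge-permission logic, with bfqd->lock held ... */
	return true;
}

/* Plain variant: grabs the scheduler lock itself. */
static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
				struct bio *bio)
{
	struct bfq_data *bfqd = q->elevator->elevator_data;
	bool ret;

	spin_lock(&bfqd->lock);
	ret = __bfq_allow_bio_merge(bfqd, rq, bio);
	spin_unlock(&bfqd->lock);

	return ret;
}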

Thanks,
Paolo


> -- 
> Jens Axboe
> 



Re: [PATCHSET v4] blk-mq-scheduling framework

2017-01-17 Thread Paolo Valente
[NEW RESEND ATTEMPT]

> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe  ha scritto:
> 
> On 12/22/2016 08:28 AM, Paolo Valente wrote:
>> 
>>> Il giorno 19 dic 2016, alle ore 22:05, Jens Axboe  ha scritto:
>>> 
>>> On 12/19/2016 11:21 AM, Paolo Valente wrote:
>>>> 
>>>>> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe  ha 
>>>>> scritto:
>>>>> 
>>>>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>>>>> 
>>>>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe  ha 
>>>>>>> scritto:
>>>>>>> 
>>>>>>> This is version 4 of this patchset, version 3 was posted here:
>>>>>>> 
>>>>>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>>>>>> 
>>>>>>> From the discussion last time, I looked into the feasibility of having
>>>>>>> two sets of tags for the same request pool, to avoid having to copy
>>>>>>> some of the request fields at dispatch and completion time. To do that,
>>>>>>> we'd have to replace the driver tag map(s) with our own, and augment
>>>>>>> that with tag map(s) on the side representing the device queue depth.
>>>>>>> Queuing IO with the scheduler would allocate from the new map, and
>>>>>>> dispatching would acquire the "real" tag. We would need to change
>>>>>>> drivers to do this, or add an extra indirection table to map a real
>>>>>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>>>>>> scheduler and hardware tag pools, or additional info to track it.
>>>>>>> Unless someone can convince me otherwise, I think the current approach
>>>>>>> is cleaner.
>>>>>>> 
>>>>>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>>>>>> to drastically decreased merging. Especially on rotating storage,
>>>>>>> this release should be fast, and on par with the merging that we
>>>>>>> get through the legacy schedulers.
>>>>>>> 
>>>>>> 
>>>>>> I'm now modifying bfq.  You mentioned other missing pieces to come.  Do
>>>>>> you already have an idea of what they are, so that I am somehow
>>>>>> prepared to what won't work even if my changes are right?
>>>>> 
>>>>> I'm mostly talking about elevator ops hooks that aren't there in the new
>>>>> framework, but exist in the old one. There should be no hidden
>>>>> surprises, if that's what you are worried about.
>>>>> 
>>>>> On the ops side, the only ones I can think of are the activate and
>>>>> deactivate, and those can be done in the dispatch_request hook for
>>>>> activate, and put/requeue for deactivate.
>>>>> 
>>>> 
>>>> You mean that there is no conceptual problem in moving the code of the
>>>> activate interface function into the dispatch function, and the code
>>>> of the deactivate into the put_request? (for a requeue it is a little
>>>> less clear to me, so one step at a time)  Or am I missing
>>>> something more complex?
>>> 
>>> Yes, what I mean is that there isn't a 1:1 mapping between the old ops
>>> and the new ops. So you'll have to consider the cases.
>>> 
>>> 
>> 
>> Problem: whereas it seems easy and safe to do somewhere else the
>> simple increment that was done in activate_request, I wonder if it may
>> happen that a request is deactivated before being completed.  If it may
>> happen, then, without a deactivate_request hook, the increments would
>> remain unbalanced.  Or are request completions always guaranteed unless
>> some hw/sw component breaks?
> 
> You should be able to do it in get/put_request. But you might need some
> extra tracking, I'd need to double check.

Exactly, AFAICT something extra is apparently needed.  In particular,
get is not ok, because dispatch is a different event (dispatch is,
however, an already-controlled event), while put could be used,
provided that it is guaranteed to be executed only after dispatch.  If
it is not, then I think that an extra flag or something should be
added to the request.  I don't know whether adding this extra piece
would be worse than adding an extra hook.
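
Just to make the 'extra flag' idea concrete, a minimal sketch could be the
following (purely illustrative: schedulers typically use the rq->elv.priv
slots for their own per-request data, so a real implementation would
probably need a dedicated field or flag bit):

/* Sketch only: record at dispatch time that the request left the scheduler. */
static void bfq_mark_dispatched(struct request *rq)
{
	rq->elv.priv[1] = (void *)1UL;		/* set from the dispatch hook */
}

static bool bfq_was_dispatched(struct request *rq)
{
	return rq->elv.priv[1] != NULL;		/* checked from put_rq_private() */
}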

> 
> I'm trying to avoid adding
> hooks that we don't truly need, the old interface had a lot of that. If
> you find that you need a hook and it isn't there, feel free to add it.
> activate/deactivate might be a good change.
> 

If my comments above do not trigger any proposal of a better solution,
then I will try adding only one extra 'deactivate' hook.  Unless
unbalanced hooks are a bad idea too.

Thanks,
Paolo

> -- 
> Jens Axboe



[PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-13 Thread Paolo Valente
If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq,
or, viceversa, a blk-mq scheduler is chosen for a device using blk, then
that scheduler is set and initialized without any check, driving the
system into an inconsistent state. This commit addresses this issue by
letting elevator_get fail for these wrong cross choices.

Signed-off-by: Paolo Valente 
---
 block/elevator.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 27ff1ed..a25bdd9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -99,7 +99,8 @@ static void elevator_put(struct elevator_type *e)
module_put(e->elevator_owner);
 }
 
-static struct elevator_type *elevator_get(const char *name, bool try_loading)
+static struct elevator_type *elevator_get(const char *name, bool try_loading,
+ bool mq_ops)
 {
struct elevator_type *e;
 
@@ -113,6 +114,12 @@ static struct elevator_type *elevator_get(const char *name, bool try_loading)
e = elevator_find(name);
}
 
+   if (e && (e->uses_mq != mq_ops)) {
+       pr_err("ERROR: attempted to choose %s %s I/O scheduler in blk%s",
+              name, e->uses_mq ? "blk-mq" : "legacy", mq_ops ? "-mq" : "");
+       e = NULL;
+   }
+
if (e && !try_module_get(e->elevator_owner))
e = NULL;
 
@@ -201,7 +208,7 @@ int elevator_init(struct request_queue *q, char *name)
q->boundary_rq = NULL;
 
if (name) {
-   e = elevator_get(name, true);
+   e = elevator_get(name, true, q->mq_ops);
if (!e)
return -EINVAL;
}
@@ -212,7 +219,7 @@ int elevator_init(struct request_queue *q, char *name)
 * off async and request_module() isn't allowed from async.
 */
if (!e && *chosen_elevator) {
-   e = elevator_get(chosen_elevator, false);
+   e = elevator_get(chosen_elevator, false, q->mq_ops);
if (!e)
printk(KERN_ERR "I/O scheduler %s not found\n",
chosen_elevator);
@@ -220,17 +227,20 @@ int elevator_init(struct request_queue *q, char *name)
 
if (!e) {
if (q->mq_ops && q->nr_hw_queues == 1)
-   e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false);
+   e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false,
+                                q->mq_ops);
else if (q->mq_ops)
-   e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+   e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false,
+                                q->mq_ops);
else
-   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false,
+                                q->mq_ops);
 
if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
"Using noop/none.\n");
-   e = elevator_get("noop", false);
+   e = elevator_get("noop", false, q->mq_ops);
}
}
 
@@ -1051,7 +1061,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
return elevator_switch(q, NULL);
 
strlcpy(elevator_name, name, sizeof(elevator_name));
-   e = elevator_get(strstrip(elevator_name), true);
+   e = elevator_get(strstrip(elevator_name), true, q->mq_ops);
if (!e) {
printk(KERN_ERR "elevator: type %s not found\n", elevator_name);
return -EINVAL;
-- 
2.10.0



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-13 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 20:49, Paolo Valente 
>  ha scritto:
> 
>> 
>> Il giorno 10 feb 2017, alle ore 19:13, Bart Van Assche 
>>  ha scritto:
>> 
>> On 02/10/2017 08:49 AM, Paolo Valente wrote:
>>>> $ grep '^C.*_MQ_' .config
>>>> CONFIG_BLK_MQ_PCI=y
>>>> CONFIG_MQ_IOSCHED_BFQ=y
>>>> CONFIG_MQ_IOSCHED_DEADLINE=y
>>>> CONFIG_MQ_IOSCHED_NONE=y
>>>> CONFIG_DEFAULT_MQ_BFQ_MQ=y
>>>> CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq"
>>>> CONFIG_SCSI_MQ_DEFAULT=y
>>>> CONFIG_DM_MQ_DEFAULT=y
>>>> 
>>> 
>>> Could you reconfigure with none or mq-deadline as default, check
>>> whether the system boots, and, if it does, switch manually to bfq-mq,
>>> check what happens, and, in the likely case of a failure, try to get
>>> the oops?
>> 
>> Hello Paolo,
>> 
>> I just finished performing that test with the following kernel config:
>> $ grep '^C.*_MQ_' .config
>> CONFIG_BLK_MQ_PCI=y
>> CONFIG_MQ_IOSCHED_BFQ=y
>> CONFIG_MQ_IOSCHED_DEADLINE=y
>> CONFIG_MQ_IOSCHED_NONE=y
>> CONFIG_DEFAULT_MQ_DEADLINE=y
>> CONFIG_DEFAULT_MQ_IOSCHED="mq-deadline"
>> CONFIG_SCSI_MQ_DEFAULT=y
>> CONFIG_DM_MQ_DEFAULT=y
>> 
>> After the system came up I logged in, switched to the bfq-mq scheduler
>> and ran several I/O tests against the boot disk.
> 
> Without any failure, right?
> 
> Unfortunately, as you can imagine, no boot failure occurred on
> any of my test systems so far :(
> 
> This version of bfq-mq can be configured to print all its activity in
> the kernel log, by just defining a macro.  This will of course slow
> down the system so much as to make it probably unusable, if bfq-mq is
> active from boot.  Yet, the failure may still occur early enough to make
> this approach useful to discover where bfq-mq gets stuck.  As of now I
> have no better ideas.  Any suggestion is welcome.
> 

Hi Bart,
I have found a machine crashing at boot, yet not only when bfq-mq is
chosen, but also when mq-deadline is chosen as the default
scheduler.  I have found and just reported the cause of the failure,
together with a fix.  Probably this is not the cause of your failure,
but what do you think about trying this fix?  BTW, I have rebased the
branch [1] against the new commits in Jens for-4.11/next.

Otherwise, if you have no news or suggestions, would you be willing to
try my micro-logging proposal?

Thanks,
Paolo

[1] https://github.com/Algodev-github/bfq-mq

> Thanks,
> Paolo
> 
>> Sorry but nothing
>> interesting appeared in the kernel log.
>> 
>> Bart.



[PATCH BUGFIX] attempt to fix wrong scheduler selection

2017-02-13 Thread Paolo Valente
Hi,
if, at boot, a legacy I/O scheduler is chosen for a device using
blk-mq, or, viceversa, a blk-mq scheduler is chosen for a device using
blk, then that scheduler is set and initialized without any check,
driving the system into an inconsistent state.

The purpose of this message is, first, to report this issue, and,
second, to propose a possible fix in case you do consider this as a
bug.

Thanks,
Paolo

Paolo Valente (1):
  block: make elevator_get robust against cross blk/blk-mq choice

 block/elevator.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

--
2.10.0


Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-13 Thread Paolo Valente

> Il giorno 07 feb 2017, alle ore 18:24, Paolo Valente 
>  ha scritto:
> 
> Hi,
> 
> I have finally pushed here [1] the current WIP branch of bfq for
> blk-mq, which I have tentatively named bfq-mq.
> 
> This branch *IS NOT* meant for merging into mainline and contains code
> that may easily violate code style, and not only that, in many
> places. Commits implement the following main steps:
> 1) Add the last version of bfq for blk
> 2) Clone bfq source files into identical bfq-mq source files
> 3) Modify bfq-mq files to get a working version of bfq for blk-mq
> (cgroups support not yet functional)
> 
> In my intentions, the main goals of this branch are:
> 
> 1) Show, as soon as I could, the changes I made to let bfq-mq comply
> with the blk-mq-sched framework. I thought this could be particularly
> useful for Jens, since BFQ is identical to CFQ in terms of hook
> interfaces and io-context handling, and almost identical in terms of
> request-merging.
> 
> 2) Enable people to test this first version of bfq-mq. The code is purposely
> overfull of log messages and invariant checks that halt the system on
> failure (lock assertions, BUG_ONs, ...).
> 
> To make it easier to revise commits, I'm sending the patches that
> transform bfq into bfq-mq (last four patches in the branch [1]). They
> work on two files, bfq-mq-iosched.c and bfq-mq.h, which, at the
> beginning, are just copies of bfq-iosched.c and bfq.h.
> 

Hi,
this is just to inform you that, as I just wrote to Bart, I have rebased
the branch [1] against the current content of for-4.11/next.

Jens, Omar, did you find the time to have a look at the main commits
or to run some test?

Thanks,
Paolo

[1] https://github.com/Algodev-github/bfq-mq

> Thanks,
> Paolo
> 
> [1] https://github.com/Algodev-github/bfq-mq
> 
> Paolo Valente (4):
>  blk-mq: pass bio to blk_mq_sched_get_rq_priv
>  Move thinktime from bic to bfqq
>  Embed bfq-ioc.c and add locking on request queue
>  Modify interface and operation to comply with blk-mq-sched
> 
> block/bfq-cgroup.c   |   4 -
> block/bfq-mq-iosched.c   | 852 +--
> block/bfq-mq.h   |  65 ++--
> block/blk-mq-sched.c |   8 +-
> block/blk-mq-sched.h |   5 +-
> include/linux/elevator.h |   2 +-
> 6 files changed, 567 insertions(+), 369 deletions(-)
> 
> --
> 2.10.0



Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-14 Thread Paolo Valente

> Il giorno 14 feb 2017, alle ore 00:10, Jens Axboe  ha 
> scritto:
> 
> On 02/13/2017 03:28 PM, Jens Axboe wrote:
>> On 02/13/2017 03:09 PM, Omar Sandoval wrote:
>>> On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote:
>>>> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq,
>>>> or, viceversa, a blk-mq scheduler is chosen for a device using blk, then
>>>> that scheduler is set and initialized without any check, driving the
>>>> system into an inconsistent state. This commit addresses this issue by
>>>> letting elevator_get fail for these wrong cross choices.
>>>> 
>>>> Signed-off-by: Paolo Valente 
>>>> ---
>>>> block/elevator.c | 26 ++
>>>> 1 file changed, 18 insertions(+), 8 deletions(-)
>>> 
>>> Hey, Paolo,
>>> 
>>> How exactly are you triggering this? In __elevator_change(), we do check
>>> for mq or not mq:
>>> 
>>> if (!e->uses_mq && q->mq_ops) {
>>> elevator_put(e);
>>> return -EINVAL;
>>> }
>>> if (e->uses_mq && !q->mq_ops) {
>>> elevator_put(e);
>>> return -EINVAL;
>>> }
>>> 
>>> We don't ever appear to call elevator_init() with a specific scheduler
>>> name, and for the default we switch off of q->mq_ops and use the
>>> defaults from Kconfig:
>>> 
>>> if (q->mq_ops && q->nr_hw_queues == 1)
>>> e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false);
>>> else if (q->mq_ops)
>>> e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
>>> else
>>> e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
>>> 
>>> if (!e) {
>>> printk(KERN_ERR
>>> "Default I/O scheduler not found. " \
>>> "Using noop/none.\n");
>>> e = elevator_get("noop", false);
>>> }
>>> 
>>> So I guess this could happen if someone manually changed those Kconfig
>>> options, but I don't see what other case would make this happen, could
>>> you please explain?
>> 
>> Was wondering the same - is it using the 'elevator=' boot parameter?
>> Didn't look at that path just now, but that's the only one I could
>> think of. If it is, I'd much prefer only using 'chosen_elevator' for
>> the non-mq stuff, and the fix should be just that instead.
>> 
>> So instead of:
>> 
>>  if (!e && *chosen_elevator) {
>> 
>> do
>> 
>>  if (!e && !q->mq_ops && *chosen_elevator) {
> 
> Confirmed, that's what it seems to be, and here's a real diff of the
> above example that works for me:
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index 27ff1ed5a6fa..699d10f71a2c 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -207,11 +207,12 @@ int elevator_init(struct request_queue *q, char *name)
>   }
> 
>   /*
> -  * Use the default elevator specified by config boot param or
> -  * config option.  Don't try to load modules as we could be running
> -  * off async and request_module() isn't allowed from async.
> +  * Use the default elevator specified by config boot param for
> +  * non-mq devices, or by config option.

I don't fully get this choice: being able to change the default I/O
scheduler through the command line has been rather useful for me,
saving me a lot of recompilations, and such a feature seems widespread
among (at least power) users.  However, mine is of course just an
opinion, and I may be missing the main point also in this case.

Thanks,
Paolo


> Don't try to load modules
> +  * as we could be running off async and request_module() isn't
> +  * allowed from async.
>*/
> - if (!e && *chosen_elevator) {
> + if (!e && !q->mq_ops && *chosen_elevator) {
>   e = elevator_get(chosen_elevator, false);
>   if (!e)
>   printk(KERN_ERR "I/O scheduler %s not found\n",
> 
> -- 
> Jens Axboe



Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-14 Thread Paolo Valente

> Il giorno 14 feb 2017, alle ore 16:16, Jens Axboe  ha 
> scritto:
> 
> On 02/14/2017 01:14 AM, Paolo Valente wrote:
>> 
>>> Il giorno 14 feb 2017, alle ore 00:10, Jens Axboe  ha 
>>> scritto:
>>> 
>>> On 02/13/2017 03:28 PM, Jens Axboe wrote:
>>>> On 02/13/2017 03:09 PM, Omar Sandoval wrote:
>>>>> On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote:
>>>>>> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq,
>>>>>> or, viceversa, a blk-mq scheduler is chosen for a device using blk, then
>>>>>> that scheduler is set and initialized without any check, driving the
>>>>>> system into an inconsistent state. This commit addresses this issue by
>>>>>> letting elevator_get fail for these wrong cross choices.
>>>>>> 
>>>>>> Signed-off-by: Paolo Valente 
>>>>>> ---
>>>>>> block/elevator.c | 26 ++
>>>>>> 1 file changed, 18 insertions(+), 8 deletions(-)
>>>>> 
>>>>> Hey, Paolo,
>>>>> 
>>>>> How exactly are you triggering this? In __elevator_change(), we do check
>>>>> for mq or not mq:
>>>>> 
>>>>>   if (!e->uses_mq && q->mq_ops) {
>>>>>   elevator_put(e);
>>>>>   return -EINVAL;
>>>>>   }
>>>>>   if (e->uses_mq && !q->mq_ops) {
>>>>>   elevator_put(e);
>>>>>   return -EINVAL;
>>>>>   }
>>>>> 
>>>>> We don't ever appear to call elevator_init() with a specific scheduler
>>>>> name, and for the default we switch off of q->mq_ops and use the
>>>>> defaults from Kconfig:
>>>>> 
>>>>>   if (q->mq_ops && q->nr_hw_queues == 1)
>>>>>   e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false);
>>>>>   else if (q->mq_ops)
>>>>>   e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
>>>>>   else
>>>>>   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
>>>>> 
>>>>>   if (!e) {
>>>>>   printk(KERN_ERR
>>>>>   "Default I/O scheduler not found. " \
>>>>>   "Using noop/none.\n");
>>>>>   e = elevator_get("noop", false);
>>>>>   }
>>>>> 
>>>>> So I guess this could happen if someone manually changed those Kconfig
>>>>> options, but I don't see what other case would make this happen, could
>>>>> you please explain?
>>>> 
>>>> Was wondering the same - is it using the 'elevator=' boot parameter?
>>>> Didn't look at that path just now, but that's the only one I could
>>>> think of. If it is, I'd much prefer only using 'chosen_elevator' for
>>>> the non-mq stuff, and the fix should be just that instead.
>>>> 
>>>> So instead of:
>>>> 
>>>>if (!e && *chosen_elevator) {
>>>> 
>>>> do
>>>> 
>>>>    if (!e && !q->mq_ops && *chosen_elevator) {
>>> 
>>> Confirmed, that's what it seems to be, and here's a real diff of the
>>> above example that works for me:
>>> 
>>> diff --git a/block/elevator.c b/block/elevator.c
>>> index 27ff1ed5a6fa..699d10f71a2c 100644
>>> --- a/block/elevator.c
>>> +++ b/block/elevator.c
>>> @@ -207,11 +207,12 @@ int elevator_init(struct request_queue *q, char *name)
>>> }
>>> 
>>> /*
>>> -* Use the default elevator specified by config boot param or
>>> -* config option.  Don't try to load modules as we could be running
>>> -* off async and request_module() isn't allowed from async.
>>> +* Use the default elevator specified by config boot param for
>>> +* non-mq devices, or by config option.
>> 
>> I don't fully get this choice: being able to change the default I/O
>> scheduler through the command line has been rather useful for me,
>> saving me a lot of recompilations, and such a feature seems widespread
>> among (at least power) users.  However, mine is of course just an
>> opinion, and I may be missing the main point also in this case.
> 
> The problem with the elevator= boot parameter is that it applies across
> everything, which makes very little sense, since it's a per device
> setting. In retrospect, it was a mistake to add this parameter, and I
> don't want to continue down that path with blk-mq.
> 

ok, thanks

> Why aren't you just using online switching through syses?

To change the scheduler from the very beginning, at boot.  Maybe that
can be done through udev rules as well; I'm just too ignorant about them.

Thanks,
Paolo

> For normal
> users, typically this would be done through udev rules.
> 
> -- 
> Jens Axboe



Re: [PATCH] bfq-mq: cause deadlock by executing exit_icq body immediately

2017-02-10 Thread Paolo Valente

> Il giorno 08 feb 2017, alle ore 18:17, Omar Sandoval  ha 
> scritto:
> 
> On Wed, Feb 08, 2017 at 11:39:24AM +0100, Paolo Valente wrote:
>> 
>>> Il giorno 08 feb 2017, alle ore 11:33, Omar Sandoval  
>>> ha scritto:
>>> 
>>> On Wed, Feb 08, 2017 at 11:03:01AM +0100, Paolo Valente wrote:
>>>> 
>>>>> Il giorno 07 feb 2017, alle ore 22:45, Omar Sandoval 
>>>>>  ha scritto:
>>>>> 
>>>>> On Tue, Feb 07, 2017 at 06:33:46PM +0100, Paolo Valente wrote:
>>>>>> Hi,
>>>>>> this patch is meant to show that, if the  body of the hook exit_icq is 
>>>>>> executed
>>>>>> from inside that hook, and not as deferred work, then a circular deadlock
>>>>>> occurs.
>>>>>> 
>>>>>> It happens if, on a CPU
>>>>>> - the body of icq_exit takes the scheduler lock,
>>>>>> - it does so from inside the exit_icq hook, which is invoked with the 
>>>>>> queue
>>>>>> lock held
>>>>>> 
>>>>>> while, on another CPU
>>>>>> - bfq_bio_merge, after taking the scheduler lock, invokes bfq_bic_lookup,
>>>>>> which, in its turn, takes the queue lock. bfq_bic_lookup needs to take 
>>>>>> such a
>>>>>> lock, because it invokes ioc_lookup_icq.
>>>>>> 
>>>>>> For more details, here is a lockdep report, right before the deadlock 
>>>>>> did occur.
>>>>>> 
>>>>>> [   44.059877] ==
>>>>>> [   44.124922] [ INFO: possible circular locking dependency detected ]
>>>>>> [   44.125795] 4.10.0-rc5-bfq-mq+ #38 Not tainted
>>>>>> [   44.126414] ---
>>>>>> [   44.127291] sync/2043 is trying to acquire lock:
>>>>>> [   44.128918]  (&(&bfqd->lock)->rlock){-.-...}, at: [] bfq_exit_icq_bfqq+0x55/0x140
>>>>>> [   44.134052]
>>>>>> [   44.134052] but task is already holding lock:
>>>>>> [   44.134868]  (&(&q->__queue_lock)->rlock){-.}, at: [] put_io_context_active+0x6e/0xc0
>>>>> 
>>>>> Hey, Paolo,
>>>>> 
>>>>> I only briefly skimmed the code, but what are you using the queue_lock
>>>>> for? You should just use your scheduler lock everywhere. blk-mq doesn't
>>>>> use the queue lock, so the scheduler is the only thing you need mutual
>>>>> exclusion against.
>>>> 
>>>> Hi Omar,
>>>> the cause of the problem is that the hook functions bfq_request_merge
>>>> and bfq_allow_bio_merge invoke, directly or through other functions,
>>>> the function bfq_bic_lookup, which, in its turn, invokes
>>>> ioc_lookup_icq.  The latter must be invoked with the queue lock held.
>>>> In particular the offending lines in bfq_bic_lookup are:
>>>> 
>>>>spin_lock_irqsave(q->queue_lock, flags);
>>>>icq = icq_to_bic(ioc_lookup_icq(ioc, q));
>>>>spin_unlock_irqrestore(q->queue_lock, flags);
>>>> 
>>>> Maybe I'm missing something and we can avoid taking this lock?
>>> 
>>> Ah, I didn't realize we still used the q->queue_lock for the icq stuff.
>>> You're right, you still need that lock for ioc_lookup_icq(). Unless
>>> there's something else I'm forgetting, that should be the only thing you
>>> need it for in the core code, and you should use your scheduler lock for
>>> everything else. What else are you using q->queue_lock for? 
>> 
>> Nothing.  The deadlock follows from the fact that bfq_request_merge gets
>> called with the scheduler lock already held.  Problematic paths start
>> from: bfq_bio_merge and bfq_insert_request.
>> 
>> I'm trying to understand whether I/we can reorder operations in some
>> way that avoids the nested locking, but to no avail so far.
>> 
>> Thanks,
>> Paolo
> 
> Okay, I understand what you're saying now. It was all in the first email
> but I didn't see it right away, sorry about that.
> 
> I don't think it makes sense for ->exit_icq() to be invoked while
> holding q->queue_lock for blk-mq -- we don't hold that lock for any of
> the other hooks. Could you try the below? I haven't convinced myself
> that there isn

Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 17:08, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-02-07 at 18:24 +0100, Paolo Valente wrote:
>> [1] https://github.com/Algodev-github/bfq-mq
> 
> Hello Paolo,
> 
> That branch includes two changes of the version suffix (EXTRAVERSION in 
> Makefile).
> Please don't do that but set CONFIG_LOCALVERSION in .config to add a suffix to
> the kernel version string.
> 

I know it, thanks. Unfortunately, you will probably find many other irregular 
things in that sort of private branch (as for that suffix, for some reason it was 
handy for me to have it tracked by git).

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 17:45, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-02-07 at 18:24 +0100, Paolo Valente wrote:
>> 2) Enable people to test this first version bfq-mq.
> 
> Hello Paolo,
> 
> I installed this version of bfq-mq on a server that boots from a SATA
> disk. That server boots fine with kernel v4.10-rc7 but not with this
> tree. The first 30 seconds of the boot process seem to proceed normally
> but after that time the messages on the console stop scrolling and
> about another 30 seconds later the server reboots. I haven't found
> anything useful in the system log. I configured the block layer as
> follows:
> 
> $ grep '^C.*_MQ_' .config
> CONFIG_BLK_MQ_PCI=y
> CONFIG_MQ_IOSCHED_BFQ=y
> CONFIG_MQ_IOSCHED_DEADLINE=y
> CONFIG_MQ_IOSCHED_NONE=y
> CONFIG_DEFAULT_MQ_BFQ_MQ=y
> CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq"
> CONFIG_SCSI_MQ_DEFAULT=y
> CONFIG_DM_MQ_DEFAULT=y
> 

Could you reconfigure with none or mq-deadline as default, check
whether the system boots, and, if it does, switch manually to bfq-mq,
check what happens, and, in the likely case of a failure, try to get
the oops?

Thank you very much,
Paolo

> Bart.



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 19:34, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-02-07 at 18:24 +0100, Paolo Valente wrote:
>> (lock assertions, BUG_ONs, ...).
> 
> Hello Paolo,
> 
> If you are using BUG_ON(), does that mean that you are not aware of Linus'
> opinion about BUG_ON()? Please read https://lkml.org/lkml/2016/10/4/1.
> 

I am, thanks.  But this is a testing version, overfull of assertions
as a form of hysterical defensive programming.  I will of course remove
all halting assertions in the submission for merging.

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 19:13, Bart Van Assche 
>  ha scritto:
> 
> On 02/10/2017 08:49 AM, Paolo Valente wrote:
>>> $ grep '^C.*_MQ_' .config
>>> CONFIG_BLK_MQ_PCI=y
>>> CONFIG_MQ_IOSCHED_BFQ=y
>>> CONFIG_MQ_IOSCHED_DEADLINE=y
>>> CONFIG_MQ_IOSCHED_NONE=y
>>> CONFIG_DEFAULT_MQ_BFQ_MQ=y
>>> CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq"
>>> CONFIG_SCSI_MQ_DEFAULT=y
>>> CONFIG_DM_MQ_DEFAULT=y
>>> 
>> 
>> Could you reconfigure with none or mq-deadline as default, check
>> whether the system boots, and, if it does, switch manually to bfq-mq,
>> check what happens, and, in the likely case of a failure, try to get
>> the oops?
> 
> Hello Paolo,
> 
> I just finished performing that test with the following kernel config:
> $ grep '^C.*_MQ_' .config
> CONFIG_BLK_MQ_PCI=y
> CONFIG_MQ_IOSCHED_BFQ=y
> CONFIG_MQ_IOSCHED_DEADLINE=y
> CONFIG_MQ_IOSCHED_NONE=y
> CONFIG_DEFAULT_MQ_DEADLINE=y
> CONFIG_DEFAULT_MQ_IOSCHED="mq-deadline"
> CONFIG_SCSI_MQ_DEFAULT=y
> CONFIG_DM_MQ_DEFAULT=y
> 
> After the system came up I logged in, switched to the bfq-mq scheduler
> and ran several I/O tests against the boot disk.

Without any failure, right?

Unfortunately, as you can imagine, no boot failure occurred on
any of my test systems so far :(

This version of bfq-mq can be configured to print all its activity in
the kernel log, by just defining a macro.  This will of course slow
down the system so much as to make it probably unusable, if bfq-mq is
active from boot.  Yet, the failure may still occur early enough to make
this approach useful to discover where bfq-mq gets stuck.  As of now I
have no better ideas.  Any suggestion is welcome.

Thanks,
Paolo

> Sorry but nothing
> interesting appeared in the kernel log.
> 
> Bart.



Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2017-02-16 Thread Paolo Valente

> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe  ha scritto:
> 
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
> 
> Signed-off-by: Jens Axboe 
...
> +
> +static void dd_merged_requests(struct request_queue *q, struct request *req,
> +struct request *next)
> +{
> + /*
> +  * if next expires before rq, assign its expire time to rq
> +  * and move into next position (next will be deleted) in fifo
> +  */
> + if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
> + if (time_before((unsigned long)next->fifo_time,
> + (unsigned long)req->fifo_time)) {
> + list_move(&req->queuelist, &next->queuelist);
> + req->fifo_time = next->fifo_time;
> + }
> + }
> +

Jens,
while trying to imagine the possible causes of Bart's hang with
bfq-mq, I've bumped into the following doubt: in the above function
(in my case, in bfq-mq's equivalent of the above function), are
we sure that neither req nor next could EVER be in dd->dispatch instead
of dd->fifo_list?  I've tried to verify it, but, although I think it has never
happened in my tests, I was not able to make sure that no unlucky
combination may ever happen (considering also the use of
blk_rq_is_passthrough to decide where to put a new request).

I'm making a blunder, right?

Thanks,
Paolo



[WIP BRANCH] cgroups support in bfq-mq WIP branch

2017-02-25 Thread Paolo Valente
Hi,
I've just completed cgroups support, and I'd like to highlight the
main blk-mq issue that I have found along the way.  I have pushed the
commit that completes the support for cgroups to the usual WIP branch
[1].  Before moving to this issue, I have a preliminary question about
the scheduler name, since I'm about to start preparing the patch
series for submission.  So far, I have used bfq-mq as a temporary
name.  Are we fine with it, or should I change it, for example, to
just bfq?  Jens?

I've found a sort of circular dependency in blk-mq, related to
scheduler initialization.  To describe both the issue and how I've
addressed it, I'm pasting the message of the new commit.

This commit completes cgroups support for bfq-mq. In particular, it deals with
a sort of circular dependency introduced in blk-mq: the function
blkcg_activate_policy, invoked during scheduler initialization, triggers the
invocation of the has_work scheduler hook (before the init function is
finished). To address this issue, this commit moves the invocation of
blkcg_activate_policy after the initialization of all the fields that could be
initialized before invoking blkcg_activate_policy itself. This enables has_work
to correctly return false, and thus to prevent the blk-mq stack from invoking
further scheduler hooks before the init function is finished.
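
To make the resulting ordering concrete, the init function now has roughly
the following shape (a simplified sketch with illustrative names and error
handling, not the actual bfq-mq code):

static int bfq_init_queue(struct elevator_queue *eq, struct request_queue *q)
{
	struct bfq_data *bfqd;

	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
	if (!bfqd)
		return -ENOMEM;
	eq->elevator_data = bfqd;

	/*
	 * Initialize every field that the hooks reachable from
	 * blkcg_activate_policy() may read (in particular whatever
	 * has_work() looks at), so that a premature has_work() call
	 * correctly reports an idle, empty scheduler.
	 */
	spin_lock_init(&bfqd->lock);
	bfqd->queued = 0;

	/* Only now is it safe to trigger the blkcg machinery. */
	if (blkcg_activate_policy(q, &blkcg_policy_bfq)) {
		kfree(bfqd);
		return -ENOMEM;
	}

	return 0;
}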

Thanks,
Paolo

[1] https://github.com/Algodev-github/bfq-mq

Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-25 Thread Paolo Valente

> Il giorno 24 feb 2017, alle ore 19:44, Bart Van Assche 
>  ha scritto:
> 
> On Wed, 2017-02-22 at 22:29 +0100, Paolo Valente wrote:
>> thanks for this second attempt of yours.  Although it, unfortunately,
>> does not provide a clear indication of the exact cause of your hang
>> (apart from a possible deadlock), your log helped me notice another bug.
>> 
>> At any rate, as I have just written to Jens, I have pushed a new
>> version of the branch [1] (not just added new commits, but also
>> integrated some old commits with new changes, to make things go more quickly).
>> The branch now contains both a fix for the above bug, and, more
>> importantly, a fix for the circular dependencies that were still
>> lurking around.  Could you please test it?
> 
> Hello Paolo,
> 

Hi

> I have good news: the same test system boots normally with the same
> kernel config I used during my previous tests and with the latest
> bfq-mq code (commit a965d19585c0) merged with kernel v4.10.
> 

Whew, I was longing for your reply, thanks :)

Should you want to have a look at it, I have just finished completing
cgroups support too, as you have probably already read from my
previous email.

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [PATCH V2 00/22] Replace the CFQ I/O Scheduler with BFQ

2016-09-05 Thread Paolo Valente

Il giorno 05/set/2016, alle ore 17:56, Bartlomiej Zolnierkiewicz 
 ha scritto:

> 
> Hi,
> 
> On Thursday, September 01, 2016 10:39:46 AM Linus Walleij wrote:
>> On Thu, Sep 1, 2016 at 12:09 AM, Mark Brown  wrote:
>> 
>>> - Do some benchmarks on the current status of the various branches on
>>>   relevant hardware (including trying to convert some of these slower
>>>   devices to blk-mq and seeing what happens).  Linus has been working
>>>   on this already in the context of MMC.
>> 
>> I'm trying to do a patch switching MMC to use blk-mq, so I can
>> benchmark performance before/after this.
>> 
>> While we expect mq to perform worse on single-hardware-queue
>> devices like these, we don't know until we tried, so I'm trying.
> 
> I did this (switched MMC to blk-mq) some time ago.  Patches are
> extremely ugly and hacky (basically the whole MMC block layer
> glue code needs to be re-done) so I'm rather reluctant to
> sharing them yet (to be honest I would like to rewrite them
> completely before posting).
> 
> I only did linear read tests (using dd) so far and results that
> I got were mixed (BTW the hardware I'm doing this work on is
> Odroid-XU3).  Pure block performance under maximum CPU frequency
> was slightly worse (5-12%) but the CPU consumption was reduced so
> when CPU was scaled down manually (or ondemand CPUfreq governor
> was used) blk-mq mode results were better than vanilla ones (up
> to 10% when CPU was scaled down to minimum frequency and even
> up to 50% when using ondemand governor - this finding is very
> interesting and needs to be investigated further).
> 

IMO, another important figure of merit is application- and
system-level latency (e.g., application/system responsiveness or frame
drop rate with audio/video playback/streaming, while the device
happens to be busy with furhter I/O). Scripts to measure it can be
found, e.g., here [1] for desktop systems. If I can, I'm willing to
help in any respect.

Thanks,
Paolo

[1] https://github.com/Algodev-github/S

> Best regards,
> --
> Bartlomiej Zolnierkiewicz
> Samsung R&D Institute Poland
> Samsung Electronics
> 
> --



Re: [PATCH V2 04/16] block, bfq: modify the peak-rate estimator

2017-04-06 Thread Paolo Valente

> Il giorno 04 apr 2017, alle ore 17:28, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-04-04 at 12:42 +0200, Paolo Valente wrote:
>>> Il giorno 31 mar 2017, alle ore 17:31, Bart Van Assche 
>>>  ha scritto:
>>> 
>>> On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
>>>> +   delta_ktime = ktime_get();
>>>> +   delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);
>>>> +   delta_usecs = ktime_to_us(delta_ktime);
>>> 
>>> This patch changes the type of the variable in which the result of 
>>> ktime_to_us()
>>> is stored from u64 into u32 and next compares that result with LONG_MAX. 
>>> Since
>>> ktime_to_us() returns a signed 64-bit number, are you sure you want to 
>>> store that
>>> result in a 32-bit variable? If ktime_to_us() would e.g. return 
>>> 0x0100
>>> or 0x10100 then the assignment will truncate these numbers to 0x100.
>> 
>> The instruction above the assignment you highlight stores in
>> delta_ktime the difference between 'now' and the last budget start.
>> The latter may have happened at most about 100 ms before 'now'.  So
>> there should be no overflow issue.
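For reference: 100 ms is just 10^5 µs, several orders of magnitude below
what even a signed 32-bit variable can hold (about 2.1 * 10^9 µs, i.e.,
more than half an hour).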
> 
> Hello Paolo,
> 
> Please double check the following code: if (delta_usecs < 1000 || delta_usecs 
> >= LONG_MAX)
> Since delta_usecs is a 32-bit variable and LONG_MAX a 64-bit constant on 
> 64-bit systems
> I'm not sure that code will do what it is intended to do.
> 

Yes, sorry. Actually, over the last eight years I have never seen that extra 
condition hold on 32-bit systems. So I think I will just remove 
it. Unless Fabio, who inserted that condition several years ago, has something 
to say.

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [PATCH V2 11/16] block, bfq: reduce idling only in symmetric scenarios

2017-04-07 Thread Paolo Valente

> Il giorno 31 mar 2017, alle ore 17:20, Bart Van Assche 
>  ha scritto:
> 
> On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
>> +   entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
>> +GFP_ATOMIC);
>> +   entity->weight_counter->weight = entity->weight;
> 
> GFP_ATOMIC allocations are more likely to fail than GFP_KERNEL allocations.
> What will happen if kzalloc() returns NULL?
> 

A plain crash :( I'm adding simple handling of this forgotten exception. If 
I don't get other reviews in the next few days, I'll post a V3 addressing this and 
the other issue you highlighted.

Thanks,
Paolo

> Bart.



Re: bfq-mq performance comparison to cfq

2017-04-10 Thread Paolo Valente

> Il giorno 10 apr 2017, alle ore 11:05, Andreas Herrmann  
> ha scritto:
> 
> Hi Paolo,
> 
> I've looked at your WIP branch as of 4.11.0-bfq-mq-rc4-00155-gbce0818
> and did some fio tests to compare the behavior to CFQ.
> 
> My understanding is that bfq-mq is supposed to be merged sooner or
> later and then it will be the only reasonable I/O scheduler with
> blk-mq for rotational devices. Hence I think it is interesting to see
> what to expect performance-wise in comparison to CFQ which is usually
> used for such devices with the legacy block layer.
> 
> I've just done simple tests iterating over number of jobs (1-8 as the
> test system had 8 CPUs) for all (random/sequential) read/write
> patterns. Fixed set of fio parameters used were '-size=5G
> --group_reporting --ioengine=libaio --direct=1 --iodepth=1
> --runtime=10'.
> 
> I've done 10 runs for each such configuration. The device used was an
> older SAMSUNG HD103SJ 1TB disk, SATA attached. Results that stick out
> the most are those for sequential reads and sequential writes:
> 
> * sequential reads
>  [0] - cfq, intel_pstate driver, powersave governor
>  [1] - bfq_mq, intel_pstate driver, powersave governor
> 
> jobs      [0]                   [1]
>           mean      stddev      mean       stddev
>  1 & 17060.300 &  77.090 & 17657.500 &  69.602
>  2 & 15318.200 &  28.817 & 10678.000 & 279.070
>  3 & 15403.200 &  42.762 &  9874.600 &  93.436
>  4 & 14521.200 & 624.111 &  9918.700 & 226.425
>  5 & 13893.900 & 144.354 &  9485.000 & 109.291
>  6 & 13065.300 & 180.608 &  9419.800 &  75.043
>  7 & 12169.600 &  95.422 &  9863.800 & 227.662
>  8 & 12422.200 & 215.535 & 15335.300 & 245.764
> 
> * sequential writes
>  [0] - cfq, intel_pstate driver, powersave governor
>  [1] - bfq_mq, intel_pstate driver, powersave governor
> 
> jobs      [0]                   [1]
>           mean      stddev      mean       stddev
>  1 & 14171.300 & 80.796 & 14392.500 & 182.587
>  2 & 13520.000 & 88.967 &  9565.400 & 119.400
>  3 & 13396.100 & 44.936 &  9284.000 &  25.122
>  4 & 13139.800 & 62.325 &  8846.600 &  45.926
>  5 & 12942.400 & 45.729 &  8568.700 &  35.852
>  6 & 12650.600 & 41.283 &  8275.500 & 199.273
>  7 & 12475.900 & 43.565 &  8252.200 &  33.145
>  8 & 12307.200 & 43.594 & 13617.500 & 127.773
> 
> With performance instead of powersave governor results were
> (expectedly) higher but the pattern was the same -- bfq-mq shows a
> "dent" for tests with 2-7 fio jobs. At the moment I have no
> explanation for this behavior.
> 

I have :)

BFQ, by default, is configured to privilege latency over throughput.
In this respect, as various people and I happened to discuss a few
times, even on these mailing lists, the only way to provide strong
low-latency guarantees, at the moment, is through device idling.  The
throughput loss you see is very likely to be the consequence of that
idling.

Why does the throughput go back up at eight jobs?  Because, if many
processes are born in a very short time interval, then BFQ understands
that some multi-job task is being started.  And these parallel tasks
usually prefer overall high throughput to single-process low latency.
Then, BFQ does not idle the device for these processes.

That said, if you do always want maximum throughput, even at the
expense of latency, then just switch off low-latency heuristics, i.e.,
set low_latency to 0.  Depending on the device, setting slice_idle to
0 may help a lot too (as well as with CFQ).  If the throughput is
still low also after forcing BFQ to an only-throughput mode, then you
hit some bug, and I'll have a little more work to do ...

Thanks,
Paolo

> Regards,
> Andreas



Re: bfq-mq performance comparison to cfq

2017-04-11 Thread Paolo Valente

> Il giorno 10 apr 2017, alle ore 11:55, Paolo Valente 
>  ha scritto:
> 
>> 
>> Il giorno 10 apr 2017, alle ore 11:05, Andreas Herrmann  
>> ha scritto:
>> 
>> Hi Paolo,
>> 
>> I've looked at your WIP branch as of 4.11.0-bfq-mq-rc4-00155-gbce0818
>> and did some fio tests to compare the behavior to CFQ.
>> 
>> My understanding is that bfq-mq is supposed to be merged sooner or
>> later and then it will be the only reasonable I/O scheduler with
>> blk-mq for rotational devices. Hence I think it is interesting to see
>> what to expect performance-wise in comparison to CFQ which is usually
>> used for such devices with the legacy block layer.
>> 
>> I've just done simple tests iterating over number of jobs (1-8 as the
>> test system had 8 CPUs) for all (random/sequential) read/write
>> patterns. Fixed set of fio parameters used were '-size=5G
>> --group_reporting --ioengine=libaio --direct=1 --iodepth=1
>> --runtime=10'.
>> 
>> I've done 10 runs for each such configuration. The device used was an
>> older SAMSUNG HD103SJ 1TB disk, SATA attached. Results that stick out
>> the most are those for sequential reads and sequential writes:
>> 
>> * sequential reads
>> [0] - cfq, intel_pstate driver, powersave governor
>> [1] - bfq_mq, intel_pstate driver, powersave governor
>> 
>> jobs      [0]                   [1]
>>           mean      stddev      mean       stddev
>> 1 & 17060.300 &  77.090 & 17657.500 &  69.602
>> 2 & 15318.200 &  28.817 & 10678.000 & 279.070
>> 3 & 15403.200 &  42.762 &  9874.600 &  93.436
>> 4 & 14521.200 & 624.111 &  9918.700 & 226.425
>> 5 & 13893.900 & 144.354 &  9485.000 & 109.291
>> 6 & 13065.300 & 180.608 &  9419.800 &  75.043
>> 7 & 12169.600 &  95.422 &  9863.800 & 227.662
>> 8 & 12422.200 & 215.535 & 15335.300 & 245.764
>> 
>> * sequential writes
>> [0] - cfq, intel_pstate driver, powersave governor
>> [1] - bfq_mq, intel_pstate driver, powersave governor
>> 
>> jobs      [0]                   [1]
>>           mean      stddev      mean       stddev
>> 1 & 14171.300 & 80.796 & 14392.500 & 182.587
>> 2 & 13520.000 & 88.967 &  9565.400 & 119.400
>> 3 & 13396.100 & 44.936 &  9284.000 &  25.122
>> 4 & 13139.800 & 62.325 &  8846.600 &  45.926
>> 5 & 12942.400 & 45.729 &  8568.700 &  35.852
>> 6 & 12650.600 & 41.283 &  8275.500 & 199.273
>> 7 & 12475.900 & 43.565 &  8252.200 &  33.145
>> 8 & 12307.200 & 43.594 & 13617.500 & 127.773
>> 
>> With performance instead of powersave governor results were
>> (expectedly) higher but the pattern was the same -- bfq-mq shows a
>> "dent" for tests with 2-7 fio jobs. At the moment I have no
>> explanation for this behavior.
>> 
> 
> I have :)
> 
> BFQ, by default, is configured to privilege latency over throughput.
> In this respect, as various people and I happened to discuss a few
> times, even on these mailing lists, the only way to provide strong
> low-latency guarantees, at the moment, is through device idling.  The
> throughput loss you see is very likely to be the consequence of that
> idling.
> 
> Why does the throughput go back up at eight jobs?  Because, if many
> processes are born in a very short time interval, then BFQ understands
> that some multi-job task is being started.  And these parallel tasks
> usually prefer overall high throughput to single-process low latency.
> Then, BFQ does not idle the device for these processes.
> 
> That said, if you do always want maximum throughput, even at the
> expense of latency, then just switch off low-latency heuristics, i.e.,
> set low_latency to 0.  Depending on the device, setting slice_idle to
> 0 may help a lot too (as well as with CFQ).  If the throughput is
> still low also after forcing BFQ to an only-throughput mode, then you
> hit some bug, and I'll have a little more work to do ...
> 

I forgot two pieces of information:
1) The throughput drop lasts only for a few seconds, after which BFQ
stops caring about the latency of the newborn fio processes, and aims
only at throughput.
2) One of my main goals, if and when BFQ is merged, is to achieve roughly
the same low-latency guarantees without idling, and thus without
losing throughput.

Paolo


> Thanks,
> Paolo
> 
>> Regards,
>> Andreas



Re: bfq-mq performance comparison to cfq

2017-04-11 Thread Paolo Valente

> Il giorno 10 apr 2017, alle ore 17:15, Bart Van Assche 
>  ha scritto:
> 
> On Mon, 2017-04-10 at 11:55 +0200, Paolo Valente wrote:
>> That said, if you do always want maximum throughput, even at the
>> expense of latency, then just switch off low-latency heuristics, i.e.,
>> set low_latency to 0.  Depending on the device, setting slice_ilde to
>> 0 may help a lot too (as well as with CFQ).  If the throughput is
>> still low also after forcing BFQ to an only-throughput mode, then you
>> hit some bug, and I'll have a little more work to do ...
> 
> Hello Paolo,
> 
> Has it been considered to make applications tell the I/O scheduler
> whether to optimize for latency or for throughput? It shouldn't be that
> hard for window managers and shells to figure out whether or not a new
> application that is being started is interactive or not. This would
> require a mechanism that allows applications to provide such information
> to the I/O scheduler. Wouldn't that be a better approach than the I/O
> scheduler trying to guess whether or not an application is an interactive
> application?
> 

IMO that would be an (or maybe the) optimal solution, in terms of both
throughput and latency.  We have even developed a prototype doing what
you propose, for Android.  Unfortunately, I have not yet succeeded in
getting the support needed to turn it into candidate production code, or
to build a similar solution for LSB-compliant systems.

Thanks,
Paolo


> Bart.



Re: [PATCH V2 00/16] Introduce the BFQ I/O scheduler

2017-04-11 Thread Paolo Valente

> Il giorno 10 apr 2017, alle ore 18:56, Bart Van Assche 
>  ha scritto:
> 
> On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
>> [ ... ]
> 
> Hello Paolo,
> 
> Is the git tree that is available at https://github.com/Algodev-github/bfq-mq
> appropriate for testing BFQ? If I merge that tree with v4.11-rc6 and if I run
> the srp-test software against that tree as follows:
> 
>./run_tests -e bfq-mq -t 02-mq
> 
> then the following appears on the console:
> 
> [ 2748.650352] BUG: unable to handle kernel NULL pointer dereference at 
> 00d0
> [ 2748.650442] IP: __bfq_insert_request+0x26/0x650 [bfq_mq_iosched]
> [ 2748.650509] PGD 0 
> [ 2748.650511] 
> [ 2748.650585] Oops:  [#1] SMP
> [ 2748.651107] CPU: 9 PID: 10772 Comm: kworker/9:2H Tainted: G  I 
> 4.11.0-rc6-dbg+ #1
> [ 2748.651191] Workqueue: kblockd blk_mq_requeue_work
> [ 2748.651228] task: 88037c808040 task.stack: c90003b4c000
> [ 2748.651268] RIP: 0010:__bfq_insert_request+0x26/0x650 [bfq_mq_iosched]
> [ 2748.651307] RSP: 0018:c90003b4f9d8 EFLAGS: 00010002
> [ 2748.651345] RAX: 0001 RBX:  RCX: 
> 0001
> [ 2748.651383] RDX: 0001 RSI: 880377f52e80 RDI: 
> 880401f774e8
> [ 2748.651423] RBP: c90003b4fa80 R08: 9093955f R09: 
> 0001
> [ 2748.651464] R10: c90003b4fa00 R11: a06d0d53 R12: 
> 880401f77840
> [ 2748.651506] R13: 880401f774e8 R14: 880378a451e0 R15: 
> 
> [ 2748.651547] FS:  () GS:88046f04() 
> knlGS:
> [ 2748.651588] CS:  0010 DS:  ES:  CR0: 80050033
> [ 2748.651626] CR2: 00d0 CR3: 01c0f000 CR4: 
> 001406e0
> [ 2748.651664] Call Trace:
> [ 2748.651778]  bfq_insert_request+0x83/0x280 [bfq_mq_iosched]
> [ 2748.651934]  bfq_insert_requests+0x50/0x70 [bfq_mq_iosched]
> [ 2748.651975]  blk_mq_sched_insert_request+0x11e/0x170
> [ 2748.652015]  blk_insert_cloned_request+0xb6/0x1f0
> [ 2748.652361]  map_request+0x13c/0x290 [dm_mod]
> [ 2748.652403]  dm_mq_queue_rq+0x90/0x160 [dm_mod]
> [ 2748.652441]  blk_mq_dispatch_rq_list+0x1f2/0x3e0
> [ 2748.652479]  blk_mq_sched_dispatch_requests+0xf1/0x190
> [ 2748.652516]  __blk_mq_run_hw_queue+0x12d/0x1c0
> [ 2748.652553]  __blk_mq_delay_run_hw_queue+0xe3/0xf0
> [ 2748.652593]  blk_mq_run_hw_queues+0x5c/0x80
> [ 2748.652632]  blk_mq_requeue_work+0x132/0x150
> [ 2748.652671]  process_one_work+0x206/0x6a0
> [ 2748.652709]  worker_thread+0x49/0x4a0
> [ 2748.652745]  kthread+0x107/0x140
> [ 2748.652854]  ret_from_fork+0x2e/0x40
> [ 2748.652891] Code: ff 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 
> 83 c4 80 8b 87 58 03 00 00 48 8b 9e b0 00 00 00 85 c0 0f 84 8b 04 00 00 <48> 
> 8b 83 d0 00 00 00 48 85 c0 0f 84 63 04 00 00
> 48 83 e8 10 48 
> [ 2748.653049] RIP: __bfq_insert_request+0x26/0x650 [bfq_mq_iosched] RSP: 
> c90003b4f9d8
> [ 2748.653090] CR2: 00d0
> 
> The crash address corresponds to the following source code according to gdb:
> 
> (gdb) list *(__bfq_insert_request+0x26)
> 0xd6f6 is in __bfq_insert_request (block/bfq-mq-iosched.c:4430).
> 4425
> 4426    static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
> 4427    {
> 4428            struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
> 4429
> 4430            assert_spin_locked(&bfqd->lock);
> 4431
> 4432            bfq_log_bfqq(bfqd, bfqq, "__insert_req: rq %p bfqq %p", rq, bfqq);
> 4433
> 4434            /*
> 

Hi Bart,
I've tried to figure out how to deal with this crash, but I didn't
find any sensible way to go, for the following two reasons.

First, if I'm not missing anything, I don't yet have the hardware
required to run srp-test, so I cannot easily reproduce this
failure.  Actually, BFQ is not yet suitable, and maybe never will be
in its current design, for very high-speed hardware such as InfiniBand
and NVMe devices.

Second, a NULL-pointer fault at the line you report is rather weird.
In fact, the sequence of C-code instructions executed up to that line
is:

struct bfq_data *bfqd = q->elevator->elevator_data;
...
spin_lock_irq(&bfqd->lock);
__bfq_insert_request(bfqd, rq);
/* inside the __bfq_insert_request function: */
struct bfq_queue *bfqq = RQ_BFQQ(rq), ...;
assert_spin_locked(&bfqd->lock);

So, how can the last line cause a NULL-pointer-dereference exception
on the same address, &bfqd->lock, on which spin_lock_irq(&bfqd->lock)
had just happily taken the lock?

Any idea on how to proceed?  If this strange bug remains hard to spot
then, if you agree, I will go ahead in the meantime and submit a new
version of the patch series, which addresses your other issues.

Thanks,
Paolo

> Bart.



Re: [PATCH V2 16/16] block, bfq: split bfq-iosched.c into multiple source files

2017-04-11 Thread Paolo Valente

> Il giorno 02 apr 2017, alle ore 12:02, kbuild test robot  ha 
> scritto:
> 
> Hi Paolo,
> 
> [auto build test ERROR on block/for-next]
> [also build test ERROR on v4.11-rc4 next-20170331]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 

Hi,
this seems to be a false positive.  The build succeeds with the tested
tree and that .config.

Thanks,
Paolo

> url:
> https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
> for-next
> config: i386-allmodconfig (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
># save the attached .config to linux build tree
>make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>>> ERROR: "bfq_mark_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfqg_stats_update_dequeue" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_clear_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_clear_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_clear_bfqq_wait_request" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_timeout" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfqg_stats_set_start_empty_time" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_weights_tree_add" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_put_queue" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_bfqq_sync" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfqg_to_blkg" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfqq_group" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_weights_tree_remove" [block/bfq-wf2q.ko] undefined!
>>> ERROR: "bfq_bic_update_cgroup" [block/bfq-iosched.ko] undefined!
>>> ERROR: "bfqg_stats_set_start_idle_time" [block/bfq-iosched.ko] undefined!
>>> ERROR: "bfqg_stats_update_completion" [block/bfq-iosched.ko] undefined!
>>> ERROR: "bfq_bfqq_move" [block/bfq-iosched.ko] undefined!
>>> ERROR: "bfqg_put" [block/bfq-iosched.ko] undefined!
>>> ERROR: "next_queue_may_preempt" [block/bfq-iosched.ko] undefined!
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> <.config.gz>



[PATCH V3 00/16] Introduce the BFQ I/O scheduler

2017-04-11 Thread Paolo Valente
Hi,
new patch series, addressing both issues raised by Bart [1].

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/3/31/393

Arianna Avanzini (4):
  block, bfq: add full hierarchical scheduling and cgroups support
  block, bfq: add Early Queue Merge (EQM)
  block, bfq: reduce idling only in symmetric scenarios
  block, bfq: handle bursts of queue activations

Paolo Valente (12):
  block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
  block, bfq: improve throughput boosting
  block, bfq: modify the peak-rate estimator
  block, bfq: add more fairness with writes and slow processes
  block, bfq: improve responsiveness
  block, bfq: reduce I/O latency for soft real-time applications
  block, bfq: preserve a low latency also with NCQ-capable drives
  block, bfq: reduce latency during request-pool saturation
  block, bfq: boost the throughput on NCQ-capable flash-based devices
  block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
  block, bfq: remove all get and put of I/O contexts
  block, bfq: split bfq-iosched.c into multiple source files

 Documentation/block/00-INDEX        |    2 +
 Documentation/block/bfq-iosched.txt |  531
 block/Kconfig.iosched               |   21 +
 block/Makefile                      |    1 +
 block/bfq-cgroup.c                  | 1139
 block/bfq-iosched.c                 | 5047 +++
 block/bfq-iosched.h                 |  942 +++
 block/bfq-wf2q.c                    | 1616 +++
 include/linux/blkdev.h              |    2 +-
 9 files changed, 9300 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/block/bfq-iosched.txt
 create mode 100644 block/bfq-cgroup.c
 create mode 100644 block/bfq-iosched.c
 create mode 100644 block/bfq-iosched.h
 create mode 100644 block/bfq-wf2q.c

--
2.10.0


[PATCH V3 03/16] block, bfq: improve throughput boosting

2017-04-11 Thread Paolo Valente
The feedback-loop algorithm used by BFQ to compute queue (process)
budgets is basically a set of three update rules, one for each of the
main reasons why a queue may be expired. If many processes suddenly
switch from sporadic I/O to greedy and sequential I/O, then these
rules are quite slow to assign large budgets to these processes, and
hence to achieve a high throughput. On the opposite side, BFQ assigns
the maximum possible budget B_max to a just-created queue. This allows
a high throughput to be achieved immediately if the associated process
is I/O-bound and performs sequential I/O from the beginning. But it
also increases the worst-case latency experienced by the first
requests issued by the process, because the larger the budget of a
queue waiting for service is, the later the queue will be served by
B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or
soft real-time application.

To tackle these throughput and latency problems, on one hand this
patch changes the initial budget value to B_max/2. On the other hand,
it re-tunes the three rules, adopting a more aggressive,
multiplicative increase/linear decrease scheme. This scheme trades
latency for throughput more than before, and tends to assign large
budgets quickly to processes that are or become I/O-bound. For two of
the expiration reasons, the new version of the rules also contains
some more little improvements, briefly described below.

*No more backlog.* In this case, the budget was larger than the number
of sectors actually read/written by the process before it stopped
doing I/O. Hence, to reduce latency for the possible future I/O
requests of the process, the old rule simply set the next budget to
the number of sectors actually consumed by the process. However, if
there are still outstanding requests, then the process may have not
yet issued its next request just because it is still waiting for the
completion of some of the still outstanding ones. If this sub-case
holds true, then the new rule, instead of decreasing the budget,
doubles it, proactively, in the hope that: 1) a larger budget will fit
the actual needs of the process, and 2) the process is sequential and
hence a higher throughput will be achieved by serving the process
longer after granting it access to the device.

*Budget timeout*. The original rule set the new budget to the maximum
value B_max, to maximize throughput and let all processes experiencing
budget timeouts receive the same share of the device time. In our
experiments we verified that this sudden jump to B_max did not provide
sensible benefits; rather it increased the latency of processes
performing sporadic and short I/O. The new rule only doubles the
budget.
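Very schematically, the kind of rules described above can be sketched as
follows; the names, the case labels and the exact growth factors are
illustrative placeholders, not the code added by this patch ('budget' and
'served' are in sectors, max_budget plays the role of B_max):

enum expiry_reason { NO_MORE_BACKLOG, BUDGET_TIMEOUT, BUDGET_EXHAUSTED };

static int min_int(int a, int b) { return a < b ? a : b; }

static int next_budget(int budget, int served, int outstanding,
		       enum expiry_reason reason, int max_budget)
{
	switch (reason) {
	case NO_MORE_BACKLOG:
		if (outstanding > 0)
			/* the process may just be waiting for completions */
			return min_int(budget * 2, max_budget);
		/* otherwise fall back to what the process actually consumed */
		return served;
	case BUDGET_TIMEOUT:
		/* no longer jump straight to B_max: just double */
		return min_int(budget * 2, max_budget);
	case BUDGET_EXHAUSTED:
		/* greedy, sequential I/O: multiplicative increase */
		return min_int(budget * 2, max_budget);
	}
	return budget;
}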

[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 87 +
 1 file changed, 41 insertions(+), 46 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 800048fa..553aee1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -752,9 +752,6 @@ static struct kmem_cache *bfq_pool;
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 32/8)
 
-/* Budget feedback step. */
-#define BFQ_BUDGET_STEP 128
-
 /* Min samples used for peak rate estimation (for autotuning). */
 #define BFQ_PEAK_RATE_SAMPLES  32
 
@@ -4074,40 +4071,6 @@ static struct bfq_queue *bfq_set_in_service_queue(struct 
bfq_data *bfqd)
return bfqq;
 }
 
-/*
- * bfq_default_budget - return the default budget for @bfqq on @bfqd.
- * @bfqd: the device descriptor.
- * @bfqq: the queue to consider.
- *
- * We use 3/4 of the @bfqd maximum budget as the default value
- * for the max_budget field of the queues.  This lets the feedback
- * mechanism to start from some middle ground, then the behavior
- * of the process will drive the heuristics towards high values, if
- * it behaves as a greedy sequential reader, or towards small values
- * if it shows a more intermittent behavior.
- */
-static unsigned long bfq_default_budget(struct bfq_data *bfqd,
-   struct bfq_queue *bfqq)
-{
-   unsigned long budget;
-
-   /*
-* When we need an estimate of the peak rate we need to avoid
-* to give budgets that are too short due to previous
-* measurements.  So, in the first 10 assignments use a
-* ``safe'' budget value. For such first assignment the value
-* of bfqd->budgets_assigned happens to be lower than 

[PATCH V3 07/16] block, bfq: reduce I/O latency for soft real-time applications

2017-04-11 Thread Paolo Valente
To guarantee a low latency also to the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in the previous patch)
also the queues associated to applications deemed as soft real-time.

To be deemed as soft real-time, an application must meet two
requirements.  First, the application must not require an average
bandwidth higher than the approximate bandwidth required to playback
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.

As for the second requirement, it is critical to require also that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This prevents also greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following.  First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement.  Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.

To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* their requests as quickly as they can,
whereas soft real-time applications spend some time processing data
after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time, thereby giving to the application the opportunity to be
deemed as such, only when both the following two conditions happen to
hold: 1) the queue associated with the application has expired and is
empty, 2) there is no outstanding request of the application.

Suppose that both conditions hold at time, say, t_c and that the
application issues its next request at time, say, t_i. At time t_c the
heuristic computes the next time instant, called soft_rt_next_start in
the code, such that, only if t_i >= soft_rt_next_start, then both the
next conditions will hold when the application issues its next
request: 1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy applications).

The current value of Delta is a little bit higher than the value that
we have found, experimentally, to be adequate on a real,
general-purpose machine. In particular we had to increase Delta to
make the filter quite precise also in slower, embedded systems, and in
KVM/QEMU virtual machines (details in the comments on the code).

If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.
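In pseudo-C, the computation of soft_rt_next_start at time t_c can be sketched
as follows; names, units and the exact reference instant used for the bandwidth
bound are illustrative simplifications, not the patch's code:

#include <stdint.h>

static uint64_t softrt_next_start_us(uint64_t t_c_us,	  /* instant at which 1) and 2) hold */
				     uint64_t service_sectors,	  /* I/O completed in the last batch(es) */
				     uint64_t max_rate_sect_per_s, /* soft real-time bandwidth bound */
				     uint64_t delta_us)		  /* minimum gap "Delta" */
{
	/* earliest instant at which the measured bandwidth falls below the bound */
	uint64_t bw_bound = t_c_us +
		service_sectors * 1000000ULL / max_rate_sect_per_s;
	/* but never earlier than t_c + Delta, to filter out greedy applications */
	uint64_t greedy_bound = t_c_us + delta_us;

	return bw_bound > greedy_bound ? bw_bound : greedy_bound;
}

Only if the next request arrives at t_i >= the returned instant does the queue
get (or keep) the soft real-time weight-raising.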

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 342 +---
 1 file changed, 323 insertions(+), 19 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1a32c83..7f94ad3 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -119,6 +119,13 @@
 #define BFQ_DEFAULT_GRP_IOPRIO 0
 #define BFQ_DEFAULT_GRP_CLASS  IOPRIO_CLASS_BE
 
+/*
+ * Soft real-time applications are extremely more latency sensitive
+ * than interactive ones. Over-raise the weight of the former to
+ * privilege them against the latter.
+ */
+#define BFQ_SOFTRT_WEIGHT_FACTOR   100
+
 struct bfq_entity;
 
 /**
@@ -343,6 +350,14 @@ struct bfq_queue {
/* current maximum weight-raising time for this queue */
unsigned long wr_cur_max_time;
/*
+* Minimum time instant such that, only if a new request is
+* enqueued after this time instant in an idle @bfq_queue with
+* no outstanding requests, then the task associated with the
+* queue it is deemed as soft real-time (see the comme

[PATCH V3 02/16] block, bfq: add full hierarchical scheduling and cgroups support

2017-04-11 Thread Paolo Valente
From: Arianna Avanzini 

Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.

Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
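The top-down selection described in the last sentences can be pictured with the
following sketch; the types are simplified and illustrative, while in the real
code this walk is spread across the B-WF2Q+ service trees:

struct sched_data;

struct entity {
	/* non-NULL for a group entity, NULL for a leaf (a bfq_queue) */
	struct sched_data *my_sched_data;
};

struct sched_data {
	/* kept up to date by B-WF2Q+ whenever timestamps change */
	struct entity *next_to_serve;
};

static struct entity *select_next_queue_entity(struct sched_data *root)
{
	struct entity *e = root->next_to_serve;

	/* descend from the highest-level next-to-serve entity to a leaf */
	while (e && e->my_sched_data)
		e = e->my_sched_data->next_to_serve;

	return e;	/* entity embedded in the bfq_queue to set in service */
}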

Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.

Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.

Signed-off-by: Fabio Checconi 
Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 Documentation/block/bfq-iosched.txt |   17 +-
 block/Kconfig.iosched   |   10 +
 block/bfq-iosched.c | 2568 ++-
 include/linux/blkdev.h  |2 +-
 4 files changed, 2213 insertions(+), 384 deletions(-)

diff --git a/Documentation/block/bfq-iosched.txt 
b/Documentation/block/bfq-iosched.txt
index cbf85f6f..461b27f 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -253,9 +253,14 @@ of slice_idle are copied from CFQ too.
 per-process ioprio and weight
 -
 
-Unless the cgroups interface is used, weights can be assigned to
-processes only indirectly, through I/O priorities, and according to
-the relation: weight = (IOPRIO_BE_NR - ioprio) * 10.
+Unless the cgroups interface is used (see "4. BFQ group scheduling"),
+weights can be assigned to processes only indirectly, through I/O
+priorities, and according to the relation:
+weight = (IOPRIO_BE_NR - ioprio) * 10.
+
+Beware that, if low-latency is set, then BFQ automatically raises the
+weight of the queues associated with interactive and soft real-time
+applications. Unset this tunable if you need/want to control weights.
 
 slice_idle
 --
@@ -450,9 +455,9 @@ may be reactivated for an already busy async queue (in ms).
 4. Group scheduling with BFQ
 
 
-BFQ supports both cgroup-v1 and cgroup-v2 io controllers, namely blkio
-and io. In particular, BFQ supports weight-based proportional
-share.
+BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
+blkio and io. In particular, BFQ supports weight-based proportional
+share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
 
 4-1 Service guarantees provided
 ---
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 562e30e..a37cd03 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -40,6 +40,7 @@ config CFQ_GROUP_IOSCHED
  Enable group IO scheduling in CFQ.
 
 choice
+
prompt "Default I/O scheduler"
default DEFAULT_CFQ
help
@@ -80,6 +81,15 @@ config IOSCHED_BFQ
real-time applications.  Details in
Documentation/block/bfq-iosched.txt
 
+config BFQ_GROUP_IOSCHED
+   bool "BFQ hierarchical scheduling support"
+   depends on IOSCHED_BFQ && BLK_CGROUP
+   default n
+   ---help---
+
+   Enable hierarchical scheduling in BFQ, using the blkio
+   (cgroups-v1) or io (cgroups-v2) controller.
+
 endmenu
 
 endif
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 56a59fe..800048fa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -90,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -114,7 +115,7 

[PATCH V3 13/16] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs

2017-04-11 Thread Paolo Valente
This patch is basically the counterpart, for NCQ-capable rotational
devices, of the previous patch. Exactly as the previous patch does on
flash-based devices and for any workload, this patch disables device
idling on rotational devices, but only for random I/O. In fact, only
for these queues does disabling idling boost the throughput on
NCQ-capable rotational devices. To not break service guarantees,
idling is disabled for NCQ-enabled rotational devices only when the
same symmetry conditions considered in the previous patches hold.

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2081784..549f030 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6439,20 +6439,15 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 * The next variable takes into account the cases where idling
 * boosts the throughput.
 *
-* The value of the variable is computed considering that
-* idling is usually beneficial for the throughput if:
+* The value of the variable is computed considering, first, that
+* idling is virtually always beneficial for the throughput if:
 * (a) the device is not NCQ-capable, or
 * (b) regardless of the presence of NCQ, the device is rotational
-* and the request pattern for bfqq is I/O-bound (possible
-* throughput losses caused by granting idling to seeky queues
-* are mitigated by the fact that, in all scenarios where
-* boosting throughput is the best thing to do, i.e., in all
-* symmetric scenarios, only a minimal idle time is allowed to
-* seeky queues).
+* and the request pattern for bfqq is I/O-bound and sequential.
 *
 * Secondly, and in contrast to the above item (b), idling an
 * NCQ-capable flash-based device would not boost the
-* throughput even with intense I/O; rather it would lower
+* throughput even with sequential I/O; rather it would lower
 * the throughput in proportion to how fast the device
 * is. Accordingly, the next variable is true if any of the
 * above conditions (a) and (b) is true, and, in particular,
@@ -6460,7 +6455,8 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 * device.
 */
idling_boosts_thr = !bfqd->hw_tag ||
-   (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq));
+   (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) &&
+bfq_bfqq_idle_window(bfqq));
 
/*
 * The value of the next variable,
-- 
2.10.0



[PATCH V3 12/16] block, bfq: boost the throughput on NCQ-capable flash-based devices

2017-04-11 Thread Paolo Valente
This patch boosts the throughput on NCQ-capable flash-based devices,
while still preserving latency guarantees for interactive and soft
real-time applications. The throughput is boosted by just not idling
the device when the in-service queue remains empty, even if the queue
is sync and has a non-null idle window. This helps to keep the drive's
internal queue full, which is necessary to achieve maximum
performance. This solution to boost the throughput is a port of
commits a68bbdd and f7d7b7a for CFQ.

As already highlighted in a previous patch, allowing the device to
prefetch and internally reorder requests trivially causes loss of
control on the request service order, and hence on service guarantees.
Fortunately, as discussed in detail in the comments on the function
bfq_bfqq_may_idle(), if every process has to receive the same
fraction of the throughput, then the service order enforced by the
internal scheduler of a flash-based device is relatively close to that
enforced by BFQ. In particular, it is close enough to let service
guarantees be substantially preserved.

Things change in an asymmetric scenario, i.e., if not every process
has to receive the same fraction of the throughput. In this case, to
guarantee the desired throughput distribution, the device must be
prevented from prefetching requests. This is exactly what this patch
does in asymmetric scenarios.

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 154 
 1 file changed, 106 insertions(+), 48 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b97801f..2081784 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6442,15 +6442,25 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 * The value of the variable is computed considering that
 * idling is usually beneficial for the throughput if:
 * (a) the device is not NCQ-capable, or
-* (b) regardless of the presence of NCQ, the request pattern
-* for bfqq is I/O-bound (possible throughput losses
-* caused by granting idling to seeky queues are mitigated
-* by the fact that, in all scenarios where boosting
-* throughput is the best thing to do, i.e., in all
-* symmetric scenarios, only a minimal idle time is
-* allowed to seeky queues).
+* (b) regardless of the presence of NCQ, the device is rotational
+* and the request pattern for bfqq is I/O-bound (possible
+* throughput losses caused by granting idling to seeky queues
+* are mitigated by the fact that, in all scenarios where
+* boosting throughput is the best thing to do, i.e., in all
+* symmetric scenarios, only a minimal idle time is allowed to
+* seeky queues).
+*
+* Secondly, and in contrast to the above item (b), idling an
+* NCQ-capable flash-based device would not boost the
+* throughput even with intense I/O; rather it would lower
+* the throughput in proportion to how fast the device
+* is. Accordingly, the next variable is true if any of the
+* above conditions (a) and (b) is true, and, in particular,
+* happens to be false if bfqd is an NCQ-capable flash-based
+* device.
 */
-   idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
+   idling_boosts_thr = !bfqd->hw_tag ||
+   (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq));
 
/*
 * The value of the next variable,
@@ -6491,14 +6501,16 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
bfqd->wr_busy_queues == 0;
 
/*
-* There is then a case where idling must be performed not for
-* throughput concerns, but to preserve service guarantees. To
-* introduce it, we can note that allowing the drive to
-* enqueue more than one request at a time, and hence
+* There is then a case where idling must be performed not
+* for throughput concerns, but to preserve service
+* guarantees.
+*
+* To introduce this case, we can note that allowing the drive
+* to enqueue more than one request at a time, and hence
 * delegating de facto final scheduling decisions to the
-* drive's internal scheduler, causes loss of control on the
+* drive's internal scheduler, entails loss of control on the
 * actual request service order. In particular, the critical
-* situation is when requests from different processes happens
+* situation is when requests from different processes happen
 * to be present, at the same time, in the internal queue(s)
 * of the drive. In such a situation, the drive, by deciding
 * the service order of the internally-queued requests, does
@@ -6509,5

[PATCH V3 15/16] block, bfq: remove all get and put of I/O contexts

2017-04-11 Thread Paolo Valente
When a bfq queue is set in service and when it is merged, a reference
to the I/O context associated with the queue is taken. This reference
is then released when the queue is deselected from service or
split. More precisely, the release of the reference is postponed to
when the scheduler lock is released, to avoid nesting between the
scheduler and the I/O-context lock. In fact, such nesting would lead
to deadlocks, because of other code paths that take the same locks in
the opposite order. This postponing of I/O-context releases does
complicate code.

This commit addresses this issue by modifying the involved operations
so that they no longer need to take the above I/O-context references.
It then also removes all gets and releases of these references.

Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 143 +---
 1 file changed, 23 insertions(+), 120 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b7e3c86..30bb8f9 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -538,8 +538,6 @@ struct bfq_data {
 
/* bfq_queue in service */
struct bfq_queue *in_service_queue;
-   /* bfq_io_cq (bic) associated with the @in_service_queue */
-   struct bfq_io_cq *in_service_bic;
 
/* on-disk position of the last served request */
sector_t last_position;
@@ -704,15 +702,6 @@ struct bfq_data {
struct bfq_io_cq *bio_bic;
/* bfqq associated with the task issuing current bio for merging */
struct bfq_queue *bio_bfqq;
-
-   /*
-* io context to put right after bfqd->lock is released. This
-* filed is used to perform put_io_context, when needed, to
-* after the scheduler lock has been released, and thus
-* prevent an ioc->lock from being possibly taken while the
-* scheduler lock is being held.
-*/
-   struct io_context *ioc_to_put;
 };
 
 enum bfqq_state_flags {
@@ -1148,34 +1137,6 @@ static void bfq_schedule_dispatch(struct bfq_data *bfqd)
}
 }
 
-/*
- * Next two functions release bfqd->lock and put the io context
- * pointed by bfqd->ioc_to_put. This delayed put is used to not risk
- * to take an ioc->lock while the scheduler lock is being held.
- */
-static void bfq_unlock_put_ioc(struct bfq_data *bfqd)
-{
-   struct io_context *ioc_to_put = bfqd->ioc_to_put;
-
-   bfqd->ioc_to_put = NULL;
-   spin_unlock_irq(&bfqd->lock);
-
-   if (ioc_to_put)
-   put_io_context(ioc_to_put);
-}
-
-static void bfq_unlock_put_ioc_restore(struct bfq_data *bfqd,
-  unsigned long flags)
-{
-   struct io_context *ioc_to_put = bfqd->ioc_to_put;
-
-   bfqd->ioc_to_put = NULL;
-   spin_unlock_irqrestore(&bfqd->lock, flags);
-
-   if (ioc_to_put)
-   put_io_context(ioc_to_put);
-}
-
 /**
  * bfq_gt - compare two timestamps.
  * @a: first ts.
@@ -2684,18 +2645,6 @@ static void __bfq_bfqd_reset_in_service(struct bfq_data 
*bfqd)
	struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
struct bfq_entity *entity = in_serv_entity;
 
-   if (bfqd->in_service_bic) {
-   /*
-* Schedule the release of a reference to
-* bfqd->in_service_bic->icq.ioc to right after the
-* scheduler lock is released. This ioc is not
-* released immediately, to not risk to possibly take
-* an ioc->lock while holding the scheduler lock.
-*/
-   bfqd->ioc_to_put = bfqd->in_service_bic->icq.ioc;
-   bfqd->in_service_bic = NULL;
-   }
-
bfq_clear_bfqq_wait_request(in_serv_bfqq);
	hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
bfqd->in_service_queue = NULL;
@@ -3495,7 +3444,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
__bfq_deactivate_entity(entity, false);
bfq_put_async_queues(bfqd, bfqg);
 
-   bfq_unlock_put_ioc_restore(bfqd, flags);
+   spin_unlock_irqrestore(&bfqd->lock, flags);
/*
 * @blkg is going offline and will be ignored by
 * blkg_[rw]stat_recursive_sum().  Transfer stats to the parent so
@@ -5472,20 +5421,18 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct 
bfq_queue *new_bfqq)
 * first time that the requests of some process are redirected to
 * it.
 *
-* We redirect bfqq to new_bfqq and not the opposite, because we
-* are in the context of the process owning bfqq, hence we have
-* the io_cq of this process. So we can immediately configure this
-* io_cq to redirect the requests of the process to new_bfqq.
+* We redirect bfqq to new_bfqq and not the opposite, because
+* we are in the context of the process owning bfqq, thus we
+* have the io_cq of this process. So we can immediately
+* co

[PATCH V3 14/16] block, bfq: handle bursts of queue activations

2017-04-11 Thread Paolo Valente
From: Arianna Avanzini 

Many popular I/O-intensive services or applications spawn or
reactivate many parallel threads/processes during short time
intervals. Examples are systemd during boot or git grep.  These
services or applications benefit mostly from a high throughput: the
quicker the I/O generated by their processes is cumulatively served,
the sooner the target job of these services or applications gets
completed. As a consequence, it is almost always counterproductive to
weight-raise any of the queues associated to the processes of these
services or applications: in most cases it would just lower the
throughput, mainly because weight-raising also implies device idling.

To address this issue, an I/O scheduler needs, first, to detect which
queues are associated with these services or applications. In this
respect, we have that, from the I/O-scheduler standpoint, these
services or applications cause bursts of activations, i.e.,
activations of different queues occurring shortly after each
other. However, a shorter burst of activations may be caused also by
the start of an application that does not consist of a lot of parallel
I/O-bound threads (see the comments on the function bfq_handle_burst
for details).

In view of these facts, this commit introduces:
1) a heuristic to detect (only) bursts of queue activations caused by
   services or applications consisting of many parallel I/O-bound
   threads;
2) the prevention of device idling and weight-raising for the queues
   belonging to these bursts.
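As a rough illustration of point 1), the detection can be sketched as follows;
the names and thresholds are illustrative, and the actual bfq_handle_burst is
more involved (it also maintains a burst list and checks the common parent
entity of the activated queues):

#include <stdbool.h>

struct burst_state {
	unsigned long last_activation;	/* time of the last queue activation */
	unsigned long burst_interval;	/* max gap between two "close" activations */
	unsigned int  burst_size;	/* queues activated in the current burst */
	unsigned int  large_thresh;	/* size beyond which the burst is "large" */
	bool	      large_burst;	/* true => no weight-raising, no idling */
};

static void handle_activation(struct burst_state *s, unsigned long now)
{
	if (now - s->last_activation >= s->burst_interval) {
		/* too far from the previous activation: start a new burst */
		s->burst_size = 1;
		s->large_burst = false;
	} else if (++s->burst_size > s->large_thresh) {
		/* many activations shortly after each other: one big parallel job */
		s->large_burst = true;
	}
	s->last_activation = now;
}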

Signed-off-by: Arianna Avanzini 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 404 ++--
 1 file changed, 389 insertions(+), 15 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 549f030..b7e3c86 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -360,6 +360,10 @@ struct bfq_queue {
 
/* bit vector: a 1 for each seeky requests in history */
u32 seek_history;
+
+   /* node for the device's burst list */
+   struct hlist_node burst_list_node;
+
/* position of the last request enqueued */
sector_t last_request_pos;
 
@@ -443,6 +447,17 @@ struct bfq_io_cq {
bool saved_IO_bound;
 
/*
+* Same purpose as the previous fields for the value of the
+* field keeping the queue's belonging to a large burst
+*/
+   bool saved_in_large_burst;
+   /*
+* True if the queue belonged to a burst list before its merge
+* with another cooperating queue.
+*/
+   bool was_in_burst_list;
+
+   /*
 * Similar to previous fields: save wr information.
 */
unsigned long saved_wr_coeff;
@@ -609,6 +624,36 @@ struct bfq_data {
 */
bool strict_guarantees;
 
+   /*
+* Last time at which a queue entered the current burst of
+* queues being activated shortly after each other; for more
+* details about this and the following parameters related to
+* a burst of activations, see the comments on the function
+* bfq_handle_burst.
+*/
+   unsigned long last_ins_in_burst;
+   /*
+* Reference time interval used to decide whether a queue has
+* been activated shortly after @last_ins_in_burst.
+*/
+   unsigned long bfq_burst_interval;
+   /* number of queues in the current burst of queue activations */
+   int burst_size;
+
+   /* common parent entity for the queues in the burst */
+   struct bfq_entity *burst_parent_entity;
+   /* Maximum burst size above which the current queue-activation
+* burst is deemed as 'large'.
+*/
+   unsigned long bfq_large_burst_thresh;
+   /* true if a large queue-activation burst is in progress */
+   bool large_burst;
+   /*
+* Head of the burst list (as for the above fields, more
+* details in the comments on the function bfq_handle_burst).
+*/
+   struct hlist_head burst_list;
+
/* if set to true, low-latency heuristics are enabled */
bool low_latency;
/*
@@ -671,7 +716,8 @@ struct bfq_data {
 };
 
 enum bfqq_state_flags {
-   BFQQF_busy = 0, /* has requests or is in service */
+   BFQQF_just_created = 0, /* queue just allocated */
+   BFQQF_busy, /* has requests or is in service */
BFQQF_wait_request, /* waiting for a request */
BFQQF_non_blocking_wait_rq, /*
 * waiting for a request
@@ -685,6 +731,10 @@ enum bfqq_state_flags {
 * having consumed at most 2/10 of
 * its budget
 */
+   BFQQF_in_large_burst,   /*
+* bfqq activated in a large burst,
+* see comments to bfq_handle_burst

[PATCH V3 11/16] block, bfq: reduce idling only in symmetric scenarios

2017-04-11 Thread Paolo Valente
From: Arianna Avanzini 

A seeky queue (i.e., a queue containing random requests) is assigned a
very small device-idling slice, for throughput issues. Unfortunately,
given the process associated with a seeky queue, this behavior causes
the following problem: if the process, say P, performs sync I/O and
has a higher weight than some other processes doing I/O and associated
with non-seeky queues, then BFQ may fail to guarantee to P its
reserved share of the throughput. The reason is that idling is key
for providing service guarantees to processes doing sync I/O [1].

This commit addresses this issue by allowing the device-idling slice
to be reduced for a seeky queue only if the scenario happens to be
symmetric, i.e., if all the queues are to receive the same share of
the throughput.

[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

Signed-off-by: Arianna Avanzini 
Signed-off-by: Riccardo Pizzetti 
Signed-off-by: Samuele Zecchini 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 287 ++--
 1 file changed, 280 insertions(+), 7 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6e7388a..b97801f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -183,6 +183,20 @@ struct bfq_sched_data {
 };
 
 /**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ * with a given weight.
+ */
+struct bfq_weight_counter {
+   unsigned int weight; /* weight of the entities this counter refers to */
+   unsigned int num_active; /* nr of active entities with this weight */
+   /*
+* Weights tree member (see bfq_data's @queue_weights_tree and
+* @group_weights_tree)
+*/
+   struct rb_node weights_node;
+};
+
+/**
  * struct bfq_entity - schedulable entity.
  *
  * A bfq_entity is used to represent either a bfq_queue (leaf node in the
@@ -212,6 +226,8 @@ struct bfq_sched_data {
 struct bfq_entity {
/* service_tree member */
struct rb_node rb_node;
+   /* pointer to the weight counter associated with this entity */
+   struct bfq_weight_counter *weight_counter;
 
/*
 * Flag, true if the entity is on a tree (either the active or
@@ -456,6 +472,25 @@ struct bfq_data {
struct bfq_group *root_group;
 
/*
+* rbtree of weight counters of @bfq_queues, sorted by
+* weight. Used to keep track of whether all @bfq_queues have
+* the same weight. The tree contains one counter for each
+* distinct weight associated to some active and not
+* weight-raised @bfq_queue (see the comments to the functions
+* bfq_weights_tree_[add|remove] for further details).
+*/
+   struct rb_root queue_weights_tree;
+   /*
+* rbtree of non-queue @bfq_entity weight counters, sorted by
+* weight. Used to keep track of whether all @bfq_groups have
+* the same weight. The tree contains one counter for each
+* distinct weight associated to some active @bfq_group (see
+* the comments to the functions bfq_weights_tree_[add|remove]
+* for further details).
+*/
+   struct rb_root group_weights_tree;
+
+   /*
 * Number of bfq_queues containing requests (including the
 * queue in service, even if it is idling).
 */
@@ -791,6 +826,11 @@ struct bfq_group_data {
  * to avoid too many special cases during group creation/
  * migration.
  * @stats: stats for this bfqg.
+ * @active_entities: number of active entities belonging to the group;
+ *   unused for the root group. Used to know whether there
+ *   are groups with more than one active @bfq_entity
+ *   (see the comments to the function
+ *   bfq_bfqq_may_idle()).
  * @rq_pos_tree: rbtree sorted by next_request position, used when
  *   determining if two or more queues have interleaving
  *   requests (see bfq_find_close_cooperator()).
@@ -818,6 +858,8 @@ struct bfq_group {
 
struct bfq_entity *my_entity;
 
+   int active_entities;
+
struct rb_root rq_pos_tree;
 
struct bfqg_stats stats;
@@ -1254,12 +1296,27 @@ static bool bfq_update_parent_budget(struct bfq_entity 
*next_in_service)
  * a candidate for next service (i.e, a candidate entity to serve
  * after the in-service entity is expired). The function then returns
  * true.
+ *
+ * In contrast, the entity could still be a candidate for next service
+ * if it is not a queue, and has more than one child. In fact, even if
+ * one of its children is about to be set in service, other children
+ * may still be the next to serve. As a consequence, a non-queue

[PATCH V3 10/16] block, bfq: add Early Queue Merge (EQM)

2017-04-11 Thread Paolo Valente
From: Arianna Avanzini 

A set of processes may happen to perform interleaved reads, i.e.,
read requests whose union would give rise to a sequential read pattern.
There are two typical cases: first, processes reading fixed-size chunks
of data at a fixed distance from each other; second, processes reading
variable-size chunks at variable distances. The latter case occurs for
example with QEMU, which splits the I/O generated by a guest into
multiple chunks, and lets these chunks be served by a pool of I/O
threads, iteratively assigning the next chunk of I/O to the first
available thread. CFQ denotes as 'cooperating' a set of processes that
are doing interleaved I/O, and when it detects cooperating processes,
it merges their queues to obtain a sequential I/O pattern from the union
of their I/O requests, and hence boost the throughput.

Unfortunately, in the following frequent case, the mechanism
implemented in CFQ for detecting cooperating processes and merging
their queues is not responsive enough to handle also the fluctuating
I/O pattern of the second type of processes. Suppose that one process
of the second type issues a request close to the next request to serve
of another process of the same type. At that time the two processes
would be considered as cooperating. But, if the request issued by the
first process is to be merged with some other already-queued request,
then, from the moment at which this request arrives, to the moment
when CFQ controls whether the two processes are cooperating, the two
processes are likely to be already doing I/O in distant zones of the
disk surface or device memory.

CFQ does, however, use preemption to obtain a sequential read pattern
from the read requests of the second type of processes too.  As a
consequence, CFQ uses two different mechanisms to achieve the same
goal: boosting the throughput with interleaved I/O.

This patch introduces Early Queue Merge (EQM), a unified mechanism to
get a sequential read pattern with both types of processes. The main
idea is to immediately check whether a newly-arrived request lets some
pair of processes become cooperating, both in the case of actual
request insertion and, to be responsive with the second type of
processes, in the case of request merge. Both types of processes are
then handled by just merging their queues.
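The core "closeness" test behind EQM can be sketched as follows; the code is
illustrative, while in the patch the candidate queue is looked up in a
per-group rbtree sorted by next-request position (rq_pos_tree) and the window
is BFQQ_CLOSE_THR sectors:

#include <stdbool.h>

#define CLOSE_THR	(8ULL * 1024)	/* window in sectors, as BFQQ_CLOSE_THR */

/*
 * True if two request positions are close enough that the queues issuing
 * them are likely doing interleaved I/O and are worth merging.
 */
static bool requests_are_close(unsigned long long pos_a,
			       unsigned long long pos_b)
{
	unsigned long long dist = pos_a > pos_b ? pos_a - pos_b : pos_b - pos_a;

	return dist <= CLOSE_THR;
}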

Signed-off-by: Arianna Avanzini 
Signed-off-by: Mauro Andreolini 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 881 +---
 1 file changed, 840 insertions(+), 41 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index deb1f21c..6e7388a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -281,11 +281,12 @@ struct bfq_ttime {
  * struct bfq_queue - leaf schedulable entity.
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async. @cgroup holds a reference to
- * the cgroup, to be sure that it does not disappear while a bfqq
- * still references it (mostly to avoid races between request issuing
- * and task migration followed by cgroup destruction).  All the fields
- * are protected by the queue lock of the containing bfqd.
+ * io_context or more, if it  is  async or shared  between  cooperating
+ * processes. @cgroup holds a reference to the cgroup, to be sure that it
+ * does not disappear while a bfqq still references it (mostly to avoid
+ * races between request issuing and task migration followed by cgroup
+ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
/* reference counter */
@@ -298,6 +299,16 @@ struct bfq_queue {
/* next ioprio and ioprio class if a change is in progress */
unsigned short new_ioprio, new_ioprio_class;
 
+   /*
+* Shared bfq_queue if queue is cooperating with one or more
+* other queues.
+*/
+   struct bfq_queue *new_bfqq;
+   /* request-position tree member (see bfq_group's @rq_pos_tree) */
+   struct rb_node pos_node;
+   /* request-position tree root (see bfq_group's @rq_pos_tree) */
+   struct rb_root *pos_root;
+
/* sorted list of pending requests */
struct rb_root sort_list;
/* if fifo isn't expired, next request to serve */
@@ -347,6 +358,12 @@ struct bfq_queue {
/* pid of the process owning the queue, used for logging purposes */
pid_t pid;
 
+   /*
+* Pointer to the bfq_io_cq owning the bfq_queue, set to %NULL
+* if the queue is shared.
+*/
+   struct bfq_io_cq *bic;
+
/* current maximum weight-raising time for this queue */
unsigned long wr_cur_max_time;
/*
@@ -375,10 +392,13 @@ struct bfq_queue {
 * last transition from idle to backlogged.
 */
unsigned long service_from_backlogged;
+
/*
 * Value of wr start time when switching to soft rt
 */
unsigned

[PATCH V3 08/16] block, bfq: preserve a low latency also with NCQ-capable drives

2017-04-11 Thread Paolo Valente
I/O schedulers typically allow NCQ-capable drives to prefetch I/O
requests, as NCQ boosts the throughput precisely by prefetching and
internally reordering requests.

Unfortunately, as discussed in detail and shown experimentally in [1],
this may cause fairness and latency guarantees to be violated. The
main problem is that the internal scheduler of an NCQ-capable drive
may postpone the service of some unlucky (prefetched) requests as long
as it deems serving other requests more appropriate to boost the
throughput.

This patch addresses this issue by not disabling device idling for
weight-raised queues, even if the device supports NCQ. This allows BFQ
to start serving a new queue, and therefore allows the drive to
prefetch new requests, only after the idling timeout expires. At that
time, all the outstanding requests of the expired queue have been most
certainly served.
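
In code terms, the resulting idling decision can be summarized by the
following simplified sketch (names are taken from the diff below; the
boolean variable is only illustrative):

        /*
         * Simplified form of the check changed below: a seeky queue on an
         * NCQ-capable drive (hw_tag set) loses device idling only if it is
         * NOT being weight-raised (wr_coeff == 1).
         */
        bool disable_idling = bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
                              bfqq->wr_coeff == 1;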

[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7f94ad3..574a5f6 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6233,7 +6233,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 
	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
bfqd->bfq_slice_idle == 0 ||
-   (bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+   (bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
+   bfqq->wr_coeff == 1))
enable_idle = 0;
else if (bfq_sample_valid(bfqq->ttime.ttime_samples)) {
if (bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle &&
-- 
2.10.0



[PATCH V3 06/16] block, bfq: improve responsiveness

2017-04-11 Thread Paolo Valente
This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. For this purpose, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment how
BFQ decides whether the associated application is interactive) receive
the following two special treatments:

1) The weight of the queue is raised.

2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).

For brevity, we simply call weight-raising the combination of these
two preferential treatments. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large-size application on that device, with cold caches and with no
additional workload.
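
A minimal sketch of such a duration computation follows (the
device_speed field and the reference start-up times are assumptions
made here for illustration, not the exact values or code used by BFQ):

        static unsigned int bfq_wr_duration_sketch(struct bfq_data *bfqd)
        {
                /* hypothetical reference start-up times, in milliseconds */
                unsigned int ref_ms = bfqd->device_speed == BFQ_BFQD_FAST ?
                                      3000 :    /* e.g., non-rotational */
                                      7000;     /* e.g., rotational */

                /* wr_cur_max_time is expressed in jiffies */
                return msecs_to_jiffies(ref_ms);
        }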

Finally, as for guaranteeing fast execution of interactive,
I/O-related tasks (such as opening a file), consider that any
interactive application blocks and waits for user input both after
starting up and after executing some task. After a while, the user may
trigger new operations, after which the application stops again, and
so on. Accordingly, the low-latency heuristic weight-raises a queue
again if it becomes backlogged after being idle for a sufficiently
long (configurable) time. The weight-raising then lasts for the same
time as for a just-created queue.

According to our experiments, the combination of this low-latency
heuristic and of the improvements described in the previous patch
allows BFQ to guarantee a high application responsiveness.

[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 Documentation/block/bfq-iosched.txt |   9 +
 block/bfq-iosched.c | 740 
 2 files changed, 675 insertions(+), 74 deletions(-)

diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 461b27f..1b87df6 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -375,6 +375,11 @@ default, low latency mode is enabled. If enabled, interactive and soft
 real-time applications are privileged and experience a lower latency,
 as explained in more detail in the description of how BFQ works.
 
+DO NOT enable this mode if you need full control on bandwidth
+distribution. In fact, if it is enabled, then BFQ automatically
+increases the bandwidth share of privileged applications, as the main
+means to guarantee a lower latency to them.
+
 timeout_sync
 
 
@@ -507,6 +512,10 @@ linear mapping between ioprio and weights, described at the beginning
 of the tunable section, is still valid, but all weights higher than
 IOPRIO_BE_NR*10 are mapped to ioprio 0.
 
+Recall that, if low-latency is set, then BFQ automatically raises the
+weight of the queues associated with interactive and soft real-time
+applications. Unset this tunable if you need/want to control weights.
+
 
 [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
 Scheduler", Proceedings of the First Workshop on Mobile System
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index dce273b..1a32c83 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -339,6 +339,17 @@ struct bfq_queue {
 
/* pid of the process owning the queue, used for logging purposes */
pid_t pid;
+
+   /* current maximum weight-raising time for this queue */
+   unsigned long wr_cur_max_time;
+   /*
+* Start time of the current weight-raising period if
+* the @bfq-queue is being weight-raised, otherwise
+* finish time of the last weight-raising period.
+*/
+   unsigned long last_wr_start_finish;
+   /* factor by which the weight of this queue is multiplied */
+   unsigned int wr_coeff;
 };
 
 /**
@@ -356,6 +367,11 @@ struct bfq_io_cq {
 #endif
 };
 
+enum bfq_device_speed {
+   BFQ_BFQD_FAST,
+   BFQ_BFQD_SLOW,
+};
+
 /**
  * struct bfq_data - per-device data structure.
  *
@@ -487,6 +503,34 @@ struct bfq_data {
 */
bool strict_guarantees;
 
+   /* if set to true, low-latency heuristics are enabled */
+   bool low_latency;
+   /*
+* Maximum factor by which the weight of a weight-raised queue
+* is multiplied.
+*/
+   unsigned int bfq_wr_coeff;
+   /* maximum duration of a weight-raising period (jiffies) */

[PATCH V3 09/16] block, bfq: reduce latency during request-pool saturation

2017-04-11 Thread Paolo Valente
This patch introduces a heuristic that reduces latency when the
I/O-request pool is saturated. This goal is achieved by disabling
device idling, for non-weight-raised queues, when there are weight-
raised queues with pending or in-flight requests. In fact, as
explained in more detail in the comment on the function
bfq_bfqq_may_idle(), this reduces the rate at which processes
associated with non-weight-raised queues grab requests from the pool,
thereby increasing the probability that processes associated with
weight-raised queues get a request immediately (or at least soon) when
they need one. Along the same lines, if there are weight-raised queues,
then this patch halves the service rate of async (write) requests for
non-weight-raised queues.
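
A quick numerical illustration of the second point (the request size
is made up for the example; the factor of 10 is bfq_async_charge_factor
from the previous patches, and the resulting expression mirrors the
change to bfq_serv_to_charge() in the diff below):

        /*
         * With bfq_async_charge_factor == 10, an 8-sector async write is
         * charged 80 sectors of budget when no weight-raised queue is busy,
         * and 160 sectors otherwise: the issuing queue burns its budget
         * twice as fast, i.e., its service rate is halved.
         */
        unsigned long charge = blk_rq_sectors(rq) * bfq_async_charge_factor *
                               (bfqd->wr_busy_queues > 0 ? 2 : 1);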

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 66 ++---
 1 file changed, 63 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 574a5f6..deb1f21c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -420,6 +420,8 @@ struct bfq_data {
 * queue in service, even if it is idling).
 */
int busy_queues;
+   /* number of weight-raised busy @bfq_queues */
+   int wr_busy_queues;
/* number of queued requests */
int queued;
/* number of requests dispatched and waiting for completion */
@@ -2490,6 +2492,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
bfqd->busy_queues--;
 
+   if (bfqq->wr_coeff > 1)
+   bfqd->wr_busy_queues--;
+
bfqg_stats_update_dequeue(bfqq_group(bfqq));
 
bfq_deactivate_bfqq(bfqd, bfqq, true, expiration);
@@ -2506,6 +2511,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
bfq_mark_bfqq_busy(bfqq);
bfqd->busy_queues++;
+
+   if (bfqq->wr_coeff > 1)
+   bfqd->wr_busy_queues++;
 }
 
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
@@ -3779,7 +3787,16 @@ static unsigned long bfq_serv_to_charge(struct request *rq,
if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
return blk_rq_sectors(rq);
 
-   return blk_rq_sectors(rq) * bfq_async_charge_factor;
+   /*
+* If there are no weight-raised queues, then amplify service
+* by just the async charge factor; otherwise amplify service
+* by twice the async charge factor, to further reduce latency
+* for weight-raised queues.
+*/
+   if (bfqq->bfqd->wr_busy_queues == 0)
+   return blk_rq_sectors(rq) * bfq_async_charge_factor;
+
+   return blk_rq_sectors(rq) * 2 * bfq_async_charge_factor;
 }
 
 /**
@@ -4234,6 +4251,7 @@ static void bfq_add_request(struct request *rq)
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
 
+   bfqd->wr_busy_queues++;
bfqq->entity.prio_changed = 1;
}
if (prev != bfqq->next_rq)
@@ -4474,6 +4492,8 @@ static void bfq_requests_merged(struct request_queue *q, struct request *rq,
 /* Must be called with bfqq != NULL */
 static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
 {
+   if (bfq_bfqq_busy(bfqq))
+   bfqq->bfqd->wr_busy_queues--;
bfqq->wr_coeff = 1;
bfqq->wr_cur_max_time = 0;
bfqq->last_wr_start_finish = jiffies;
@@ -5497,7 +5517,8 @@ static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 {
struct bfq_data *bfqd = bfqq->bfqd;
-   bool idling_boosts_thr, asymmetric_scenario;
+   bool idling_boosts_thr, idling_boosts_thr_without_issues,
+   asymmetric_scenario;
 
if (bfqd->strict_guarantees)
return true;
@@ -5520,6 +5541,44 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
 
/*
+* The value of the next variable,
+* idling_boosts_thr_without_issues, is equal to that of
+* idling_boosts_thr, unless a special case holds. In this
+* special case, described below, idling may cause problems to
+* weight-raised queues.
+*
+* When the request pool is saturated (e.g., in the presence
+* of write hogs), if the processes associated with
+* non-weight-raised queues ask for requests at a lower rate,
+* then processes associated with weight-raised queues have a
+* higher probability to get a request from the pool
+* immediately (or at least soon) when they need one. Thus
+* they have a higher probability to actually get a fraction
+* of the device throughput proportional to their high
+* weight. This is espe

[PATCH V3 05/16] block, bfq: add more fairness with writes and slow processes

2017-04-11 Thread Paolo Valente
This patch deals with two sources of unfairness, which can also cause
high latencies and throughput loss. The first source is related to
write requests. Write requests tend to starve read requests, basically
because, on the one hand, writes are slower than reads, whereas, on the
other hand, storage devices confuse schedulers by deceptively signaling
the completion of write requests immediately after receiving them. This
patch addresses this issue by simply throttling writes. In
particular, after a write request is dispatched for a queue, the
budget of the queue is decremented by the number of sectors to write,
multiplied by an (over)charge coefficient. The value of the
coefficient is the result of our tuning with different devices.
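
As a concrete example of the effect of this throttling (the request
size is made up for the example; the coefficient appears in the diff
below as bfq_async_charge_factor, and budget_consumed is just an
illustrative name):

        /*
         * With bfq_async_charge_factor == 10, dispatching a 1024-sector
         * write decrements the queue's budget by 10240 sectors, so a queue
         * doing async writes exhausts its budget, and gets rescheduled,
         * about ten times sooner than a queue doing sync reads of the
         * same size.
         */
        budget_consumed = blk_rq_sectors(rq) * bfq_async_charge_factor;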

The second source of unfairness has to do with slowness detection:
when the in-service queue is expired, BFQ also checks whether the
queue has been "too slow", i.e., has consumed its last-assigned budget
at such a low rate that it would have been impossible to consume all
of this budget within the maximum time slice T_max (Subsec. 3.5 in
[1]). In this case, the queue is always (over)charged the whole
budget, to reduce its utilization of the device. Both this overcharge
and the slowness-detection criterion may cause unfairness.

First, always charging a full budget to a slow queue is too coarse. It
is much more accurate to charge an amount of service 'equivalent' to
the amount of time during which the queue has been in service, and this
patch lets BFQ do so. As explained in more detail in the comments on
the code, this enables BFQ to provide time fairness among slow queues.
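
One way to express this 'equivalent' charge is the following sketch
(it assumes a linear relation between service time and budget;
time_ms, timeout_ms and entity are placeholders for illustration, not
necessarily the exact variables used in the patch):

        /*
         * Sketch: if a queue stayed in service for time_ms out of the
         * maximum slice timeout_ms, charge it the corresponding fraction
         * of the maximum budget, but never less than the service it
         * actually received.
         */
        tot_serv_to_charge = max_t(int, entity->service,
                                   bfqd->bfq_max_budget * time_ms / timeout_ms);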

Secondly, because of ZBR (zone bit recording), a queue may be deemed
slow when its associated process is performing I/O on the slowest zones
of a disk. However, unless the process is truly too slow, not reducing the
disk utilization of the queue is more profitable in terms of disk
throughput than the opposite. A similar problem is caused by logical
block mapping on non-rotational devices. For this reason, this patch
lets a queue be charged time, and not budget, only if the queue has
consumed less than 2/3 of its assigned budget. As an additional,
important benefit, this tolerance allows BFQ to preserve enough
elasticity to still perform bandwidth, and not time, distribution with
slightly unlucky or quasi-sequential processes.

Finally, for the same reasons as above, this patch makes slowness
detection itself much less harsh: a queue is deemed slow only if it
has consumed its budget at less than half of the peak rate.

[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 120 +---
 1 file changed, 85 insertions(+), 35 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 61d880b..dce273b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -753,6 +753,13 @@ static const int bfq_stats_min_budgets = 194;
 /* Default maximum budget values, in sectors and number of requests. */
 static const int bfq_default_max_budget = 16 * 1024;
 
+/*
+ * Async to sync throughput distribution is controlled as follows:
+ * when an async request is served, the entity is charged the number
+ * of sectors of the request, multiplied by the factor below
+ */
+static const int bfq_async_charge_factor = 10;
+
 /* Default timeout values, in jiffies, approximating CFQ defaults. */
 static const int bfq_timeout = HZ / 8;
 
@@ -1571,22 +1578,52 @@ static void bfq_bfqq_served(struct bfq_queue *bfqq, int served)
 }
 
 /**
- * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * bfq_bfqq_charge_time - charge an amount of service equivalent to the length
+ *   of the time interval during which bfqq has been in
+ *   service.
+ * @bfqd: the device
  * @bfqq: the queue that needs a service update.
+ * @time_ms: the amount of time during which the queue has received service
  *
- * When it's not possible to be fair in the service domain, because
- * a queue is not consuming its budget fast enough (the meaning of
- * fast depends on the timeout parameter), we charge it a full
- * budget.  In this way we should obtain a sort of time-domain
- * fairness among all the seeky/slow queues.
+ * If a queue does not consume its budget fast enough, then providing
+ * the queue with service fairness may impair throughput, more or less
+ * severely. For this reason, queues that consume their budget slowly
+ * are provided with time fairness instead of service fairness. This
+ * goal is achieved through the B

[PATCH V3 04/16] block, bfq: modify the peak-rate estimator

2017-04-11 Thread Paolo Valente
Unless the maximum budget B_max that BFQ can assign to a queue is set
explicitly by the user, BFQ automatically updates B_max. In
particular, BFQ dynamically sets B_max to the number of sectors that
can be read, at the current estimated peak rate, during the maximum
time, T_max, allowed before a budget timeout occurs. In formulas, if
we denote as R_est the estimated peak rate, then B_max = T_max ∗
R_est. Hence, the more R_est overestimates the actual device peak
rate, the higher the probability that processes unjustly incur budget
timeouts. Besides, too high a value of B_max unnecessarily increases
the deviation from an ideal, smooth service.
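
For instance (figures assumed only for illustration):

        B_max = T_max * R_est
              = 0.125 s * ~200000 sectors/s   (T_max = HZ/8 jiffies = 125 ms,
                                               R_est ~= 100 MB/s)
              = ~25000 sectors

If R_est were twice the real peak rate, consuming those ~25000 sectors
would actually take about 250 ms, i.e., the queue would hit the budget
timeout when only about half of its budget has been served.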

Unfortunately, it is not trivial to estimate the peak rate correctly:
because of the presence of sw and hw queues between the scheduler and
the device components that finally serve I/O requests, it is hard to
say exactly when a given dispatched request is served inside the
device, and for how long. As a consequence, it is hard to know
precisely at what rate a given set of requests is actually served by
the device.

On the opposite end, the dispatch time of any request is trivially
available, and, from this piece of information, the "dispatch rate"
of requests can be immediately computed. So, the idea in the next
function is to use what is known, namely request dispatch times
(plus, when useful, request completion times), to estimate what is
unknown, namely in-device request service rate.

The main issue is that, because of the above facts, the rate at
which a certain set of requests is dispatched over a certain time
interval can vary greatly with respect to the rate at which the
same requests are then served. But, since the size of any
intermediate queue is limited, and the service scheme is lossless
(no request is silently dropped), the following obvious convergence
property holds: the number of requests dispatched MUST become
closer and closer to the number of requests completed as the
observation interval grows. This is the key property used in
this new version of the peak-rate estimator.
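
In terms of the fields added by this patch, the rate observed over an
interval reduces to the following simplified computation (a sketch;
the real estimator additionally weighs samples and filters out
non-sequential observation intervals):

        /*
         * Rate over the current observation interval, in the same unit as
         * peak_rate ([BFQ_RATE_SHIFT * sectors/usec]): sectors dispatched
         * divided by the interval duration in microseconds.
         */
        rate = div64_u64(bfqd->tot_sectors_dispatched << BFQ_RATE_SHIFT,
                         bfqd->delta_from_first);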

Signed-off-by: Paolo Valente 
Signed-off-by: Arianna Avanzini 
---
 block/bfq-iosched.c | 495 +++-
 1 file changed, 371 insertions(+), 124 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 553aee1..61d880b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -407,19 +407,37 @@ struct bfq_data {
/* on-disk position of the last served request */
sector_t last_position;
 
+   /* time of last request completion (ns) */
+   u64 last_completion;
+
+   /* time of first rq dispatch in current observation interval (ns) */
+   u64 first_dispatch;
+   /* time of last rq dispatch in current observation interval (ns) */
+   u64 last_dispatch;
+
/* beginning of the last budget */
ktime_t last_budget_start;
/* beginning of the last idle slice */
ktime_t last_idling_start;
-   /* number of samples used to calculate @peak_rate */
+
+   /* number of samples in current observation interval */
int peak_rate_samples;
+   /* num of samples of seq dispatches in current observation interval */
+   u32 sequential_samples;
+   /* total num of sectors transferred in current observation interval */
+   u64 tot_sectors_dispatched;
+   /* max rq size seen during current observation interval (sectors) */
+   u32 last_rq_max_size;
+   /* time elapsed from first dispatch in current observ. interval (us) */
+   u64 delta_from_first;
/*
-* Peak read/write rate, observed during the service of a
-* budget [BFQ_RATE_SHIFT * sectors/usec]. The value is
-* left-shifted by BFQ_RATE_SHIFT to increase precision in
+* Current estimate of the device peak rate, measured in
+* [BFQ_RATE_SHIFT * sectors/usec]. The left-shift by
+* BFQ_RATE_SHIFT is performed to increase precision in
 * fixed-point calculations.
 */
-   u64 peak_rate;
+   u32 peak_rate;
+
/* maximum budget allotted to a bfq_queue before rescheduling */
int bfq_max_budget;
 
@@ -740,7 +758,7 @@ static const int bfq_timeout = HZ / 8;
 
 static struct kmem_cache *bfq_pool;
 
-/* Below this threshold (in ms), we consider thinktime immediate. */
+/* Below this threshold (in ns), we consider thinktime immediate. */
 #define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
 
 /* hw_tag detection: parallel requests threshold and min samples needed. */
@@ -752,8 +770,12 @@ static struct kmem_cache *bfq_pool;
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 32/8)
 
-/* Min samples used for peak rate estimation (for autotuning). */
-#define BFQ_PEAK_RATE_SAMPLES  32
+/* Min number of samples required to perform peak-rate update */
+#define BFQ_RATE_MIN_SAMPLES   32
+/* Min observati

Re: [PATCH V3 00/16] Introduce the BFQ I/O scheduler

2017-04-11 Thread Paolo Valente

> Il giorno 11 apr 2017, alle ore 16:37, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-04-11 at 15:42 +0200, Paolo Valente wrote:
>> new patch series, addressing (both) issues raised by Bart [1].
> 
> Hello Paolo,
> 
> Is there a git tree available somewhere with these patches and without
> the single queue BFQ scheduler?
> 

Just pushed:
https://github.com/Algodev-github/bfq-mq/tree/add-bfq-mq-logical

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [PATCH V3 02/16] block, bfq: add full hierarchical scheduling and cgroups support

2017-04-11 Thread Paolo Valente

> Il giorno 11 apr 2017, alle ore 23:47, Tejun Heo  ha scritto:
> 
> Hello,
> 
> On Tue, Apr 11, 2017 at 03:43:01PM +0200, Paolo Valente wrote:
>> From: Arianna Avanzini 
>> 
>> Add complete support for full hierarchical scheduling, with a cgroups
>> interface. Full hierarchical scheduling is implemented through the
>> 'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
>> associated with processes, and groups are represented in general by
>> entities. Given the bfq_queues associated with the processes belonging
>> to a given group, the entities representing these queues are sons of
>> the entity representing the group. At higher levels, if a group, say
>> G, contains other groups, then the entity representing G is the parent
>> entity of the entities representing the groups in G.
>> 
>> Hierarchical scheduling is performed as follows: if the timestamps of
>> a leaf entity (i.e., of a bfq_queue) change, and such a change lets
>> the entity become the next-to-serve entity for its parent entity, then
>> the timestamps of the parent entity are recomputed as a function of
>> the budget of its new next-to-serve leaf entity. If the parent entity
>> belongs, in its turn, to a group, and its new timestamps let it become
>> the next-to-serve for its parent entity, then the timestamps of the
>> latter parent entity are recomputed as well, and so on. When a new
>> bfq_queue must be set in service, the reverse path is followed: the
>> next-to-serve highest-level entity is chosen, then its next-to-serve
>> child entity, and so on, until the next-to-serve leaf entity is
>> reached, and the bfq_queue that this entity represents is set in
>> service.
>> 
>> Writeback is accounted for on a per-group basis, i.e., for each group,
>> the async I/O requests of the processes of the group are enqueued in a
>> distinct bfq_queue, and the entity associated with this queue is a
>> child of the entity associated with the group.
>> 
>> Weights can be assigned explicitly to groups and processes through the
>> cgroups interface, differently from what happens, for single
>> processes, if the cgroups interface is not used (as explained in the
>> description of the previous patch). In particular, since each node has
>> a full scheduler, each group can be assigned its own weight.
> 
> Can we please hold off on cgroup support for now?  I've been trying to
> chase down cpu scheduler latency issues lately and have some doubts
> about implementing cgroup support by simply nesting the timelines like
> this.
> 

Hi Tejun,
could you elaborate a bit more on this?  I mean, cgroups support has
been in BFQ (and CFQ) for almost ten years, working perfectly as far
as I know.  Of course it works in terms of I/O, and not CPU, bandwidth
distribution; and, for the moment, it is effective only for devices
below 30-50 KIOPS.  What's the point in (momentarily?) throwing away
such a fundamental feature?  What am I missing?

Thanks,
Paolo


> Thanks
> 
> -- 
> tejun



Re: [PATCH V3 00/16] Introduce the BFQ I/O scheduler

2017-04-12 Thread Paolo Valente

> Il giorno 11 apr 2017, alle ore 20:31, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-04-11 at 19:37 +0200, Paolo Valente wrote:
>> Just pushed:
>> https://github.com/Algodev-github/bfq-mq/tree/add-bfq-mq-logical
> 
> Thanks!
> 
> But are you aware that the code on that branch doesn't build?
> 
> $ make all
> [ ... ]
> ERROR: "bfq_mark_bfqq_busy" [block/bfq-wf2q.ko] undefined!
> ERROR: "bfqg_stats_update_dequeue" [block/bfq-wf2q.ko] undefined!
> [ ... ]
> 
> $ PAGER= git grep bfq_mark_bfqq_busy
> block/bfq-wf2q.c:   bfq_mark_bfqq_busy(bfqq);
> 

That's exactly the complaint of the kbuild test robot.  As I wrote,
the build completes with no problem on my test system (Ubuntu 16.04,
gcc 5.4.0), even with the exact offending tree and .config that the
robot reports.

I didn't understand what is going on.  In your case, as well as for
the test robot, the compilation of the file block/bfq-wf2q.c as a
module component fails because that file does not contain the
definitions of the reported functions.  But those definitions are
(uniquely) in the file block/bfq-iosched.c, which is to be compiled
together with the former file, according to the following rule in
block/Makefile:
obj-$(CONFIG_IOSCHED_BFQ)   += bfq-iosched.o bfq-wf2q.o bfq-cgroup.o

I have tried all combinations of configurations for bfq (built-in or
module, with or without cgroups support), always successfully.  If it
makes any sense to share this information, these are the exact
commands I used to test all combinations (in addition to making full
builds in some cases, and trying make all as in your case):

make O=builddir M=block

and

make O=builddir M=block modules

Where is my mistake?

Thanks,
Paolo

> Bart.



Re: [kbuild-all] [PATCH V2 16/16] block, bfq: split bfq-iosched.c into multiple source files

2017-04-12 Thread Paolo Valente

> Il giorno 12 apr 2017, alle ore 10:39, Ye Xiaolong  ha 
> scritto:
> 
> On 04/11, Paolo Valente wrote:
>> 
>>> Il giorno 02 apr 2017, alle ore 12:02, kbuild test robot  
>>> ha scritto:
>>> 
>>> Hi Paolo,
>>> 
>>> [auto build test ERROR on block/for-next]
>>> [also build test ERROR on v4.11-rc4 next-20170331]
>>> [if your patch is applied to the wrong git tree, please drop us a note to 
>>> help improve the system]
>>> 
>> 
>> Hi,
>> this seems to be a false positive.  Build is correct with the tested
>> tree and the .config.
>> 
> 
> Hmm, this error is reproducible in 0day side, and you patches were applied on
> top of 803e16d "Merge branch 'for-4.12/block' into for-next", is it the same 
> as
> yours?
> 

I have downloaded the offending tree directly from the github page.

Here are my steps in super detail.

I followed the url:
https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
and downloaded the tree ("Browse the repository at this point in
history" link on the top commit, then "Download ZIP"), plus the
.config.gz attached to the email.

Then I built with no error.

To try to help understand where the mistake is: the compilation of the
files of course fails because each of the offending files does not
contain the definitions of the reported functions.  But those
definitions are contained in one of the other files for the same
module, i.e., one of the files listed in the following rule in
block/Makefile:
obj-$(CONFIG_IOSCHED_BFQ)   += bfq-iosched.o bfq-wf2q.o bfq-cgroup.o

Maybe I'm making some mistake in the Makefile, or I forgot to modify
some other configuration file?

Help! :)

Thanks,
Paolo

> Thanks,
> Xiaolong
> 
>> Thanks,
>> Paolo
>> 
>>> url:
>>> https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
>>> base:   
>>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
>>> for-next
>>> config: i386-allmodconfig (attached as .config)
>>> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
>>> reproduce:
>>>   # save the attached .config to linux build tree
>>>   make ARCH=i386 
>>> 
>>> All errors (new ones prefixed by >>):
>>> 
>>>>> ERROR: "bfq_mark_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfqg_stats_update_dequeue" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_clear_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_clear_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] 
>>>>> undefined!
>>>>> ERROR: "bfq_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_clear_bfqq_wait_request" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_timeout" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfqg_stats_set_start_empty_time" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_weights_tree_add" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_put_queue" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_bfqq_sync" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfqg_to_blkg" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfqq_group" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_weights_tree_remove" [block/bfq-wf2q.ko] undefined!
>>>>> ERROR: "bfq_bic_update_cgroup" [block/bfq-iosched.ko] undefined!
>>>>> ERROR: "bfqg_stats_set_start_idle_time" [block/bfq-iosched.ko] undefined!
>>>>> ERROR: "bfqg_stats_update_completion" [block/bfq-iosched.ko] undefined!
>>>>> ERROR: "bfq_bfqq_move" [block/bfq-iosched.ko] undefined!
>>>>> ERROR: "bfqg_put" [block/bfq-iosched.ko] undefined!
>>>>> ERROR: "next_queue_may_preempt" [block/bfq-iosched.ko] undefined!
>>> 
>>> ---
>>> 0-DAY kernel test infrastructureOpen Source Technology 
>>> Center
>>> https://lists.01.org/pipermail/kbuild-all   Intel 
>>> Corporation
>>> <.config.gz>
>> 
>> ___
>> kbuild-all mailing list
>> kbuild-...@lists.01.org
>> https://lists.01.org/mailman/listinfo/kbuild-all



Re: bfq-mq performance comparison to cfq

2017-04-26 Thread Paolo Valente

> Il giorno 25 apr 2017, alle ore 11:40, Juri Lelli  ha 
> scritto:
> 
> Hi,
> 
> sorry if I jump into this interesting conversation, but I felt some people
> might have missed this and might be interested as well (even if from a
> slightly different POW). Let me Cc them (Patrick, Morten, Peter, Joel,
> Andres).
> 
> On 19/04/17 09:02, Paolo Valente wrote:
>> 
>>> Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche 
>>>  ha scritto:
>>> 
>>> On 04/11/17 00:29, Paolo Valente wrote:
>>>> 
>>>>> Il giorno 10 apr 2017, alle ore 17:15, Bart Van Assche 
>>>>>  ha scritto:
>>>>> 
>>>>> On Mon, 2017-04-10 at 11:55 +0200, Paolo Valente wrote:
>>>>>> That said, if you do always want maximum throughput, even at the
>>>>>> expense of latency, then just switch off low-latency heuristics, i.e.,
>>>>>> set low_latency to 0.  Depending on the device, setting slice_ilde to
>>>>>> 0 may help a lot too (as well as with CFQ).  If the throughput is
>>>>>> still low also after forcing BFQ to an only-throughput mode, then you
>>>>>> hit some bug, and I'll have a little more work to do ...
>>>>> 
>>>>> Has it been considered to make applications tell the I/O scheduler
>>>>> whether to optimize for latency or for throughput? It shouldn't be that
>>>>> hard for window managers and shells to figure out whether or not a new
>>>>> application that is being started is interactive or not. This would
>>>>> require a mechanism that allows applications to provide such information
>>>>> to the I/O scheduler. Wouldn't that be a better approach than the I/O
>>>>> scheduler trying to guess whether or not an application is an interactive
>>>>> application?
>>>> 
>>>> IMO that would be an (or maybe the) optimal solution, in terms of both
>>>> throughput and latency.  We have even developed a prototype doing what
>>>> you propose, for Android.  Unfortunately, I have not yet succeeded in
>>>> getting support, to turn it into candidate production code, or to make
>>>> a similar solution for lsb-compliant systems.
>>> 
>>> Hello Paolo,
>>> 
>>> What API was used by the Android application to tell the I/O scheduler 
>>> to optimize for latency? Do you think that it would be sufficient if the 
>>> application uses the ioprio_set() system call to set the I/O priority to 
>>> IOPRIO_CLASS_RT?
>>> 
>> 
>> That's exactly the hack we are using in our prototype.  However, it
>> can only be a temporary hack, because it mixes two slightly different
>> concepts: 1) the activation of weight raising and other mechanisms for
>> reducing latency for the target app, 2) the assignment of a different
>> priority class, which (cleanly) means just that processes in a lower
>> priority class will be served only when the processes of the target
>> app have no pending I/O request.  Finding a clean boosting API would
>> be one of the main steps to turn our prototype into a usable solution.
>> 
> 
> I also need to append here latest Bart's reply (which hasn't all the
> context):
> 
> On 19/04/17 15:43, Bart Van Assche wrote:
>> On Wed, 2017-04-19 at 09:02 +0200, Paolo Valente wrote:
>>>> Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche 
>>>>  ha scritto:
>>>> What API was used by the Android application to tell the I/O scheduler 
>>>> to optimize for latency? Do you think that it would be sufficient if the 
>>>> application uses the ioprio_set() system call to set the I/O priority to 
>>>> IOPRIO_CLASS_RT?
>>> 
>>> That's exactly the hack we are using in our prototype.  However, it
>>> can only be a temporary hack, because it mixes two slightly different
>>> concepts: 1) the activation of weight raising and other mechanisms for
>>> reducing latency for the target app, 2) the assignment of a different
>>> priority class, which (cleanly) means just that processes in a lower
>>> priority class will be served only when the processes of the target
>>> app have no pending I/O request.  Finding a clean boosting API would
>>> be one of the main steps to turn our prototype into a usable solution.
>> 
>> Hello Paolo,
>> 
>> Sorry but I do not agree that you call this use of I/O priorities a hack.
>> I also do not agree that I/O requests submitted by processes in a lower
>> priority 

Re: [PATCH] MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler

2017-04-28 Thread Paolo Valente

> Il giorno 28 apr 2017, alle ore 12:10, Ulf Hansson  
> ha scritto:
> 
> Seems like this was forgotten in the bfq-series from Paolo. Let's do it now
> so people don't miss out involving Paolo for any future changes or when
> reporting bugs.
> 
> Signed-off-by: Ulf Hansson 

Acked-by: Paolo Valente 

> ---
> MAINTAINERS | 8 
> 1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 38d3e4e..1bb06c5 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2544,6 +2544,14 @@ F: block/
> F:kernel/trace/blktrace.c
> F:lib/sbitmap.c
> 
> +BFQ I/O SCHEDULER
> +M:   Paolo Valente 
> +M:   Jens Axboe 
> +L:   linux-bl...@vger.kernel.org
> +S:   Maintained
> +F:   block/bfq-*
> +F:   Documentation/block/bfq-iosched.txt
> +
> BLOCK2MTD DRIVER
> M:Joern Engel 
> L:linux-...@lists.infradead.org
> -- 
> 2.7.4
> 



Re: [PATCH V2 04/16] block, bfq: modify the peak-rate estimator

2017-04-04 Thread Paolo Valente

> Il giorno 31 mar 2017, alle ore 17:31, Bart Van Assche 
>  ha scritto:
> 
> On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
>> -static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> -bool compensate)
>> +static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> +bool compensate, enum bfqq_expiration reason,
>> +unsigned long *delta_ms)
>>  {
>> -   u64 bw, usecs, expected, timeout;
>> -   ktime_t delta;
>> -   int update = 0;
>> +   ktime_t delta_ktime;
>> +   u32 delta_usecs;
>> +   bool slow = BFQQ_SEEKY(bfqq); /* if delta too short, use seekyness */
>>  
>> -   if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
>> +   if (!bfq_bfqq_sync(bfqq))
>> return false;
>>  
>> if (compensate)
>> -   delta = bfqd->last_idling_start;
>> +   delta_ktime = bfqd->last_idling_start;
>> else
>> -   delta = ktime_get();
>> -   delta = ktime_sub(delta, bfqd->last_budget_start);
>> -   usecs = ktime_to_us(delta);
>> -
>> -   /* Don't trust short/unrealistic values. */
>> -   if (usecs < 100 || usecs >= LONG_MAX)
>> -   return false;
>> -
>> -   /*
>> -* Calculate the bandwidth for the last slice.  We use a 64 bit
>> -* value to store the peak rate, in sectors per usec in fixed
>> -* point math.  We do so to have enough precision in the estimate
>> -* and to avoid overflows.
>> -*/
>> -   bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
>> -   do_div(bw, (unsigned long)usecs);
>> +   delta_ktime = ktime_get();
>> +   delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);
>> +   delta_usecs = ktime_to_us(delta_ktime);
>> +
> 
> This patch changes the type of the variable in which the result of 
> ktime_to_us()
> is stored from u64 into u32 and next compares that result with LONG_MAX. Since
> ktime_to_us() returns a signed 64-bit number, are you sure you want to store 
> that
> result in a 32-bit variable? If ktime_to_us() would e.g. return 
> 0x0100
> or 0x10100 then the assignment will truncate these numbers to 0x100.
> 

The instruction above the assignment you highlight stores in
delta_ktime the difference between 'now' and the last budget start.
The latter may have happened at most about 100 ms before 'now'.  So
there should be no overflow issue.
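
In numbers (taking the 100 ms bound above): 100 ms = 100,000 us,
whereas a u32 can hold values up to about 4.29 * 10^9 us (more than an
hour), so delta_usecs stays several orders of magnitude below the
point where the 32-bit truncation could matter.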

Thanks,
Paolo

> Bart.


