Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-17 Thread Kyle Sanderson
Not to harp on this again, but if BFQ isn't suitable to replace CFQ
for high-IOPS workloads (I've yet to see 20k IOPS on any reasonably
sized SAN (SC4020 / v5000, etc.)), can't we at least make BFQ the
default I/O scheduler for people otherwise requesting CFQ? Paolo has
had a team of students working on this for years, and even if this
otherwise "secret weapon" is mainlined, I highly doubt his work will
stop. We're pretty close to fixing hard I/O stalls in Linux, with
mainlining being the last major hurdle.

While I've contributed nothing to BFQ code-wise, please do let any of
us know if there's anything outstanding needed to solve these hard
stalls, and I believe any of us will try our best.

Kyle.

On Sun, Oct 16, 2016 at 12:02 PM, Paolo Valente
 wrote:
>
>> Il giorno 14 ott 2016, alle ore 20:35, Tejun Heo  ha 
>> scritto:
>>
>> Hello, Paolo.
>>
>> On Fri, Oct 14, 2016 at 07:13:41PM +0200, Paolo Valente wrote:
>>> That said, your 'thus' seems a little too strong: "bfq does not yet
>>> handle fast SSDs, thus we need something else".  What about the
>>> millions of devices (and people) still within 10-20 K IOPS, and
>>> experiencing awful latencies and lack of bandwidth guarantees?
>>
>> I'm not objecting to any of that.
>
> Ok, sorry for the misunderstanding.  I'm just more and more confused about
> why a readily available solution that has not been proven wrong has not yet
> been accepted, if everybody apparently acknowledges the problem.
>
>>  My point just is that bfq, at least
>> as currently implemented, is unfit for certain classes of use cases.
>>
>
> Absolutely correct.
>
 FWIW, it looks like the only way we can implement proportional control
 on highspeed ssds with acceptable overhead
>>>
>>> Maybe not: as I wrote to Vivek in a previous reply containing
>>> pointers to documentation, we have already achieved twenty million
>>> scheduling decisions per second with a prototype driving existing
>>> proportional-share packet schedulers (essentially without
>>> modifications).
>>
>> And that doesn't require idling and thus doesn't severely impact
>> utilization?
>>
>
> Nope.  Packets are commonly assumed to be sent asynchronously.
> I guess that discussing the validity of this assumption is out of the
> scope of this thread.
>
> Thanks,
> Paolo
>
 is somehow finding a way to
 calculate the cost of each IO and throttle IOs according to that while
 controlling for latency as necessary.  Slice scheduling with idling
 seems too expensive with highspeed devices with high io depth.
>>>
>>> Yes, that's absolutely true.  I'm already thinking about an idleless
>>> solution.  As I already wrote, I'm willing to help with scheduling in
>>> blk-mq.  I hope there will be the opportunity to find some way to go
>>> at KS.
>>
>> It'd be great to have a proportional control mechanism whose overhead
>> is acceptable.  Unfortunately, we don't have one now and nothing seems
>> right around the corner.  (Mostly) work-conserving throttling would be
>> fiddlier to use but is something which is useful regardless of such
>> proportional control mechanism and can be obtained relatively easily.
>>
>> I don't see why the two approaches would be mutually exclusive.
>>
>> Thanks.
>>
>> --
>> tejun
>
>
> --
> Paolo Valente
> Algogroup
> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
> Via Campi 213/B
> 41125 Modena - Italy
> http://algogroup.unimore.it/people/paolo/
>
>
>
>
>


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-16 Thread Paolo Valente

> Il giorno 14 ott 2016, alle ore 20:35, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Fri, Oct 14, 2016 at 07:13:41PM +0200, Paolo Valente wrote:
>> That said, your 'thus' seems a little too strong: "bfq does not yet
>> handle fast SSDs, thus we need something else".  What about the
>> millions of devices (and people) still within 10-20 K IOPS, and
>> experiencing awful latencies and lack of bandwidth guarantees?
> 
> I'm not objecting to any of that.

Ok, sorry for the misunderstanding.  I'm just more and more confused about
why a readily available solution that has not been proven wrong has not yet
been accepted, if everybody apparently acknowledges the problem.

>  My point just is that bfq, at least
> as currently implemented, is unfit for certain classes of use cases.
> 

Absolutely correct.

>>> FWIW, it looks like the only way we can implement proportional control
>>> on highspeed ssds with acceptable overhead
>> 
>> Maybe not: as I wrote to Vivek in a previous reply containing
>> pointers to documentation, we have already achieved twenty million
>> scheduling decisions per second with a prototype driving existing
>> proportional-share packet schedulers (essentially without
>> modifications).
> 
> And that doesn't require idling and thus doesn't severely impact
> utilization?
> 

Nope.  Packets are commonly assumed to be sent asynchronously.
I guess that discussing the validity of this assumption is out of the
scope of this thread.

Thanks,
Paolo

>>> is somehow finding a way to
>>> calculate the cost of each IO and throttle IOs according to that while
>>> controlling for latency as necessary.  Slice scheduling with idling
>>> seems too expensive with highspeed devices with high io depth.
>> 
>> Yes, that's absolutely true.  I'm already thinking about an idleless
>> solution.  As I already wrote, I'm willing to help with scheduling in
>> blk-mq.  I hope there will be the opportunity to find some way to go
>> at KS.
> 
> It'd be great to have a proportional control mechanism whose overhead
> is acceptable.  Unfortunately, we don't have one now and nothing seems
> right around the corner.  (Mostly) work-conserving throttling would be
> fiddlier to use but is something which is useful regardless of such
> proportional control mechanism and can be obtained relatively easily.
> 
> I don't see why the two approaches would be mutually exclusive.
> 
> Thanks.
> 
> -- 
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-14 Thread Paolo Valente

> Il giorno 14 ott 2016, alle ore 18:40, Tejun Heo  ha scritto:
> 
> Hello, Kyle.
> 
> On Sat, Oct 08, 2016 at 06:15:14PM -0700, Kyle Sanderson wrote:
>> How is this even a discussion when hard numbers, and trying any
>> reproduction case easily reproduce the issues that CFQ causes. Reading
>> this thread, and many others only grows not only my disappointment,
>> but whenever someone launches kterm or scrot and their machine
>> freezes, leaves a selective few individuals completely responsible for
>> this. Help those users, help yourself, help Linux.
> 
> So, just to be clear.  I wasn't arguing against bfq replacing cfq (or
> anything along that line) but that proportional control, as
> implemented, would be too costly for many use cases and thus we need
> something along the line of what Shaohua is proposing.
> 

Sorry for chiming in all the time, but the vision that you and some
other guys propose seems to miss an important piece (unless, sooner or
later, you patiently prove me wrong, or I finally understand on my own
why I'm wrong).

You are of course right: bfq, as a component of blk, and above all, as
a sort of derivative of CFQ (and of its overhead), currently has too
high an overhead to handle more than 10-20K IOPS.

That said, your 'thus' seems a little too strong: "bfq does not yet
handle fast SSDs, thus we need something else".  What about the
millions of devices (and people) still within 10-20 K IOPS, and
experiencing awful latencies and lack of bandwidth guarantees?

For certain systems or applications, it isn't even just a "buy a fast
SSD" matter, but a technological constraint.

> FWIW, it looks like the only way we can implement proportional control
> on highspeed ssds with acceptable overhead

Maybe not: as I wrote to Vivek in a previous reply containing
pointers to documentation, we have already achieved twenty million
scheduling decisions per second with a prototype driving existing
proportional-share packet schedulers (essentially without
modifications).

> is somehow finding a way to
> calculate the cost of each IO and throttle IOs according to that while
> controlling for latency as necessary.  Slice scheduling with idling
> seems too expensive with highspeed devices with high io depth.
> 

Yes, that's absolutely true.  I'm already thinking about an idleless
solution.  As I already wrote, I'm willing to help with scheduling in
blk-mq.  I hope there will be the opportunity to find some way to go
at KS.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





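The cost-model idea quoted above ("calculate the cost of each IO and
throttle IOs according to that") is easy to sketch in miniature.  The
toy C program below is only an illustration under an invented linear
cost model; it is not code from this patch series, and every constant,
struct and function name in it is made up:

/*
 * Toy sketch (not from this patch series): cost-based throttling.
 * Each IO is charged a cost from an invented linear model; each group
 * earns budget in proportion to its weight and may dispatch only while
 * it can pay, so heavy random IO is charged more than light sequential
 * IO.  All constants and names are made up for illustration.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct io {
	uint64_t sector;	/* start sector of the request */
	uint32_t bytes;		/* request size */
};

struct group {
	const char *name;
	unsigned weight;	/* relative share, like a cgroup weight */
	int64_t budget;		/* accumulated cost credit */
	uint64_t last_sector;	/* where the previous IO ended */
};

/* Invented cost model: fixed per-IO cost, per-sector transfer cost,
 * and a penalty when the request is not sequential. */
static int64_t io_cost(const struct group *g, const struct io *io)
{
	int64_t cost = 100;			/* per-IO overhead */
	cost += io->bytes / 512;		/* per-sector transfer */
	if (io->sector != g->last_sector)
		cost += 800;			/* non-sequential penalty */
	return cost;
}

/* Credit every group for one tick; 'capacity' is the estimated device
 * capacity per tick, split by weight. */
static void refill(struct group *grps, int n, int64_t capacity)
{
	unsigned total = 0;
	for (int i = 0; i < n; i++)
		total += grps[i].weight;
	for (int i = 0; i < n; i++)
		grps[i].budget += capacity * grps[i].weight / total;
}

/* Dispatch only if the group can pay for the IO; otherwise it has to
 * wait for the next refill.  No queue idling is involved. */
static bool try_dispatch(struct group *g, const struct io *io)
{
	int64_t cost = io_cost(g, io);
	if (g->budget < cost)
		return false;
	g->budget -= cost;
	g->last_sector = io->sector + io->bytes / 512;
	return true;
}

int main(void)
{
	struct group grps[2] = {
		{ .name = "groupA", .weight = 4 },
		{ .name = "groupB", .weight = 1 },
	};

	refill(grps, 2, 10000);			/* one tick of capacity */

	for (int i = 0; i < 2; i++) {
		struct io io = { .sector = 1000 * (i + 1), .bytes = 4096 };
		int n = 0;
		while (try_dispatch(&grps[i], &io)) {
			n++;
			io.sector += 1000;	/* stay non-sequential */
		}
		printf("%s (weight %u): %d IOs this tick\n",
		       grps[i].name, grps[i].weight, n);
	}
	return 0;
}

A real controller would additionally have to let a group overrun its
budget whenever the device would otherwise go idle (the "mostly
work-conserving" part) and would have to feed completion latencies back
into the capacity estimate ("controlling for latency as necessary"),
which is where most of the difficulty discussed in this thread lies.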


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Paolo Valente

> Il giorno 06 ott 2016, alle ore 21:57, Shaohua Li  ha scritto:
> 
> On Thu, Oct 06, 2016 at 09:58:44AM +0200, Paolo Valente wrote:
>> 
>>> Il giorno 05 ott 2016, alle ore 22:46, Shaohua Li  ha scritto:
>>> 
>>> On Wed, Oct 05, 2016 at 09:47:19PM +0200, Paolo Valente wrote:
 
> Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li  ha 
> scritto:
> 
> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>> Hello, Paolo.
>> 
>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>> In this respect, for your generic, unpredictable scenario to make
>>> sense, there must exist at least one real system that meets the
>>> requirements of such a scenario.  Or, if such a real system does not
>>> yet exist, it must be possible to emulate it.  If it is impossible to
>>> achieve this last goal either, then I miss the usefulness
>>> of looking for solutions for such a scenario.
>>> 
>>> That said, let's define the instance(s) of the scenario that you find
>>> most representative, and let's test BFQ on it/them.  Numbers will give
>>> us the answers.  For example, what about all or part of the following
>>> groups:
>>> . one cyclically doing random I/O for some second and then sequential 
>>> I/O
>>> for the next seconds
>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>> . one starting an application cyclically
>>> . one playing back or streaming a movie
>>> 
>>> For each group, we could then measure the time needed to complete each
>>> phase of I/O in each cycle, plus the responsiveness in the group
>>> starting an application, plus the frame drop in the group streaming
>>> the movie.  In addition, we can measure the bandwidth/iops enjoyed by
>>> each group, plus, of course, the aggregate throughput of the whole
>>> system.  In particular we could compare results with throttling, BFQ,
>>> and CFQ.
>>> 
>>> Then we could write resulting numbers on the stone, and stick to them
>>> until something proves them wrong.
>>> 
>>> What do you (or others) think about it?
>> 
>> That sounds great and yeah it's lame that we didn't start with that.
>> Shaohua, would it be difficult to compare how bfq performs against
>> blk-throttle?
> 
> I had a test of BFQ.
 
 Thank you very much for testing BFQ!
 
> I'm using BFQ found at
> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
> 4.7.0-v8r3.
 
 That's the latest stable version.  The development version [1] already
 contains further improvements for fairness, latency and throughput.
 It is however still a release candidate.
 
 [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
 
> It's a LSI SSD, queue depth 32. I use default setting. fio script
> is:
> 
> [global]
> ioengine=libaio
> direct=1
> readwrite=randread
> bs=4k
> runtime=60
> time_based=1
> file_service_type=random:36
> overwrite=1
> thread=0
> group_reporting=1
> filename=/dev/sdb
> iodepth=1
> numjobs=8
> 
> [groupA]
> prio=2
> 
> [groupB]
> new_group
> prio=6
> 
> I'll change iodepth, numjobs and prio in different tests. result unit is 
> MB/s.
> 
> iodepth=1 numjobs=1 prio 4:4
> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
> 
> iodepth=8 numjobs=1 prio 4:4
> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
> 
> iodepth=1 numjobs=8 prio 4:4
> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
> 
> iodepth=1 numjobs=1 prio 2:6
> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
> 
> iodepth=8 numjobs=1 prio 2:6
> CFQ: 166:174 BFQ: 139:72  deadline: 202:202
> 
> iodepth=1 numjobs=8 prio 2:6
> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
> 
> CFQ isn't fair at all. BFQ is very good in this respect, but has poor
> throughput even when prio is left at the default value.
> 
 
 Throughput is lower with BFQ for two reasons.
 
 First, you certainly left the low_latency in its default state, i.e.,
 on.  As explained, e.g., here [2], low_latency mode is totally geared
 towards maximum responsiveness and minimum latency for soft real-time
 applications (e.g., video players).  To achieve this goal, BFQ is
 willing to perform more idling, when necessary.  This lowers
 throughput (I'll get back on this at the end of the discussion of the
 second reason).
>>> 
>>> changing low_latency to 0 does not seem to change anything, at least for this test:
>>> iodepth=1 numjobs=1 prio 2:6 A bs 4k:64k
>>> 
 The second, most 

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Paolo Valente

> Il giorno 06 ott 2016, alle ore 20:32, Vivek Goyal  ha 
> scritto:
> 
> On Thu, Oct 06, 2016 at 08:01:42PM +0200, Paolo Valente wrote:
>> 
>>> Il giorno 06 ott 2016, alle ore 19:49, Vivek Goyal  ha 
>>> scritto:
>>> 
>>> On Thu, Oct 06, 2016 at 03:15:50PM +0200, Paolo Valente wrote:
>>> 
>>> [..]
 Shaohua, I have just realized that I have unconsciously defended a
 wrong argument.  Although all the facts that I have reported are
 evidently true, I have argued as if the question was: "do we need to
 throw away throttling because there is proportional, or do we need to
 throw away proportional share because there is throttling?".  This
 question is simply wrong, as I think consciously (sorry for my
 dissociated behavior :) ).
>>> 
>>> I was wondering about the same. We need both and both should be able 
>>> to work with fast devices of today using blk-mq interfaces without
>>> much overhead.
>>> 
 
 The best goal to achieve is to have both a good throttling mechanism
 and a good proportional-share scheduler.  This goal would be valid
 even if there were just one important scenario for each of the two
 approaches.  The sore point here is that you guys are constantly, and
 rightly, working on solutions to achieve and consolidate reasonable
 QoS guarantees, while an apparently very good proportional-share
 scheduler has been kept out for years.  If you (or others) have good
 arguments to support this state of affairs, then this would probably
 be an important point to discuss.
>>> 
>>> Paolo, CFQ is legacy now and if we can come up with a proportional
>>> IO mechanism which works reasonably well with fast devices using
>>> blk-mq interfaces, that will be much more interesting.
>>> 
>> 
>> That's absolutely true.  But why do we pretend not to know that, for
>> (at least) hundreds of thousands of users, Linux will go on giving bad
>> responsiveness, starvation, high latency and unfairness until blk is
>> no longer used (assuming that these problems will somehow disappear
>> with blk-mq)?  Many of these users are fully aware of these
>> long-standing Linux problems.  We could solve these problems by just
>> adding a scheduler that has already been adopted, and thus extensively
>> tested, by thousands of users.  And more and more people are aware of
>> this fact too.  Are we doing the right thing?
> 
> Hi Paolo,
> 

Hi

> People have been using CFQ for many years.

Yes, but allow me just to add that a lot of people have also been
unhappy with CFQ for many years.

> I am not sure if the benefits
> offered by BFQ over CFQ are significant enough to justify taking
> completely new code and getting rid of CFQ. Or are the benefits
> significant enough that one feels like putting time and effort into
> this and taking chances with new code.
> 

Although I think that BFQ's benefits are relevant (but I'm a bit of an
interested party :) ), I do agree that abruptly replacing the most
widely used I/O scheduler (AFAIK) with such a different one is at least
a little risky.

> At this point of time replacing CFQ with something better is not a
> priority for me.

ok

> But if something better and stable goes upstream, I
> will gladly use it.
> 

Then, in case of success, I will be glad to receive some feedback from
you, and possibly use it to improve the set of ideas that we have put
into BFQ.

Thank you,
Paolo

> Vivek


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Austin S. Hemmelgarn

On 2016-10-06 08:50, Paolo Valente wrote:



>> Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn
>>  ha scritto:
>>
>> On 2016-10-06 07:03, Mark Brown wrote:
>>> On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
>>>> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo  wrote:
>>>
>>>>> I get that bfq can be a good compromise on most desktop workloads and
>>>>> behave reasonably well for some server workloads with the slice
>>>>> expiration mechanism but it really isn't an IO resource partitioning
>>>>> mechanism.
>>>
>>>> Not just desktops, also Android phones.
>>>
>>>> So why not have BFQ as a separate scheduling policy upstream,
>>>> alongside CFQ, deadline and noop?
>>>
>>> Right.
>>>
>>>> We're already doing the per-usecase Kconfig thing for preemption.
>>>> But maybe somebody already hates that and want to get rid of it,
>>>> I don't know.
>>>
>>> Hannes also suggested going back to making BFQ a separate scheduler
>>> rather than replacing CFQ earlier, pointing out that it mitigates
>>> against the risks of changing CFQ substantially at this point (which
>>> seems to be the biggest issue here).
>>
>> ISTR that the original argument for this approach essentially amounted to: 'If
>> it's so much better, why do we need both?'.
>>
>> Such an argument is valid only if the new design is better in all respects
>> (which there isn't sufficient information to decide in this case), or the
>> negative aspects are worth the improvements (which is too workload specific to
>> decide for something like this).
>
> All correct, apart from the workload-specific issue, which is not very clear to
> me. Over the last five years I have not found a single workload for which CFQ
> is better than BFQ, and none has been suggested.

My point is that whether or not BFQ is better depends on the workload.
You can't test for every workload, so you can't say definitively that
BFQ is better for every workload.  At a minimum, there are workloads
where the deadline and noop schedulers are better, but they're very
domain specific workloads.  Based on the numbers from Shaohua, it looks
like CFQ has better throughput than BFQ, and that will affect some
workloads (for most, the improved fairness is worth the reduced
throughput, but there probably are some cases where it isn't).

> Anyway, leaving aside this fact, IMO the real problem here is that we are in a
> catch-22: "we want BFQ to replace CFQ, but, since CFQ is legacy code, then you
> cannot change, and thus replace, CFQ"

I agree that that's part of the issue, but I also don't entirely agree
with the reasoning on it.  Until blk-mq has proper I/O scheduling,
people will continue to use CFQ, and based on the way things are going,
it will be multiple months before that happens, whereas BFQ exists and
is working now.



Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Paolo Valente

> Il giorno 06 ott 2016, alle ore 09:58, Paolo Valente 
>  ha scritto:
> 
>> 
>> Il giorno 05 ott 2016, alle ore 22:46, Shaohua Li  ha scritto:
>> 
>> On Wed, Oct 05, 2016 at 09:47:19PM +0200, Paolo Valente wrote:
>>> 
 Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li  ha scritto:
 
 On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> Hello, Paolo.
> 
> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>> In this respect, for your generic, unpredictable scenario to make
>> sense, there must exist at least one real system that meets the
>> requirements of such a scenario.  Or, if such a real system does not
>> yet exist, it must be possible to emulate it.  If it is impossible to
>> achieve this last goal either, then I miss the usefulness
>> of looking for solutions for such a scenario.
>> 
>> That said, let's define the instance(s) of the scenario that you find
>> most representative, and let's test BFQ on it/them.  Numbers will give
>> us the answers.  For example, what about all or part of the following
>> groups:
>> . one cyclically doing random I/O for some second and then sequential I/O
>> for the next seconds
>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>> . one starting an application cyclically
>> . one playing back or streaming a movie
>> 
>> For each group, we could then measure the time needed to complete each
>> phase of I/O in each cycle, plus the responsiveness in the group
>> starting an application, plus the frame drop in the group streaming
>> the movie.  In addition, we can measure the bandwidth/iops enjoyed by
>> each group, plus, of course, the aggregate throughput of the whole
>> system.  In particular we could compare results with throttling, BFQ,
>> and CFQ.
>> 
>> Then we could write resulting numbers on the stone, and stick to them
>> until something proves them wrong.
>> 
>> What do you (or others) think about it?
> 
> That sounds great and yeah it's lame that we didn't start with that.
> Shaohua, would it be difficult to compare how bfq performs against
> blk-throttle?
 
 I had a test of BFQ.
>>> 
>>> Thank you very much for testing BFQ!
>>> 
 I'm using BFQ found at
 http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
 4.7.0-v8r3.
>>> 
>>> That's the latest stable version.  The development version [1] already
>>> contains further improvements for fairness, latency and throughput.
>>> It is however still a release candidate.
>>> 
>>> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>>> 
 It's a LSI SSD, queue depth 32. I use default setting. fio script
 is:
 
 [global]
 ioengine=libaio
 direct=1
 readwrite=randread
 bs=4k
 runtime=60
 time_based=1
 file_service_type=random:36
 overwrite=1
 thread=0
 group_reporting=1
 filename=/dev/sdb
 iodepth=1
 numjobs=8
 
 [groupA]
 prio=2
 
 [groupB]
 new_group
 prio=6
 
 I'll change iodepth, numjobs and prio in different tests. result unit is 
 MB/s.
 
 iodepth=1 numjobs=1 prio 4:4
 CFQ: 28:28 BFQ: 21:21 deadline: 29:29
 
 iodepth=8 numjobs=1 prio 4:4
 CFQ: 162:162 BFQ: 102:98 deadline: 205:205
 
 iodepth=1 numjobs=8 prio 4:4
 CFQ: 157:157 BFQ: 81:92 deadline: 196:197
 
 iodepth=1 numjobs=1 prio 2:6
 CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
 
 iodepth=8 numjobs=1 prio 2:6
 CFQ: 166:174 BFQ: 139:72  deadline: 202:202
 
 iodepth=1 numjobs=8 prio 2:6
 CFQ: 148:150 BFQ: 90:77 deadline: 198:197
 
 CFQ isn't fair at all. BFQ is very good in this respect, but has poor
 throughput even when prio is left at the default value.
 
>>> 
>>> Throughput is lower with BFQ for two reasons.
>>> 
>>> First, you certainly left the low_latency in its default state, i.e.,
>>> on.  As explained, e.g., here [2], low_latency mode is totally geared
>>> towards maximum responsiveness and minimum latency for soft real-time
>>> applications (e.g., video players).  To achieve this goal, BFQ is
>>> willing to perform more idling, when necessary.  This lowers
>>> throughput (I'll get back on this at the end of the discussion of the
>>> second reason).
>> 
>> changing low_latency to 0 does not seem to change anything, at least for this test:
>> iodepth=1 numjobs=1 prio 2:6 A bs 4k:64k
>> 
>>> The second, most important reason, is that a minimum of idling is the
>>> *only* way to achieve differentiated bandwidth distribution, as you
>>> requested by setting different ioprios.  I stress that this 

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Paolo Valente

> Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn 
>  ha scritto:
> 
> On 2016-10-06 07:03, Mark Brown wrote:
>> On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
>>> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo  wrote:
>> 
 I get that bfq can be a good compromise on most desktop workloads and
 behave reasonably well for some server workloads with the slice
 expiration mechanism but it really isn't an IO resource partitioning
 mechanism.
>> 
>>> Not just desktops, also Android phones.
>> 
>>> So why not have BFQ as a separate scheduling policy upstream,
>>> alongside CFQ, deadline and noop?
>> 
>> Right.
>> 
>>> We're already doing the per-usecase Kconfig thing for preemption.
>>> But maybe somebody already hates that and want to get rid of it,
>>> I don't know.
>> 
>> Hannes also suggested going back to making BFQ a separate scheduler
>> rather than replacing CFQ earlier, pointing out that it mitigates
>> against the risks of changing CFQ substantially at this point (which
>> seems to be the biggest issue here).
>> 
> ISTR that the original argument for this approach essentially amounted to: 
> 'If it's so much better, why do we need both?'.
> 
> Such an argument is valid only if the new design is better in all respects 
> (which there isn't sufficient information to decide in this case), or the 
> negative aspects are worth the improvements (which is too workload specific 
> to decide for something like this).

All correct, apart from the workload-specific issue, which is not very clear to 
me. Over the last five years I have not found a single workload for which CFQ 
is better than BFQ, and none has been suggested.

Anyway, leaving aside this fact, IMO the real problem here is that we are in a 
catch-22: "we want BFQ to replace CFQ, but, since CFQ is legacy code, then you 
cannot change, and thus replace, CFQ"

Thanks,
Paolo

--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Mark Brown
On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo  wrote:

> > I get that bfq can be a good compromise on most desktop workloads and
> > behave reasonably well for some server workloads with the slice
> > expiration mechanism but it really isn't an IO resource partitioning
> > mechanism.

> Not just desktops, also Android phones.

> So why not have BFQ as a separate scheduling policy upstream,
> alongside CFQ, deadline and noop?

Right.

> We're already doing the per-usecase Kconfig thing for preemption.
> But maybe somebody already hates that and want to get rid of it,
> I don't know.

Hannes also suggested going back to making BFQ a separate scheduler
rather than replacing CFQ earlier, pointing out that it mitigates
against the risks of changing CFQ substantially at this point (which
seems to be the biggest issue here).




Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Linus Walleij
On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo  wrote:

> I get that bfq can be a good compromise on most desktop workloads and
> behave reasonably well for some server workloads with the slice
> expiration mechanism but it really isn't an IO resource partitioning
> mechanism.

Not just desktops, also Android phones.

So why not have BFQ as a separate scheduling policy upstream,
alongside CFQ, deadline and noop?

I understand the CPU scheduler people's position that they want
one scheduler for everyone's everyday loads (except RT and
SCHED_DEADLINE) and I guess that is the source of the highlander
"there can be only one" argument, but note this:

kernel/Kconfig.preempt:

config PREEMPT_NONE
bool "No Forced Preemption (Server)"
config PREEMPT_VOLUNTARY
bool "Voluntary Kernel Preemption (Desktop)"
config PREEMPT
bool "Preemptible Kernel (Low-Latency Desktop)"

We're already doing the per-usecase Kconfig thing for preemption.
But maybe somebody already hates that and want to get rid of it,
I don't know.

Yours,
Linus Walleij


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Shaohua Li
On Wed, Oct 05, 2016 at 09:47:19PM +0200, Paolo Valente wrote:
> 
> > Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li  ha scritto:
> > 
> > On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> >> Hello, Paolo.
> >> 
> >> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> >>> In this respect, for your generic, unpredictable scenario to make
> >>> sense, there must exist at least one real system that meets the
> >>> requirements of such a scenario.  Or, if such a real system does not
> >>> yet exist, it must be possible to emulate it.  If it is impossible to
> >>> achieve this last goal either, then I miss the usefulness
> >>> of looking for solutions for such a scenario.
> >>> 
> >>> That said, let's define the instance(s) of the scenario that you find
> >>> most representative, and let's test BFQ on it/them.  Numbers will give
> >>> us the answers.  For example, what about all or part of the following
> >>> groups:
> >>> . one cyclically doing random I/O for some second and then sequential I/O
> >>> for the next seconds
> >>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
> >>> . one starting an application cyclically
> >>> . one playing back or streaming a movie
> >>> 
> >>> For each group, we could then measure the time needed to complete each
> >>> phase of I/O in each cycle, plus the responsiveness in the group
> >>> starting an application, plus the frame drop in the group streaming
> >>> the movie.  In addition, we can measure the bandwidth/iops enjoyed by
> >>> each group, plus, of course, the aggregate throughput of the whole
> >>> system.  In particular we could compare results with throttling, BFQ,
> >>> and CFQ.
> >>> 
> >>> Then we could write resulting numbers on the stone, and stick to them
> >>> until something proves them wrong.
> >>> 
> >>> What do you (or others) think about it?
> >> 
> >> That sounds great and yeah it's lame that we didn't start with that.
> >> Shaohua, would it be difficult to compare how bfq performs against
> >> blk-throttle?
> > 
> > I had a test of BFQ.
> 
> Thank you very much for testing BFQ!
> 
> > I'm using BFQ found at
> > http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
> > 4.7.0-v8r3.
> 
> That's the latest stable version.  The development version [1] already
> contains further improvements for fairness, latency and throughput.
> It is however still a release candidate.
> 
> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
> 
> > It's a LSI SSD, queue depth 32. I use default setting. fio script
> > is:
> > 
> > [global]
> > ioengine=libaio
> > direct=1
> > readwrite=randread
> > bs=4k
> > runtime=60
> > time_based=1
> > file_service_type=random:36
> > overwrite=1
> > thread=0
> > group_reporting=1
> > filename=/dev/sdb
> > iodepth=1
> > numjobs=8
> > 
> > [groupA]
> > prio=2
> > 
> > [groupB]
> > new_group
> > prio=6
> > 
> > I'll change iodepth, numjobs and prio in different tests. result unit is 
> > MB/s.
> > 
> > iodepth=1 numjobs=1 prio 4:4
> > CFQ: 28:28 BFQ: 21:21 deadline: 29:29
> > 
> > iodepth=8 numjobs=1 prio 4:4
> > CFQ: 162:162 BFQ: 102:98 deadline: 205:205
> > 
> > iodepth=1 numjobs=8 prio 4:4
> > CFQ: 157:157 BFQ: 81:92 deadline: 196:197
> > 
> > iodepth=1 numjobs=1 prio 2:6
> > CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
> > 
> > iodepth=8 numjobs=1 prio 2:6
> > CFQ: 166:174 BFQ: 139:72  deadline: 202:202
> > 
> > iodepth=1 numjobs=8 prio 2:6
> > CFQ: 148:150 BFQ: 90:77 deadline: 198:197
> > 
> > CFQ isn't fair at all. BFQ is very good in this respect, but has poor
> > throughput even when prio is left at the default value.
> > 
> 
> Throughput is lower with BFQ for two reasons.
> 
> First, you certainly left the low_latency in its default state, i.e.,
> on.  As explained, e.g., here [2], low_latency mode is totally geared
> towards maximum responsiveness and minimum latency for soft real-time
> applications (e.g., video players).  To achieve this goal, BFQ is
> willing to perform more idling, when necessary.  This lowers
> throughput (I'll get back on this at the end of the discussion of the
> second reason).

changing low_latency to 0 does not seem to change anything, at least for this test:
iodepth=1 numjobs=1 prio 2:6 A bs 4k:64k
 
> The second, most important reason, is that a minimum of idling is the
> *only* way to achieve differentiated bandwidth distribution, as you
> requested by setting different ioprios.  I stress that this constraint
> is not a technological accident, but an intrinsic, logical necessity.
> The proof is simple, and if the following explanation is too boring or
> confusing, I can show it to you with any trace of sync I/O.
> 
> First, to provide differentiated service, you need per-process
> scheduling, i.e., schedulers in which there is a separate 

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Shaohua Li
On Wed, Oct 05, 2016 at 09:57:22PM +0200, Paolo Valente wrote:
> 
> > Il giorno 05 ott 2016, alle ore 21:08, Shaohua Li  ha scritto:
> > 
> > On Wed, Oct 05, 2016 at 11:30:53AM -0700, Shaohua Li wrote:
> >> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> >>> Hello, Paolo.
> >>> 
> >>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>  In this respect, for your generic, unpredictable scenario to make
>  sense, there must exist at least one real system that meets the
>  requirements of such a scenario.  Or, if such a real system does not
>  yet exist, it must be possible to emulate it.  If it is impossible to
>  achieve this last goal either, then I miss the usefulness
>  of looking for solutions for such a scenario.
>  
>  That said, let's define the instance(s) of the scenario that you find
>  most representative, and let's test BFQ on it/them.  Numbers will give
>  us the answers.  For example, what about all or part of the following
>  groups:
>  . one cyclically doing random I/O for some second and then sequential I/O
>  for the next seconds
>  . one doing, say, quasi-sequential I/O in ON/OFF cycles
>  . one starting an application cyclically
>  . one playing back or streaming a movie
>  
>  For each group, we could then measure the time needed to complete each
>  phase of I/O in each cycle, plus the responsiveness in the group
>  starting an application, plus the frame drop in the group streaming
>  the movie.  In addition, we can measure the bandwidth/iops enjoyed by
>  each group, plus, of course, the aggregate throughput of the whole
>  system.  In particular we could compare results with throttling, BFQ,
>  and CFQ.
>  
>  Then we could write resulting numbers on the stone, and stick to them
>  until something proves them wrong.
>  
>  What do you (or others) think about it?
> >>> 
> >>> That sounds great and yeah it's lame that we didn't start with that.
> >>> Shaohua, would it be difficult to compare how bfq performs against
> >>> blk-throttle?
> >> 
> >> I had a test of BFQ. I'm using BFQ found at
> >> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
> >> 4.7.0-v8r3. It's a LSI SSD, queue depth 32. I use default setting. fio 
> >> script
> >> is:
> >> 
> >> [global]
> >> ioengine=libaio
> >> direct=1
> >> readwrite=randread
> >> bs=4k
> >> runtime=60
> >> time_based=1
> >> file_service_type=random:36
> >> overwrite=1
> >> thread=0
> >> group_reporting=1
> >> filename=/dev/sdb
> >> iodepth=1
> >> numjobs=8
> >> 
> >> [groupA]
> >> prio=2
> >> 
> >> [groupB]
> >> new_group
> >> prio=6
> >> 
> >> I'll change iodepth, numjobs and prio in different tests. result unit is 
> >> MB/s.
> >> 
> >> iodepth=1 numjobs=1 prio 4:4
> >> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
> >> 
> >> iodepth=8 numjobs=1 prio 4:4
> >> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
> >> 
> >> iodepth=1 numjobs=8 prio 4:4
> >> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
> >> 
> >> iodepth=1 numjobs=1 prio 2:6
> >> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
> >> 
> >> iodepth=8 numjobs=1 prio 2:6
> >> CFQ: 166:174 BFQ: 139:72  deadline: 202:202
> >> 
> >> iodepth=1 numjobs=8 prio 2:6
> >> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
> > 
> > More tests:
> > 
> > iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
> > CFQ:51:207  BFQ: 51:45  deadline: 51:216
> > 
> > iodepth=1 numjobs=1 prio 2:6, group A bs=4k, group B bs=64k
> > CFQ:25:249  BFQ: 23:42  deadline: 26:251
> > 
> 
> A true proportional share scheduler like BFQ works under the
> assumption that it is the only limiter of the bandwidth of its clients.
> And the availability of such a scheduler should apparently make
> bandwidth limiting useless: once you have a mechanism that allows you
> to give each group the desired fraction of the bandwidth, and to
> redistribute excess bandwidth seamlessly when needed, what do you need
> additional limiting for?
> 
> But I'm no expert on every possible system configuration or
> requirement.  So, if you have practical examples, I would really
> appreciate them.  And I don't think it will be difficult to see what
> goes wrong in BFQ with external bw limitation, and to fix the
> problem.

I think the test emulates a very common configuration. We assign more IO
resources to the high-priority workload, but such a workload doesn't always
dispatch enough IO. That's why I set a rate limit. When this happens, we want
the low-priority workload to use the disk bandwidth. That's the whole point of
disk sharing.

Thanks,
Shaohua

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Paolo Valente

> Il giorno 05 ott 2016, alle ore 21:47, Paolo Valente 
>  ha scritto:
> 
>> 
>> Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li  ha scritto:
>> 
>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>>> Hello, Paolo.
>>> 
>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
 In this respect, for your generic, unpredictable scenario to make
 sense, there must exist at least one real system that meets the
 requirements of such a scenario.  Or, if such a real system does not
 yet exist, it must be possible to emulate it.  If it is impossible to
 achieve this last goal either, then I miss the usefulness
 of looking for solutions for such a scenario.
 
 That said, let's define the instance(s) of the scenario that you find
 most representative, and let's test BFQ on it/them.  Numbers will give
 us the answers.  For example, what about all or part of the following
 groups:
 . one cyclically doing random I/O for some second and then sequential I/O
 for the next seconds
 . one doing, say, quasi-sequential I/O in ON/OFF cycles
 . one starting an application cyclically
 . one playing back or streaming a movie
 
 For each group, we could then measure the time needed to complete each
 phase of I/O in each cycle, plus the responsiveness in the group
 starting an application, plus the frame drop in the group streaming
 the movie.  In addition, we can measure the bandwidth/iops enjoyed by
 each group, plus, of course, the aggregate throughput of the whole
 system.  In particular we could compare results with throttling, BFQ,
 and CFQ.
 
 Then we could write resulting numbers on the stone, and stick to them
 until something proves them wrong.
 
 What do you (or others) think about it?
>>> 
>>> That sounds great and yeah it's lame that we didn't start with that.
>>> Shaohua, would it be difficult to compare how bfq performs against
>>> blk-throttle?
>> 
>> I had a test of BFQ.
> 
> Thank you very much for testing BFQ!
> 
>> I'm using BFQ found at
>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
>> 4.7.0-v8r3.
> 
> That's the latest stable version.  The development version [1] already
> contains further improvements for fairness, latency and throughput.
> It is however still a release candidate.
> 
> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
> 
>> It's a LSI SSD, queue depth 32. I use default setting. fio script
>> is:
>> 
>> [global]
>> ioengine=libaio
>> direct=1
>> readwrite=randread
>> bs=4k
>> runtime=60
>> time_based=1
>> file_service_type=random:36
>> overwrite=1
>> thread=0
>> group_reporting=1
>> filename=/dev/sdb
>> iodepth=1
>> numjobs=8
>> 
>> [groupA]
>> prio=2
>> 
>> [groupB]
>> new_group
>> prio=6
>> 
>> I'll change iodepth, numjobs and prio in different tests. result unit is 
>> MB/s.
>> 
>> iodepth=1 numjobs=1 prio 4:4
>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>> 
>> iodepth=8 numjobs=1 prio 4:4
>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>> 
>> iodepth=1 numjobs=8 prio 4:4
>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>> 
>> iodepth=1 numjobs=1 prio 2:6
>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>> 
>> iodepth=8 numjobs=1 prio 2:6
>> CFQ: 166:174 BFQ: 139:72  deadline: 202:202
>> 
>> iodepth=1 numjobs=8 prio 2:6
>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>> 
>> CFQ isn't fair at all. BFQ is very good in this respect, but has poor throughput
>> even when prio is left at the default value.
>> 
> 
> Throughput is lower with BFQ for two reasons.
> 
> First, you certainly left the low_latency in its default state, i.e.,
> on.  As explained, e.g., here [2], low_latency mode is totally geared
> towards maximum responsiveness and minimum latency for soft real-time
> applications (e.g., video players).  To achieve this goal, BFQ is
> willing to perform more idling, when necessary.  This lowers
> throughput (I'll get back on this at the end of the discussion of the
> second reason).
> 
> The second, most important reason, is that a minimum of idling is the
> *only* way to achieve differentiated bandwidth distribution, as you
> requested by setting different ioprios.  I stress that this constraint
> is not a technological accident, but an intrinsic, logical necessity.
> The proof is simple, and if the following explanation is too boring or
> confusing, I can show it to you with any trace of sync I/O.
> 
> First, to provide differentiated service, you need per-process
> scheduling, i.e., schedulers in which there is a separate queue
> associated with each process.  Now, let A be the process with higher
> weight (ioprio), and B the process with lower weight.  Both processes
> are sync, thus, by definition, they issue requests as follows: a few
> requests (probably two, or a little bit more with larger iodepth),
> then a little break to wait for request completion, then the next
> small batch and so on.  For each 
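
The archive cuts the proof off at this point.  The argument can be
illustrated with a small discrete-time simulation: two synchronous
readers with weights 2:1, a device that serves one request per tick,
and readers that "think" for two ticks after each completion.
Everything below is an invented toy, not BFQ's actual logic:

/*
 * Toy discrete-time sketch of the argument above (invented numbers,
 * not BFQ code).  Two synchronous readers A and B with weights 2:1;
 * the device serves one request per tick and each reader "thinks" for
 * two ticks after every completion before issuing its next request.
 * With idle_window == 0 the scheduler is purely work-conserving; with
 * a small idle_window it waits briefly for the in-service process.
 */
#include <stdio.h>

struct proc {
	const char *name;
	int weight;		/* desired share of the device */
	int think;		/* ticks between completion and next request */
	int next_arrival;	/* tick at which the next request is queued */
	long served;		/* requests dispatched so far */
};

/* Process the fairness criterion owes service to: smallest
 * served/weight (the usual virtual-time argument). */
static int fair_pick(struct proc *p, int n)
{
	int best = 0;
	for (int i = 1; i < n; i++)
		if (p[i].served * p[best].weight < p[best].served * p[i].weight)
			best = i;
	return best;
}

static void run(struct proc *p, int n, int ticks, int idle_window)
{
	for (int t = 0; t < ticks; t++) {
		int pick = fair_pick(p, n);

		if (p[pick].next_arrival > t) {
			/* The process we owe service to is "thinking". */
			if (idle_window && p[pick].next_arrival - t <= idle_window)
				continue;	/* idle, waiting for it */
			/* Work-conserving: serve whoever is backlogged. */
			pick = -1;
			for (int i = 0; i < n; i++)
				if (p[i].next_arrival <= t &&
				    (pick < 0 ||
				     p[i].served * p[pick].weight <
				     p[pick].served * p[i].weight))
					pick = i;
			if (pick < 0)
				continue;	/* nothing to dispatch */
		}
		p[pick].served++;
		p[pick].next_arrival = t + 1 + p[pick].think;
	}
	for (int i = 0; i < n; i++)
		printf("  %s (weight %d): %ld requests\n",
		       p[i].name, p[i].weight, p[i].served);
}

int main(void)
{
	struct proc a = { "A", 2, 2, 0, 0 }, b = { "B", 1, 2, 0, 0 };
	struct proc wc[2] = { a, b }, idl[2] = { a, b };

	printf("work-conserving, no idling:\n");
	run(wc, 2, 100000, 0);
	printf("with a short idle window on the in-service queue:\n");
	run(idl, 2, 100000, 3);
	return 0;
}

Run long enough, the purely work-conserving variant converges to
roughly a 1:1 split whatever the weights, because whenever A is
thinking the device is handed to B; the variant that idles briefly on
A's queue preserves the 2:1 ratio, at the cost of leaving the device
idle part of the time.  That is exactly the throughput versus
differentiated-bandwidth trade-off described above.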

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Shaohua Li
On Wed, Oct 05, 2016 at 11:30:53AM -0700, Shaohua Li wrote:
> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> > Hello, Paolo.
> > 
> > On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> > > In this respect, for your generic, unpredictable scenario to make
> > > sense, there must exist at least one real system that meets the
> > > requirements of such a scenario.  Or, if such a real system does not
> > > yet exist, it must be possible to emulate it.  If it is impossible to
> > > achieve this last goal either, then I miss the usefulness
> > > of looking for solutions for such a scenario.
> > > 
> > > That said, let's define the instance(s) of the scenario that you find
> > > most representative, and let's test BFQ on it/them.  Numbers will give
> > > us the answers.  For example, what about all or part of the following
> > > groups:
> > > . one cyclically doing random I/O for some second and then sequential I/O
> > > for the next seconds
> > > . one doing, say, quasi-sequential I/O in ON/OFF cycles
> > > . one starting an application cyclically
> > > . one playing back or streaming a movie
> > > 
> > > For each group, we could then measure the time needed to complete each
> > > phase of I/O in each cycle, plus the responsiveness in the group
> > > starting an application, plus the frame drop in the group streaming
> > > the movie.  In addition, we can measure the bandwidth/iops enjoyed by
> > > each group, plus, of course, the aggregate throughput of the whole
> > > system.  In particular we could compare results with throttling, BFQ,
> > > and CFQ.
> > > 
> > > Then we could write resulting numbers on the stone, and stick to them
> > > until something proves them wrong.
> > > 
> > > What do you (or others) think about it?
> > 
> > That sounds great and yeah it's lame that we didn't start with that.
> > Shaohua, would it be difficult to compare how bfq performs against
> > blk-throttle?
> 
> I had a test of BFQ. I'm using BFQ found at
> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
> 4.7.0-v8r3. It's a LSI SSD, queue depth 32. I use default setting. fio script
> is:
> 
> [global]
> ioengine=libaio
> direct=1
> readwrite=randread
> bs=4k
> runtime=60
> time_based=1
> file_service_type=random:36
> overwrite=1
> thread=0
> group_reporting=1
> filename=/dev/sdb
> iodepth=1
> numjobs=8
> 
> [groupA]
> prio=2
> 
> [groupB]
> new_group
> prio=6
> 
> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
> 
> iodepth=1 numjobs=1 prio 4:4
> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
> 
> iodepth=8 numjobs=1 prio 4:4
> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
> 
> iodepth=1 numjobs=8 prio 4:4
> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
> 
> iodepth=1 numjobs=1 prio 2:6
> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
> 
> iodepth=8 numjobs=1 prio 2:6
> CFQ: 166:174 BFQ: 139:72  deadline: 202:202
> 
> iodepth=1 numjobs=8 prio 2:6
> CFQ: 148:150 BFQ: 90:77 deadline: 198:197

More tests:

iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
CFQ:51:207  BFQ: 51:45  deadline: 51:216

iodepth=1 numjobs=1 prio 2:6, group A bs=4k, group B bs=64k
CFQ:25:249  BFQ: 23:42  deadline: 26:251

Thanks,
Shaohua


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Shaohua Li
On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> Hello, Paolo.
> 
> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> > In this respect, for your generic, unpredictable scenario to make
> > sense, there must exist at least one real system that meets the
> > requirements of such a scenario.  Or, if such a real system does not
> > yet exist, it must be possible to emulate it.  If it is impossible to
> > achieve this last goal either, then I miss the usefulness
> > of looking for solutions for such a scenario.
> > 
> > That said, let's define the instance(s) of the scenario that you find
> > most representative, and let's test BFQ on it/them.  Numbers will give
> > us the answers.  For example, what about all or part of the following
> > groups:
> > . one cyclically doing random I/O for some second and then sequential I/O
> > for the next seconds
> > . one doing, say, quasi-sequential I/O in ON/OFF cycles
> > . one starting an application cyclically
> > . one playing back or streaming a movie
> > 
> > For each group, we could then measure the time needed to complete each
> > phase of I/O in each cycle, plus the responsiveness in the group
> > starting an application, plus the frame drop in the group streaming
> > the movie.  In addition, we can measure the bandwidth/iops enjoyed by
> > each group, plus, of course, the aggregate throughput of the whole
> > system.  In particular we could compare results with throttling, BFQ,
> > and CFQ.
> > 
> > Then we could write resulting numbers on the stone, and stick to them
> > until something proves them wrong.
> > 
> > What do you (or others) think about it?
> 
> That sounds great and yeah it's lame that we didn't start with that.
> Shaohua, would it be difficult to compare how bfq performs against
> blk-throttle?

I ran a test of BFQ. I'm using the BFQ found at
http://algogroup.unimore.it/people/paolo/disk_sched/sources.php; the version is
4.7.0-v8r3. It's an LSI SSD, queue depth 32. I use the default settings. The fio
script is:
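# Two job groups of 4k random readers against the same device; 'prio' sets the
# per-job I/O priority (0-7, best-effort class, lower value = higher priority).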

[global]
ioengine=libaio
direct=1
readwrite=randread
bs=4k
runtime=60
time_based=1
file_service_type=random:36
overwrite=1
thread=0
group_reporting=1
filename=/dev/sdb
iodepth=1
numjobs=8

[groupA]
prio=2

[groupB]
new_group
prio=6

I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.

iodepth=1 numjobs=1 prio 4:4
CFQ: 28:28 BFQ: 21:21 deadline: 29:29

iodepth=8 numjobs=1 prio 4:4
CFQ: 162:162 BFQ: 102:98 deadline: 205:205

iodepth=1 numjobs=8 prio 4:4
CFQ: 157:157 BFQ: 81:92 deadline: 196:197

iodepth=1 numjobs=1 prio 2:6
CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29

iodepth=8 numjobs=1 prio 2:6
CFQ: 166:174 BFQ: 139:72  deadline: 202:202

iodepth=1 numjobs=8 prio 2:6
CFQ: 148:150 BFQ: 90:77 deadline: 198:197

CFQ isn't fair at all. BFQ is very good in this respect, but has poor throughput
even when prio is left at the default value.

Thanks,
Shaohua


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Tejun Heo
Hello, Paolo.

On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> In this respect, for your generic, unpredictable scenario to make
> sense, there must exist at least one real system that meets the
> requirements of such a scenario.  Or, if such a real system does not
> yet exist, it must be possible to emulate it.  If it is impossible to
> achieve this last goal either, then I miss the usefulness
> of looking for solutions for such a scenario.
> 
> That said, let's define the instance(s) of the scenario that you find
> most representative, and let's test BFQ on it/them.  Numbers will give
> us the answers.  For example, what about all or part of the following
> groups:
> . one cyclically doing random I/O for some second and then sequential I/O
> for the next seconds
> . one doing, say, quasi-sequential I/O in ON/OFF cycles
> . one starting an application cyclically
> . one playing back or streaming a movie
> 
> For each group, we could then measure the time needed to complete each
> phase of I/O in each cycle, plus the responsiveness in the group
> starting an application, plus the frame drop in the group streaming
> the movie.  In addition, we can measure the bandwidth/iops enjoyed by
> each group, plus, of course, the aggregate throughput of the whole
> system.  In particular we could compare results with throttling, BFQ,
> and CFQ.
> 
> Then we could write resulting numbers on the stone, and stick to them
> until something proves them wrong.
> 
> What do you (or others) think about it?

That sounds great and yeah it's lame that we didn't start with that.
Shaohua, would it be difficult to compare how bfq performs against
blk-throttle?

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Paolo Valente

> Il giorno 05 ott 2016, alle ore 15:12, Vivek Goyal  ha 
> scritto:
> 
> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> 
> [..]
>> Anyway, to avoid going on with trying speculations and arguments, let
>> me retry with a practical proposal.  BFQ is out there, free.  Let's
>> just test, measure and check whether we have already a solution to
>> the problems you/we are still trying to solve in Linux.
> 
> Hi Paolo,
> 
> Does the BFQ implementation scale for fast storage devices using the blk-mq
> interface? We will want to make sure that the locking and other overhead of
> BFQ is very minimal so that overall throughput does not suffer.
> 

Of course BFQ needs to be modified to work in blk-mq.  I'm rather sure
its overhead will then be small enough, just because I have already
collaborated on a basically equivalent port from single to multi-queue for
packet scheduling (with Luigi Rizzo and others), and our prototype can
make over 15 million scheduling decisions per second, and keep latency
low, even with tens of concurrent clients running on a multi-core,
multi-socket system.

For details, here is the paper [1], plus some slides [2].

Actually, the solution in [1] is a global scheduler, which is more
complex than the first blk-mq version of BFQ that I have in mind,
namely, partitioned scheduling, in which there should be one
independent scheduler instance per core.  But this is still investigation
territory. BTW, I would really appreciate help/feedback on this task [3].

Thanks,
Paolo

[1] http://info.iet.unipi.it/~luigi/papers/20160921-pspat.pdf
[2] http://info.iet.unipi.it/~luigi/pspat/
[3] https://marc.info/?l=linux-kernel&m=147066540916339&w=2

> Vivek
> 


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/
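
For what it's worth, the "one independent scheduler instance per core"
layout mentioned above can be sketched as follows.  This is a
hypothetical toy in plain C with pthreads, not BFQ or blk-mq code; all
names, sizes and the trivial FIFO policy are invented:

/*
 * Rough sketch (hypothetical, not BFQ or kernel code) of a partitioned
 * layout: one scheduler instance per core, each with its own lock and
 * queues, so submissions on different cores never contend on a shared
 * lock.  A single global instance would funnel every core through one
 * mutex instead.
 */
#include <pthread.h>
#include <stdio.h>

#define NCPU 4
#define QDEPTH 256

struct request { unsigned long sector; unsigned int bytes; };

struct sched_instance {
	pthread_mutex_t lock;		/* contended only by its own core */
	struct request fifo[QDEPTH];	/* stand-in for real per-core queues */
	int head, tail;
};

static struct sched_instance sched[NCPU];

/* Submission path: touches only the instance of the submitting core. */
static int submit(int cpu, struct request rq)
{
	struct sched_instance *s = &sched[cpu];
	int ret = 0;

	pthread_mutex_lock(&s->lock);
	if ((s->tail + 1) % QDEPTH == s->head)
		ret = -1;			/* queue full */
	else {
		s->fifo[s->tail] = rq;
		s->tail = (s->tail + 1) % QDEPTH;
	}
	pthread_mutex_unlock(&s->lock);
	return ret;
}

/* Dispatch path: likewise per core.  Scheduling decisions are local,
 * which keeps overhead low but also means fairness is only enforced
 * within one core, not across the whole system. */
static int dispatch(int cpu, struct request *rq)
{
	struct sched_instance *s = &sched[cpu];
	int ret = -1;

	pthread_mutex_lock(&s->lock);
	if (s->head != s->tail) {
		*rq = s->fifo[s->head];
		s->head = (s->head + 1) % QDEPTH;
		ret = 0;
	}
	pthread_mutex_unlock(&s->lock);
	return ret;
}

int main(void)
{
	struct request rq = { .sector = 2048, .bytes = 4096 }, out;

	for (int i = 0; i < NCPU; i++)
		pthread_mutex_init(&sched[i].lock, NULL);

	submit(1, rq);
	printf("cpu1 dispatch: %s\n", dispatch(1, &out) == 0 ? "got request" : "empty");
	printf("cpu0 dispatch: %s\n", dispatch(0, &out) == 0 ? "got request" : "empty");
	return 0;
}

The point of the layout is that submission and dispatch on one core
never touch another core's lock; the flip side, as noted above, is that
each instance only sees its own queues, so fairness across cores is the
part that remains investigation territory.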







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Vivek Goyal
On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:

[..]
> Anyway, to avoid going on with trying speculations and arguments, let
> me retry with a practical proposal.  BFQ is out there, free.  Let's
> just test, measure and check whether we have already a solution to
> the problems you/we are still trying to solve in Linux.

Hi Paolo,

Does the BFQ implementation scale for fast storage devices using the
blk-mq interface? We will want to make sure that the locking and other
overhead of BFQ is minimal, so that overall throughput does not suffer.

Vivek



Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-05 Thread Paolo Valente

> On 4 Oct 2016, at 22:27, Tejun Heo wrote:
> 
> Hello, Paolo.
> 
> On Tue, Oct 04, 2016 at 09:29:48PM +0200, Paolo Valente wrote:
>>> Hmm... I think we already discussed this but here's a really simple
>>> case.  There are three unknown workloads A, B and C and we want to
>>> give A certain best-effort guarantees (let's say around 80% of the
>>> underlying device) whether A is sharing the device with B or C.
>> 
>> That's the same example that you proposed to me in our previous
>> discussion.  For this example I showed you, with many boring numbers,
>> that with BFQ you get the most accurate distribution of the resource.
> 
> Yes, it is about the same example and what I understood was that
> "accurate distribution of the resources" holds as long as the
> randomness is incidental (ie. due to layout on the filesystem and so
> on) with the slice expiration mechanism offsetting the actually random
> workloads.
> 

For completeness, this property holds whatever the workloads are, even
if they change over time.

>> If you have enough stamina, I can repeat them again.  To save your
> 
> I'll go back to the thread and re-read them.
> 

Maybe we can make this less boring, see the end of this email.

>> patience, here is a very brief summary.  In a concrete use case, the
>> unknown workloads turn into something like this: there will be a first
>> time interval during which A happens to be, say, sequential, B happens
>> to be, say, random and C happens to be, say, quasi-sequential.  Then
>> there will be a next time interval during which their characteristics
>> change, and so on.  It is easy (but boring, I acknowledge it) to show
>> that, for each of these time intervals BFQ provides the best possible
>> service in terms of fairness, bandwidth distribution, stability and so
>> on.  Why?  Because of the elastic bandwidth-time scheduling of BFQ
>> that we already discussed, and because BFQ is naturally accurate in
>> redistributing aggregate throughput proportionally, when needed.
> 
> Yeah, that's what I remember, and for workloads above a certain level
> of randomness their time consumption is mapped to bandwidth, right?
> 

Exactly.

>>> I get that bfq can be a good compromise on most desktop workloads and
>>> behave reasonably well for some server workloads with the slice
>>> expiration mechanism but it really isn't an IO resource partitioning
>>> mechanism.
>> 
>> Right.  My argument is that BFQ enables you to give to each client the
>> bandwidth and low-latency guarantees you want.  And this IMO is way
>> better than partitioning a resource and then getting unavoidable
>> unfairness and high latency.
> 
> But that statement only holds while bw is the main thing to guarantee,
> no?  The level of isolation that we're looking for here is fairly
> strict adherence to sub/few-milliseconds in terms of high percentile
> scheduling latency while within the configured bw/iops limits, not
> "overall this device is being used pretty well".
> 

Guaranteeing such short-term latencies, while providing not just
bandwidth limits but also a proportional-share distribution of the
bandwidth, is the reason why we devised BFQ years ago.

Anyway, to avoid going on with trying speculations and arguments, let
me retry with a practical proposal.  BFQ is out there, free.  Let's
just test, measure and check whether we have already a solution to
the problems you/we are still trying to solve in Linux.

In this respect, for your generic, unpredictable scenario to make
sense, there must exist at least one real system that meets the
requirements of such a scenario.  Or, if such a real system does not
yet exist, it must at least be possible to emulate it.  If even that is
impossible, then I fail to see the usefulness of looking for solutions
to such a scenario.

That said, let's define the instance(s) of the scenario that you find
most representative, and let's test BFQ on it/them.  Numbers will give
us the answers.  For example, what about all or part of the following
groups:
. one cyclically doing random I/O for some seconds and then sequential I/O
for the next few seconds
. one doing, say, quasi-sequential I/O in ON/OFF cycles
. one starting an application cyclically
. one playing back or streaming a movie

For each group, we could then measure the time needed to complete each
phase of I/O in each cycle, plus the responsiveness in the group
starting an application, plus the frame drop in the group streaming
the movie.  In addition, we can measure the bandwidth/iops enjoyed by
each group, plus, of course, the aggregate throughput of the whole
system.  In particular we could compare results with throttling, BFQ,
and CFQ.

Then we could write the resulting numbers in stone, and stick to them
until something proves them wrong.

What do you (or others) think about it?
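
To make the proposal more concrete, here is a rough sketch of how the
measurement loop could be driven with fio.  In a real test the groups
would run concurrently, each in its own cgroup; the device name, job
definitions and scheduler list below are placeholders, not an
agreed-upon benchmark (and which schedulers are actually selectable
depends on the kernel).

#!/usr/bin/env python3
import json, subprocess

DEV = "sdb"                          # placeholder device
SCHEDULERS = ["cfq", "bfq", "none"]  # "none" plus a blk-throttle setup,
                                     # configured separately, would be the
                                     # throttling case

JOBS = {
    # Only two placeholder phases; the real groups would include the
    # cyclic, ON/OFF, application-start and movie-streaming cases above.
    "random_phase": ["--rw=randread", "--runtime=10", "--time_based"],
    "seq_phase":    ["--rw=read", "--runtime=10", "--time_based"],
}

def set_scheduler(name):
    with open(f"/sys/block/{DEV}/queue/scheduler", "w") as f:
        f.write(name)

def run_job(name, args):
    out = subprocess.run(
        ["fio", f"--name={name}", f"--filename=/dev/{DEV}",
         "--direct=1", "--output-format=json"] + args,
        capture_output=True, text=True, check=True)
    res = json.loads(out.stdout)["jobs"][0]["read"]
    return res["runtime"], res["bw"]   # runtime (ms), bandwidth (KiB/s)

for sched in SCHEDULERS:
    set_scheduler(sched)
    for job, args in JOBS.items():
        runtime, bw = run_job(job, args)
        print(f"{sched:>5} {job:>14}: {runtime} ms, {bw} KiB/s")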

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello, Paolo.

On Tue, Oct 04, 2016 at 09:29:48PM +0200, Paolo Valente wrote:
> > Hmm... I think we already discussed this but here's a really simple
> > case.  There are three unknown workloads A, B and C and we want to
> > give A certain best-effort guarantees (let's say around 80% of the
> > underlying device) whether A is sharing the device with B or C.
> 
> That's the same example that you proposed to me in our previous
> discussion.  For this example I showed you, with many boring numbers,
> that with BFQ you get the most accurate distribution of the resource.

Yes, it is about the same example and what I understood was that
"accurate distribution of the resources" holds as long as the
randomness is incidental (ie. due to layout on the filesystem and so
on) with the slice expiration mechanism offsetting the actually random
workloads.

> If you have enough stamina, I can repeat them again.  To save your

I'll go back to the thread and re-read them.

> patience, here is a very brief summary.  In a concrete use case, the
> unknown workloads turn into something like this: there will be a first
> time interval during which A happens to be, say, sequential, B happens
> to be, say, random and C happens to be, say, quasi-sequential.  Then
> there will be a next time interval during which their characteristics
> change, and so on.  It is easy (but boring, I acknowledge it) to show
> that, for each of these time intervals BFQ provides the best possible
> service in terms of fairness, bandwidth distribution, stability and so
> on.  Why?  Because of the elastic bandwidth-time scheduling of BFQ
> that we already discussed, and because BFQ is naturally accurate in
> redistributing aggregate throughput proportionally, when needed.

Yeah, that's what I remember, and for workloads above a certain level
of randomness their time consumption is mapped to bandwidth, right?

> > I get that bfq can be a good compromise on most desktop workloads and
> > behave reasonably well for some server workloads with the slice
> > expiration mechanism but it really isn't an IO resource partitioning
> > mechanism.
> 
> Right.  My argument is that BFQ enables you to give to each client the
> bandwidth and low-latency guarantees you want.  And this IMO is way
> better than partitioning a resource and then getting unavoidable
> unfairness and high latency.

But that statement only holds while bw is the main thing to guarantee,
no?  The level of isolation that we're looking for here is fairly
strict adherence to sub/few-milliseconds in terms of high percentile
scheduling latency while within the configured bw/iops limits, not
"overall this device is being used pretty well".

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> On 4 Oct 2016, at 21:14, Tejun Heo wrote:
> 
> Hello, Paolo.
> 
> On Tue, Oct 04, 2016 at 09:02:47PM +0200, Paolo Valente wrote:
>> That's exactly what BFQ has succeeded in doing in all the tests
>> devised so far.  Can you give me a concrete example that I can try
>> with BFQ and with any other mechanism you deem better?  If you are
>> right, the numbers will just make your point.
> 
> Hmm... I think we already discussed this but here's a really simple
> case.  There are three unknown workloads A, B and C and we want to
> give A certain best-effort guarantees (let's say around 80% of the
> underlying device) whether A is sharing the device with B or C.
> 

That's the same example that you proposed to me in our previous
discussion.  For this example I showed you, with many boring numbers,
that with BFQ you get the most accurate distribution of the resource.

If you have enough stamina, I can repeat them again.  To save your
patience, here is a very brief summary.  In a concrete use case, the
unknown workloads turn into something like this: there will be a first
time interval during which A happens to be, say, sequential, B happens
to be, say, random and C happens to be, say, quasi-sequential.  Then
there will be a next time interval during which their characteristics
change, and so on.  It is easy (but boring, I acknowledge it) to show
that, for each of these time intervals BFQ provides the best possible
service in terms of fairness, bandwidth distribution, stability and so
on.  Why?  Because of the elastic bandwidth-time scheduling of BFQ
that we already discussed, and because BFQ is naturally accurate in
redistributing aggregate throughput proportionally, when needed.

> I get that bfq can be a good compromise on most desktop workloads and
> behave reasonably well for some server workloads with the slice
> expiration mechanism but it really isn't an IO resource partitioning
> mechanism.
> 

Right.  My argument is that BFQ enables you to give to each client the
bandwidth and low-latency guarantees you want.  And this IMO is way
better than partitioning a resource and then getting unavoidable
unfairness and high latency.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello, Paolo.

On Tue, Oct 04, 2016 at 09:02:47PM +0200, Paolo Valente wrote:
> That's exactly what BFQ has succeeded in doing in all the tests
> devised so far.  Can you give me a concrete example that I can try
> with BFQ and with any other mechanism you deem better?  If you are
> right, the numbers will just make your point.

Hmm... I think we already discussed this but here's a really simple
case.  There are three unknown workloads A, B and C and we want to
give A certain best-effort guarantees (let's say around 80% of the
underlying device) whether A is sharing the device with B or C.
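
For reference, with a generic proportional-share scheduler the example
above would typically be expressed with weights, e.g. 800 for A and 100
each for B and C; the toy sketch below only shows the resulting share
arithmetic (the numbers are purely illustrative).

weights = {"A": 800, "B": 100, "C": 100}

def shares(active):
    # Each active group gets weight / sum of active weights.
    total = sum(weights[w] for w in active)
    return {w: weights[w] / total for w in active}

print(shares({"A", "B", "C"}))  # A gets 800/1000 = 80%
print(shares({"A", "B"}))       # A gets 800/900 ~ 89% (>= 80% guaranteed)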

I get that bfq can be a good compromise on most desktop workloads and
behave reasonably well for some server workloads with the slice
expiration mechanism but it really isn't an IO resource partitioning
mechanism.

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> On 4 Oct 2016, at 17:56, Tejun Heo wrote:
> 
> Hello, Vivek.
> 
> On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
>> On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
>>> Hi,
>>> 
>>> The background is we don't have an ioscheduler for blk-mq yet, so we can't
>>> prioritize processes/cgroups.
>> 
>> So this is an interim solution till we have ioscheduler for blk-mq?
> 
> It's a common permanent solution which applies to both !mq and mq.
> 
>>> This patch set tries to add basic arbitration
>>> between cgroups with blk-throttle. It adds a new limit io.high for
>>> blk-throttle. It's only for cgroup2.
>>> 
>>> io.max is a hard limit throttling. cgroups with a max limit never
>>> dispatch more IO than their max limit. io.high, in contrast, is a best
>>> effort throttling. cgroups with a high limit can run above their high
>>> limit at appropriate times. Specifically, if all cgroups reach their
>>> high limit, all cgroups can run above their high limit. If any cgroup
>>> runs under its high limit, all other cgroups will run according to
>>> their high limit.
>> 
>> Hi Shaohua,
>> 
>> I still don't understand why we should not implement a weight-based
>> proportional IO mechanism, and how this mechanism is better than
>> proportional IO.
> 
> Oh, if we actually can implement proportional IO control, it'd be
> great.  The problem is that we have no way of knowing IO cost for
> highspeed ssd devices.  CFQ gets around the problem by using the
> walltime as the measure of resource usage and scheduling time slices,
> which works fine for rotating disks but horribly for highspeed ssds.
> 

Could you please elaborate more on this point?  BFQ uses sectors
served to measure service, and, on all the fast devices on which we
have tested it, it accurately distributes bandwidth as desired,
redistributes excess bandwidth without any issue, and guarantees high
responsiveness and low latency at the application and system level
(e.g., ~0 drop rate in video playback, with any background workload
tested).

Could you please suggest some test showing how sector-based
guarantees fail?
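
As a toy illustration of what using sectors served as the service
measure means, here is a generic weighted-fair-queueing sketch; this is
not BFQ's actual algorithm, just the underlying idea of charging each
queue by sectors/weight and always serving the least-charged queue.

class Queue:
    def __init__(self, name, weight):
        self.name, self.weight = name, weight
        self.vservice = 0.0    # sectors served / weight
        self.pending = []      # request sizes, in sectors

def dispatch(queues):
    backlogged = [q for q in queues if q.pending]
    if not backlogged:
        return None
    q = min(backlogged, key=lambda q: q.vservice)
    sectors = q.pending.pop(0)
    q.vservice += sectors / q.weight   # charge service in sectors, not time
    return q.name, sectors

# Example: A has 8x the weight of B; over many requests A receives ~8x
# the sectors, regardless of whether its pattern is sequential or random.
A, B = Queue("A", 800), Queue("B", 100)
A.pending = [8] * 1000
B.pending = [8] * 1000
served = {"A": 0, "B": 0}
for _ in range(1000):
    name, sectors = dispatch([A, B])
    served[name] += sectors
print(served)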

Thanks,
Paolo

> We can get some semblance of proportional control by just counting bw
> or iops but both break down badly as a means to measure the actual
> resource consumption depending on the workload.  While limit based
> control is more tedious to configure, it doesn't misrepresent what's
> going on and is a lot less likely to produce surprising outcomes.
> 
> We *can* try to concoct something which tries to do proportional
> control for highspeed ssds but that's gonna be quite a bit of
> complexity and I'm not so sure it'd be justifiable given that we can't
> even figure out measurement of the most basic operating unit.
> 
>> Agreed that we have issues with proportional IO and we don't have
>> good solutions for these problems. But I can't see how this mechanism
>> will overcome these problems either.
> 
> It mostly defers the burden to the one who's configuring the limits
> and expects it to know the characteristics of the device and workloads
> and configure accordingly.  It's quite a bit more tedious to use but
> should be able to cover good portion of use cases without being overly
> complicated.  I agree that it'd be nice to have a simple proportional
> control but as you said can't see a good solution for it at the
> moment.
> 
>> IIRC, the biggest issue with proportional IO was that a low-prio
>> group might fill up the device queue with plenty of IO requests, and
>> later, when a high-prio cgroup comes along, it will still experience
>> latencies anyway. And the solution to the problem would probably be to
>> get some awareness into the device about request priority and map
>> weights to those priorities. That way higher-prio requests get
>> prioritized.
> 
> Nah, the real problem is that we can't even decide what the
> proportions should be based on.  The most fundamental part is missing.
> 
>> Or run the device at a lower queue depth. That will improve latencies
>> but might reduce overall throughput.
> 
> And that we can't do this (and thus basically operate close to
> scheduling time slices) for highspeed ssds.
> 
>> Or throttle the number of buffered writes (as Jens's writeback
>> throttling patches were doing). Buffered writes seem to be the biggest
>> culprit for increased latencies, and being able to control these
>> should help.
> 
> That's a different topic.
> 
>> An ioprio/weight-based proportional IO mechanism is much more generic
>> and much easier to configure for any kind of storage. io.high is an
>> absolute limit and is much harder to configure. One needs to know a
>> lot about the underlying volume/device's bandwidth (which varies a lot
>> anyway based on the workload).
> 
> Yeap, no disagreement there, but it still is a workable solution.
> 
>> IMHO, we seem to be trying to cater to one specific use case using
>> this mechanism. Something ioprio/weight based will be much more
>> generic, and we should explore implementing that along with building
>> the notion of ioprio into devices. When these two work together, we
>> might be able to see good results. A software mechanism alone might
>> not be enough.

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello, Vivek.

On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
> On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> > Hi,
> > 
> > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > prioritize processes/cgroups.
> 
> So this is an interim solution till we have ioscheduler for blk-mq?

It's a common permanent solution which applies to both !mq and mq.

> > This patch set tries to add basic arbitration
> > between cgroups with blk-throttle. It adds a new limit io.high for
> > blk-throttle. It's only for cgroup2.
> > 
> > io.max is a hard limit throttling. cgroups with a max limit never
> > dispatch more IO than their max limit. io.high, in contrast, is a best
> > effort throttling. cgroups with a high limit can run above their high
> > limit at appropriate times. Specifically, if all cgroups reach their
> > high limit, all cgroups can run above their high limit. If any cgroup
> > runs under its high limit, all other cgroups will run according to
> > their high limit.
> 
> Hi Shaohua,
> 
> I still don't understand why we should not implement a weight-based
> proportional IO mechanism, and how this mechanism is better than
> proportional IO.

Oh, if we actually can implement proportional IO control, it'd be
great.  The problem is that we have no way of knowing IO cost for
highspeed ssd devices.  CFQ gets around the problem by using the
walltime as the measure of resource usage and scheduling time slices,
which works fine for rotating disks but horribly for highspeed ssds.

We can get some semblance of proportional control by just counting bw
or iops but both break down badly as a means to measure the actual
resource consumption depending on the workload.  While limit based
control is more tedious to configure, it doesn't misrepresent what's
going on and is a lot less likely to produce surprising outcomes.

We *can* try to concoct something which tries to do proportional
control for highspeed ssds but that's gonna be quite a bit of
complexity and I'm not so sure it'd be justifiable given that we can't
even figure out measurement of the most basic operating unit.
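
To illustrate why that most basic operating unit is hard to pin down,
here is a hypothetical and deliberately naive per-IO cost model; the
coefficients below are invented, they would have to be measured per
device, and they drift with the workload mix, which is exactly the
calibration problem described above.

COEFF = {
    # (random?, write?): (per_io_cost, per_byte_cost) -- made-up numbers
    (False, False): (1.0, 0.0005),   # sequential read
    (False, True):  (1.2, 0.0010),   # sequential write
    (True,  False): (2.0, 0.0005),   # random read
    (True,  True):  (4.0, 0.0010),   # random write
}

def io_cost(size_bytes, random, write):
    per_io, per_byte = COEFF[(random, write)]
    return per_io + per_byte * size_bytes

# The same number of bytes can carry a wildly different "cost":
# 64 random 4 KiB writes vs one sequential 256 KiB read.
print(io_cost(4096, True, True) * 64)
print(io_cost(262144, False, False))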

> Agreed that we have issues with proportional IO and we don't have
> good solutions for these problems. But I can't see how this mechanism
> will overcome these problems either.

It mostly defers the burden to the one who's configuring the limits
and expects it to know the characteristics of the device and workloads
and configure accordingly.  It's quite a bit more tedious to use but
should be able to cover good portion of use cases without being overly
complicated.  I agree that it'd be nice to have a simple proportional
control but as you said can't see a good solution for it at the
moment.

> IIRC, the biggest issue with proportional IO was that a low-prio
> group might fill up the device queue with plenty of IO requests, and
> later, when a high-prio cgroup comes along, it will still experience
> latencies anyway. And the solution to the problem would probably be to
> get some awareness into the device about request priority and map
> weights to those priorities. That way higher-prio requests get
> prioritized.

Nah, the real problem is that we can't even decide what the
proportions should be based on.  The most fundamental part is missing.

> Or run the device at a lower queue depth. That will improve latencies
> but might reduce overall throughput.

And that we can't do this (and thus basically operate close to
scheduling time slices) for highspeed ssds.

> Or throttle the number of buffered writes (as Jens's writeback
> throttling patches were doing). Buffered writes seem to be the biggest
> culprit for increased latencies, and being able to control these
> should help.

That's a different topic.

> An ioprio/weight-based proportional IO mechanism is much more generic
> and much easier to configure for any kind of storage. io.high is an
> absolute limit and is much harder to configure. One needs to know a
> lot about the underlying volume/device's bandwidth (which varies a lot
> anyway based on the workload).

Yeap, no disagreement there, but it still is a workable solution.

> IMHO, we seem to be trying to cater to one specific use case using
> this mechanism. Something ioprio/weight based will be much more
> generic, and we should explore implementing that along with building
> the notion of ioprio into devices. When these two work together, we
> might be able to see good results. A software mechanism alone might
> not be enough.

I don't think it's catering to specific use cases.  It is a generic
mechanism which demands knowledge and experimentation to configure.
It's more a way for the kernel to cop out and defer figuring out
device characteristics to userland.  If you have a better idea, I'm
all ears.

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Vivek Goyal
On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> Hi,
> 
> The background is we don't have an ioscheduler for blk-mq yet, so we can't
> prioritize processes/cgroups.

So this is an interim solution till we have ioscheduler for blk-mq?

> This patch set tries to add basic arbitration
> between cgroups with blk-throttle. It adds a new limit io.high for
> blk-throttle. It's only for cgroup2.
> 
> io.max is a hard limit throttling. cgroups with a max limit never
> dispatch more IO than their max limit. io.high, in contrast, is a best
> effort throttling. cgroups with a high limit can run above their high
> limit at appropriate times. Specifically, if all cgroups reach their
> high limit, all cgroups can run above their high limit. If any cgroup
> runs under its high limit, all other cgroups will run according to
> their high limit.
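
A minimal sketch of the arbitration rule quoted above (purely
illustrative, not the blk-throttle implementation; 'usage' and 'high'
are in the same arbitrary bandwidth units): the high limit is enforced
only while at least one cgroup is still running below its own high
limit.

def allowed_to_exceed_high(groups):
    """groups: dict name -> {'usage': ..., 'high': ...}.
    True if every group has reached its high limit, in which case all
    of them may run above it; otherwise high limits are enforced."""
    return all(g["usage"] >= g["high"] for g in groups.values())

groups = {
    "fast": {"usage": 120, "high": 100},
    "slow": {"usage":  40, "high": 100},
}
# "slow" is under its high limit, so "fast" must be throttled back to 100.
print(allowed_to_exceed_high(groups))   # False

groups["slow"]["usage"] = 100
# Now every group has reached its high limit, so all may exceed it.
print(allowed_to_exceed_high(groups))   # True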

Hi Shaohua,

I still don't understand why we should not implement a weight-based
proportional IO mechanism, and how this mechanism is better than
proportional IO.

Agreed that we have issues with proportional IO and we don't have
good solutions for these problems. But I can't see how this mechanism
will overcome these problems either.

IIRC, the biggest issue with proportional IO was that a low-prio
group might fill up the device queue with plenty of IO requests, and
later, when a high-prio cgroup comes along, it will still experience
latencies anyway. And the solution to the problem would probably be to
get some awareness into the device about request priority and map
weights to those priorities. That way higher-prio requests get
prioritized.

Or run the device at a lower queue depth. That will improve latencies
but might reduce overall throughput.
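
For reference, one common way to reduce the effective queue depth from
userspace is to lower the block layer's nr_requests for the device; a
tiny sketch (the device name and value are placeholders, and writing
the sysfs file requires root):

def set_nr_requests(dev, value):
    # Lower values tend to improve tail latency at the cost of
    # throughput, as noted above.
    path = f"/sys/block/{dev}/queue/nr_requests"
    with open(path, "w") as f:
        f.write(str(value))

set_nr_requests("sdb", 32)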

Or throttle the number of buffered writes (as Jens's writeback
throttling patches were doing). Buffered writes seem to be the biggest
culprit for increased latencies, and being able to control these
should help.

An ioprio/weight-based proportional IO mechanism is much more generic
and much easier to configure for any kind of storage. io.high is an
absolute limit and is much harder to configure. One needs to know a
lot about the underlying volume/device's bandwidth (which varies a lot
anyway based on the workload).

IMHO, we seem to be trying to cater to one specific use case using
this mechanism. Something ioprio/weight based will be much more
generic, and we should explore implementing that along with building
the notion of ioprio into devices. When these two work together, we
might be able to see good results. A software mechanism alone might
not be enough.

Vivek