Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-06-13 Thread Jan Kara
On Wed 12-06-19 12:36:53, Srivatsa S. Bhat wrote:
> 
> [ Adding Greg to CC ]
> 
> On 6/12/19 6:04 AM, Jan Kara wrote:
> > On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
> >> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
> >>> On 5/30/19 3:45 AM, Paolo Valente wrote:
> 
> >> [...]
>  At any rate, since you pointed out that you are interested in
>  out-of-the-box performance, let me complete the context: in case
>  low_latency is left set, one gets, in return for this 12% loss,
>  a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>  times of applications under load [1];
>  b) 500-1000% higher throughput in multi-client server workloads, as I
>  already pointed out [2].
> 
> >>>
> >>> I'm very happy that you could solve the problem without having to
> >>> compromise on any of the performance characteristics/features of BFQ!
> >>>
> >>>
>  I'm going to prepare complete patches.  In addition, if ok for you,
>  I'll report these results on the bug you created.  Then I guess we can
>  close it.
> 
> >>>
> >>> Sounds great!
> >>>
> >>
> >> Hi Paolo,
> >>
> >> Hope you are doing great!
> >>
> >> I was wondering if you got a chance to post these patches to LKML for
> >> review and inclusion... (No hurry, of course!)
> >>
> >> Also, since your fixes address the performance issues in BFQ, do you
> >> have any thoughts on whether they can be adapted to CFQ as well, to
> >> benefit the older stable kernels that still support CFQ?
> > 
> > Since CFQ doesn't exist in the current upstream kernel anymore, I seriously
> > doubt you'll be able to get any performance improvements for it in the
> > stable kernels...
> > 
> 
> I suspected as much, but that is unfortunate nonetheless. The latest LTS
> kernel is based on 4.19, which still supports CFQ. It would have been
> great to have a process for addressing significant issues on older
> kernels too.

Well, you could still reduce the performance difference by tuning the
slice_idle and group_idle knobs for CFQ (in
/sys/block/<device>/queue/iosched/).  Setting these to lower values will
reduce the throughput loss when switching between cgroups, at the cost of
lower accuracy in enforcing the configured IO proportions among cgroups.
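
For example, assuming the device is sda and it is using CFQ (adjust both
for your setup), something like

  echo 1 > /sys/block/sda/queue/iosched/slice_idle
  echo 1 > /sys/block/sda/queue/iosched/group_idle

shrinks the per-switch idle window from the default 8 ms to 1 ms, and
writing 0 disables idling altogether (maximum throughput, weakest isolation).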

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-06-13 Thread Jens Axboe
On 6/12/19 1:36 PM, Srivatsa S. Bhat wrote:
> 
> [ Adding Greg to CC ]
> 
> On 6/12/19 6:04 AM, Jan Kara wrote:
>> On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
>>> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
 On 5/30/19 3:45 AM, Paolo Valente wrote:
>
>>> [...]
> At any rate, since you pointed out that you are interested in
> out-of-the-box performance, let me complete the context: in case
> low_latency is left set, one gets, in return for this 12% loss,
> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
> times of applications under load [1];
> b) 500-1000% higher throughput in multi-client server workloads, as I
> already pointed out [2].
>

 I'm very happy that you could solve the problem without having to
 compromise on any of the performance characteristics/features of BFQ!


> I'm going to prepare complete patches.  In addition, if ok for you,
> I'll report these results on the bug you created.  Then I guess we can
> close it.
>

 Sounds great!

>>>
>>> Hi Paolo,
>>>
>>> Hope you are doing great!
>>>
>>> I was wondering if you got a chance to post these patches to LKML for
>>> review and inclusion... (No hurry, of course!)
>>>
>>> Also, since your fixes address the performance issues in BFQ, do you
>>> have any thoughts on whether they can be adapted to CFQ as well, to
>>> benefit the older stable kernels that still support CFQ?
>>
>> Since CFQ doesn't exist in the current upstream kernel anymore, I seriously
>> doubt you'll be able to get any performance improvements for it in the
>> stable kernels...
>>
> 
> I suspected as much, but that is unfortunate nonetheless. The latest LTS
> kernel is based on 4.19, which still supports CFQ. It would have been
> great to have a process for addressing significant issues on older
> kernels too.
> 
> Greg, do you have any thoughts on this? The context is that both CFQ
> and BFQ I/O schedulers have issues that cause I/O throughput to drop by
> up to 10x-30x on certain workloads and system configurations, as
> reported in [1].
> 
> In this thread, Paolo posted patches to fix BFQ performance on
> mainline. However, CFQ suffers from the same performance collapse, and
> CFQ was removed from the kernel in v5.0, so the usual stable
> backporting path obviously won't work here, for several reasons:
> 
>1. There won't be a mainline commit to backport from, as CFQ no
>   longer exists in mainline.
> 
>2. This is not a security/stability fix, and is likely to involve
>   invasive changes.
> 
> I was wondering if there was a way to address the performance issues
> in CFQ in the older stable kernels (including the latest LTS 4.19),
> despite the above constraints, since the performance drop is much too
> significant. I guess not, but thought I'd ask :-)
> 
> [1]. 
> https://lore.kernel.org/lkml/8d72fcf7-bbb4-2965-1a06-e9fc177a8...@csail.mit.edu/

This issue has always been there. There will be no specific patches made
for stable for something that doesn't even exist in the newer kernels.

-- 
Jens Axboe



Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-05-22 Thread Paolo Valente


> On 22 May 2019, at 12:01, Srivatsa S. Bhat wrote:
> 
> On 5/22/19 2:09 AM, Paolo Valente wrote:
>> 
>> First, thank you very much for testing my patches, and, above all, for
>> sharing those huge traces!
>> 
>> According to your traces, the residual 20% throughput loss that you
>> record is due to the fact that the BFQ injection mechanism takes a few
>> hundredths of a second to stabilize at the beginning of the workload.
>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>> that you see without this new patch.  After that time, there
>> seems to be no loss, according to the trace.
>> 
>> The problem is that a loss lasting only a few hundredths of a second is
>> nevertheless not negligible for a write workload that lasts only 3-4
>> seconds.  Could you please try writing a larger file?
>> 
> 
> I tried running dd for longer (about 100 seconds), but still saw around
> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s and 1.6 MB/s with
> mq-deadline and noop.

Ok, then the cause is now the periodic reset of the mechanism.

It would be super easy to fill this gap by simply gearing the mechanism
toward very aggressive injection.  The problem is maintaining
control.  As you can imagine from the performance gap between CFQ (or
BFQ with malfunctioning injection) and BFQ with this fix, it is very
hard to maximize throughput while at the same time preserving control
over per-group I/O.

On the bright side, you might be interested in one of the benefits
that BFQ gives in return for this ~10% loss of throughput, in a
scenario that may be important for you (according to the affiliation you
report): from ~500% to ~1000% higher throughput when you have to serve
the I/O of multiple VMs, while guaranteeing that no VM is starved [1].
The same holds for multiple clients or containers, and in general for
any set of entities that may compete for storage.

[1] 
https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/

> But I'm not too worried about that difference.
> 
>> In addition, I wanted to ask you whether you measured BFQ throughput
>> with traces disabled.  This may make a difference.
>> 
> 
> The above result (1.4 MB/s) was obtained with traces disabled.
> 
>> After trying to write a larger file, you can try with low_latency on.
>> On my side, it causes results to become a little unstable across
>> repetitions (which is expected).
>> 
> With low_latency on, I get between 60 KB/s - 100 KB/s.
> 

Gosh, a full regression.  Fortunately, it is simply meaningless to use
low_latency in a scenario where the goal is to guarantee per-group
bandwidths.  Low-latency heuristics, to reach their (low-latency)
goals, modify the I/O schedule away from the best schedule for
honoring group weights and boosting throughput.  So, as recommended in
the BFQ documentation, just switch low_latency off if you want to control
I/O with groups.  It may still make sense to leave low_latency on
in some specific cases, which I don't want to bother you with here.
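
Just for completeness, switching it off is a one-liner (sda here is only
an example of a device currently using BFQ):

  echo 0 > /sys/block/sda/queue/iosched/low_latency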

However, I feel bad about such a low throughput :)  Would you be so
kind as to provide me with a trace?

Thanks,
Paolo

> Regards,
> Srivatsa
> VMware Photon OS





Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-05-22 Thread Srivatsa S. Bhat
On 5/22/19 2:12 AM, Paolo Valente wrote:
> 
>> On 22 May 2019, at 11:02, Srivatsa S. Bhat wrote:
>>
>>
>> Let's continue here on LKML itself.
> 
> Just done :)
> 
>> The only reason I created the
>> bugzilla entry is to attach the tarball of the traces, assuming
>> that it would allow me to upload a 20 MB file (since attaching it to an
>> email didn't work). But bugzilla's file size limit is much smaller than
>> that, so that didn't work out either, and I resorted to using dropbox.
>> So we don't need the bugzilla entry anymore; I might as well close it
>> to avoid confusion.
>>
> 
> No no, don't close it: it can reach people who don't use LKML.  We
> just have to remember to report back there at the end of this.

Ah, good point!

>  BTW, I also
> think that the bug is incorrectly filed against 5.1, while all these
> tests and results concern 5.2-rcX.
> 

Fixed now, thank you for pointing that out!
 
Regards,
Srivatsa
VMware Photon OS


Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-05-22 Thread Srivatsa S. Bhat
On 5/22/19 2:09 AM, Paolo Valente wrote:
> 
> First, thank you very much for testing my patches, and, above all, for
> sharing those huge traces!
> 
> According to your traces, the residual 20% throughput loss that you
> record is due to the fact that the BFQ injection mechanism takes a few
> hundredths of a second to stabilize at the beginning of the workload.
> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
> that you see without this new patch.  After that time, there
> seems to be no loss, according to the trace.
> 
> The problem is that a loss lasting only a few hundredths of a second is
> nevertheless not negligible for a write workload that lasts only 3-4
> seconds.  Could you please try writing a larger file?
> 

I tried running dd for longer (about 100 seconds), but still saw around
1.4 MB/s throughput with BFQ, and between 1.5 MB/s and 1.6 MB/s with
mq-deadline and noop. But I'm not too worried about that difference.
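
For reference, a run of that length can be approximated with something like
the following (the count value here is only an illustration, sized so that
dd stays busy for on the order of 100 seconds at ~1.4 MB/s):

  dd if=/dev/zero of=/root/test.img bs=512 count=250000 oflag=dsync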

> In addition, I wanted to ask you whether you measured BFQ throughput
> with traces disabled.  This may make a difference.
> 

The above result (1.4 MB/s) was obtained with traces disabled.

> After trying to write a larger file, you can try with low_latency on.
> On my side, it causes results to become a little unstable across
> repetitions (which is expected).
> 
With low_latency on, I get between 60 KB/s - 100 KB/s.

Regards,
Srivatsa
VMware Photon OS


Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-05-21 Thread Paolo Valente


> On 20 May 2019, at 12:19, Paolo Valente wrote:
> 
> 
> 
>> On 18 May 2019, at 22:50, Srivatsa S. Bhat wrote:
>> 
>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for BFQ,
>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>> for you.
>>> 
>> 
>> Hi Paolo,
>> 
>> Thank you for looking into this!
>> 
>> I just tried current mainline at commit 72cf0b07, but unfortunately
>> didn't see any improvement:
>> 
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 
>> With mq-deadline, I get:
>> 
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>> 
>> With bfq, I get:
>> 
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>> 
> 
> Hi Srivatsa,
> thanks for reproducing this on mainline.  I seem to have reproduced a
> bonsai-tree version of this issue.

Hi again Srivatsa,
I've analyzed the trace, and I've found the cause of the loss of
throughput on my side.  To find out whether it is the same cause as
on your side, I've prepared a script that executes your test and takes
a trace during the test.  If it is ok for you, could you please
- change the value of the DEVS parameter in the attached script, if
  needed
- execute the script
- send me the trace file that the script will leave in your working
  dir

Looking forward to your trace,
Paolo



dsync_test.sh
Description: Binary data

>  Before digging into the block
> trace, I'd like to ask you for some feedback.
> 
> First, in my test, the total throughput of the disk happens to be
> about 20 times as high as that enjoyed by dd, regardless of the I/O
> scheduler.  I guess this massive overhead is normal with dsync, but
> I'd like to know whether it is about the same on your side.  This will
> help me understand whether I'll actually be analyzing roughly the same
> problem as yours.
> 
> Second, the commands I used follow.  Do they implement your test case
> correctly?
> 
> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 14.6892 s, 349 kB/s
> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 20.1953 s, 254 kB/s
> 
> Thanks,
> Paolo
> 
>> Please let me know if any more info about my setup might be helpful.
>> 
>> Thank you!
>> 
>> Regards,
>> Srivatsa
>> VMware Photon OS
>> 
>>> 
 On 18 May 2019, at 00:16, Srivatsa S. Bhat wrote:
 
 
 Hi,
 
 One of my colleagues noticed an up to 10x-30x drop in I/O throughput
 when running the following command with the CFQ I/O scheduler:
 
 dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
 
 Throughput with CFQ: 60 KB/s
 Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
 
 I spent some time looking into it and found that this is caused by an
 undesirable interaction between 4 different components:
 
 - blkio cgroup controller enabled
 - ext4 with the jbd2 kthread running in the root blkio cgroup
 - dd running on ext4, in any other blkio cgroup than that of jbd2
 - CFQ I/O scheduler with defaults for slice_idle and group_idle
 
 
 When docker is enabled, systemd creates a blkio cgroup called
 system.slice to run system services (and docker) under it, and a
 separate blkio cgroup called user.slice for user processes. So, when
 dd is invoked, it runs under user.slice.
 
 The dd command above includes the dsync flag, which performs an
 fdatasync after every write to the output file. Since dd is writing to
 a file on ext4, jbd2 will be active, committing transactions
 corresponding to those fdatasync requests from dd. (In other words, dd
 depends on jbd2 in order to make forward progress.) But jbd2, being a
 kernel thread, runs in the root blkio cgroup, as opposed to dd, which
 runs under user.slice.
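 
 For illustration, one way to see this split on a live system is to compare
 the blkio cgroup of the shell running dd with that of the jbd2 thread
 (pgrep -o simply picks one jbd2 thread; adjust if several ext4 filesystems
 are mounted):
 
   grep blkio /proc/self/cgroup               # shell that launches dd -> user.slice
   grep blkio /proc/$(pgrep -o jbd2)/cgroup   # jbd2 kernel thread -> root cgroup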
 
 Now, if the I/O scheduler in use for the underlying block device is
 CFQ, then its inter-queue/inter-group idling takes effect (via the
 slice_idle and group_idle parameters, both of which default to 8ms).
 Therefore, every time CFQ switches between processing requests from dd
 and jbd2, this 8ms idle time is injected, which slows down the overall
 throughput tremendously!
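 
 For reference, these defaults can be checked directly (sda is just an
 example of a device currently using CFQ):
 
   cat /sys/block/sda/queue/iosched/slice_idle   # 8 by default (ms)
   cat /sys/block/sda/queue/iosched/group_idle   # 8 by default (ms)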
 
 To verify this theory, I tried various experiments, and in all cases,
 the 4 pre-conditions listed above were necessary to reproduce the
 performance drop.

Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-05-20 Thread Jan Kara
On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
> > I've addressed these issues in my last batch of improvements for
> > BFQ, which landed in the upcoming 5.2. If you give it a try, and
> > still see the problem, then I'll be glad to reproduce it, and
> > hopefully fix it for you.
> 
> Hi Paolo, I'm curious if you could give a quick summary about what you
> changed in BFQ?
> 
> I was considering adding support so that if userspace calls fsync(2)
> or fdatasync(2), we attach the process's CSS to the transaction, and
> then charge all of the journal metadata writes to the process's CSS.  If
> there are multiple fsync's batched into the transaction, the first
> process which forced the early transaction commit would get charged
> the entire journal write.  OTOH, journal writes are sequential I/O, so
> the amount of disk time for writing the journal is going to be
> relatively small, and, in particular, the amount of work from other
> cgroups is going to be minimal, especially if they hadn't issued an
> fsync().

But this makes priority-inversion problems with the ext4 journal worse,
doesn't it? If we submit the journal commit in the blkio cgroup of some
random process, it may get throttled, which then effectively blocks the
whole filesystem. Or do you want to implement a more complex back-pressure
mechanism where you'd just account to a different blkio cgroup during the
journal commit and then throttle at a different point, where you are not
blocking other tasks from making progress?

> In the case where you have three cgroups all issuing fsync(2) and they
> all landed in the same jbd2 transaction thanks to commit batching, in
> the ideal world we would split up the disk time usage equally across
> those three cgroups.  But it's probably not worth doing that...
> 
> That being said, we probably do need some BFQ support, since in the
> case where we have multiple processes doing buffered writes w/o fsync,
> we do charge the data=ordered writeback to each block cgroup.  Worse,
> the commit can't complete until all of the data integrity
> writebacks have completed.  And if there are N cgroups with dirty
> inodes, and slice_idle is set to 8ms, there is going to be 8*N ms worth
> of idle time tacked onto the commit time.

Yeah. At least in some cases, we know there won't be any more IO from a
particular cgroup in the near future (e.g. a transaction commit completing,
or when the layers above the IO scheduler already know which IO they are
going to submit next), and in that case idling is just a waste of time. But
so far I haven't decided what a reasonably clean interface for this, one
that isn't specific to a particular IO scheduler implementation, should
look like.

Honza
-- 
Jan Kara 
SUSE Labs, CR