Re: Understanding BTRFS RAID0 Performance

2018-10-08 Thread Austin S. Hemmelgarn

On 2018-10-05 20:34, Duncan wrote:

Wilson, Ellis posted on Fri, 05 Oct 2018 15:29:52 +0000 as excerpted:


Is there any tuning in BTRFS that limits the number of outstanding reads
at a time to a small single-digit number, or something else that could
be behind small queue depths?  I can't otherwise imagine what the
difference would be on the read path between ext4 vs btrfs when both are
on mdraid.


It seems I forgot to directly answer that question in my first reply.
Thanks for restating it.

Btrfs doesn't really expose much performance tuning (yet?), at least
outside the code itself.  There are a few very limited knobs, but they're
just that, few and limited or broad-stroke.

There are mount options like ssd/nossd, ssd_spread/nossd_spread, the
space_cache set of options (see below), flushoncommit/noflushoncommit,
commit=<seconds>, etc. (see the btrfs(5) manpage), but nothing really to
influence stride length, etc, or to optimize chunk placement between ssd
and non-ssd devices, for instance.

And there are a few filesystem features, normally set at mkfs.btrfs time
(and thus covered in the mkfs.btrfs manpage), some of which can be
tuned later.  Generally, the defaults have changed over time to
reflect the best case, and the older variants are there primarily to
retain backward compatibility with old kernels and tools that didn't
handle the newer variants.

That said, as I think about it there are some tunables that may be worth
experimenting with.  Most or all of these are covered in the btrfs (5)
manpage.

* Given the large device numbers you mention and raid0, you're likely
dealing with multi-TB-scale filesystems.  At this level, the
space_cache=v2 mount option may be useful.  It's not the default yet as
btrfs check, etc, don't yet handle it, but given your raid0 choice you
may not be concerned about that.  It need only be given once, after which
v2 stays "on" for the filesystem until explicitly turned off.

* Consider experimenting with the thread_pool=n mount option.  I've seen
very little discussion of this one, but given your interest in
parallelization, it could make a difference.
Probably not as much as you might think.  I'll explain in more detail 
further down, where this comes up again.
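If you do want to try it, a minimal experiment might look like the
following (the value 16 is arbitrary, purely for illustration; device
and mount point are placeholders):

# mount with a larger worker thread pool, then rerun the fio job and
# compare against a mount with the default
mount -o thread_pool=16 /dev/md0 /mnt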


* Possibly the commit=<seconds> (default 30) mount option.  In theory,
upping this may allow better write merging, tho your interest seems to be
more on the read side, and the commit time has consequences at crash time.
Based on my own experience, a higher commit time doesn't noticeably 
impact read or write performance, and doesn't really help much with 
write merging either.  All it really does is reduce overhead a bit, and 
it's not even all that effective at that.
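If you want to measure it for yourself anyway, the interval can be
changed on a live filesystem; for example (120 is just an illustrative
value, and the mount point is a placeholder):

# lengthen the commit interval to 120 seconds on an already-mounted fs;
# note this widens the window of recent writes lost on a crash
mount -o remount,commit=120 /mnt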


* The autodefrag mount option may be considered if you do a lot of
existing file updates, as is common with database or VM image files.  Due
to COW this triggers high fragmentation on btrfs, and autodefrag should
help control that.  Note that autodefrag effectively increases the
minimum extent size from 4 KiB to, IIRC, 16 MB, tho it may be less, and
doesn't operate at whole-file size, so larger repeatedly-modified files
will still have some fragmentation, just not as much.  Obviously, you
wouldn't see the read-time effects of this until the filesystem has aged
somewhat, so it may not show up on your benchmarks.

(Another option for such files is setting them nocow or using the
nodatacow mount option, but this turns off checksumming and, if enabled,
compression for those files, and has a few other non-obvious caveats as
well, so isn't something I recommend.  Instead of using nocow, I'd
suggest putting such files on a dedicated traditional non-cow filesystem
such as ext4, and I consider nocow at best a workaround option for those
who prefer to use btrfs as a single big storage pool and thus don't want
to do the dedicated non-cow filesystem for some subset of their files.)
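For completeness, if someone does go the nocow route for specific paths
rather than filesystem-wide, the usual method is chattr +C on a new,
empty directory so files created inside it inherit the flag.  A quick
sketch, with placeholder paths:

# new files created under this directory will be NOCOW; the flag does
# not retroactively affect existing, non-empty files
mkdir /mnt/vm-images
chattr +C /mnt/vm-images
lsattr -d /mnt/vm-images    # verify: a 'C' should appear in the flags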

* Not really for reads but for btrfs and any cow-based filesystem, you
almost certainly want the (not btrfs specific) noatime mount option.
Actually...  This can help a bit for some workloads.  Just like the 
commit time, it comes down to a matter of overhead.  Essentially, if you 
read a file regularly, then with the default of relatime, you've got a 
guaranteed write requiring a commit of the metadata tree once every 24 
hours.  It's not much to worry about for just one file, but if you're 
reading a very large number of files all the time, it can really add up.
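To make that concrete, a hedged example fstab line pulling together the
options discussed so far (device, mount point, and values are purely
illustrative, not recommendations):

# /etc/fstab
/dev/md0  /data  btrfs  noatime,space_cache=v2,commit=120  0  0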


* While it has serious filesystem integrity implications and thus can't
be responsibly recommended, there is the nobarrier mount option.  But if
you're already running raid0 on a large number of devices you're already
gambling with device stability, and this /might/ be an additional risk
you're willing to take, as it should increase performance.  But for
normal users it's simply not worth the risk, and if you do choose to use
it, it's at your own risk.
Agreed, if you're running RAID0 with this many drives, nobarrier may be 
worth it for a 

Re: Understanding BTRFS RAID0 Performance

2018-10-05 Thread Duncan
Wilson, Ellis posted on Fri, 05 Oct 2018 15:29:52 +0000 as excerpted:

> Is there any tuning in BTRFS that limits the number of outstanding reads
> at a time to a small single-digit number, or something else that could
> be behind small queue depths?  I can't otherwise imagine what the
> difference would be on the read path between ext4 vs btrfs when both are
> on mdraid.

It seems I forgot to directly answer that question in my first reply.  
Thanks for restating it.

Btrfs doesn't really expose much performance tuning (yet?), at least 
outside the code itself.  There are a few very limited knobs, but they're 
just that, few and limited or broad-stroke.

There are mount options like ssd/nossd, ssd_spread/nossd_spread, the 
space_cache set of options (see below), flushoncommit/noflushoncommit, 
commit=<seconds>, etc. (see the btrfs(5) manpage), but nothing really to 
influence stride length, etc, or to optimize chunk placement between ssd 
and non-ssd devices, for instance.

And there are a few filesystem features, normally set at mkfs.btrfs time 
(and thus covered in the mkfs.btrfs manpage), some of which can be 
tuned later.  Generally, the defaults have changed over time to 
reflect the best case, and the older variants are there primarily to 
retain backward compatibility with old kernels and tools that didn't 
handle the newer variants.

That said, as I think about it there are some tunables that may be worth 
experimenting with.  Most or all of these are covered in the btrfs (5) 
manpage.

* Given the large device numbers you mention and raid0, you're likely 
dealing with multi-TB-scale filesystems.  At this level, the 
space_cache=v2 mount option may be useful.  It's not the default yet as 
btrfs check, etc, don't yet handle it, but given your raid0 choice you 
may not be concerned about that.  It need only be given once, after which 
v2 stays "on" for the filesystem until explicitly turned off.

* Consider experimenting with the thread_pool=n mount option.  I've seen 
very little discussion of this one, but given your interest in 
parallelization, it could make a difference.

* Possibly the commit=<seconds> (default 30) mount option.  In theory, 
upping this may allow better write merging, tho your interest seems to be 
more on the read side, and the commit time has consequences at crash time.

* The autodefrag mount option may be considered if you do a lot of 
existing file updates, as is common with database or VM image files.  Due 
to COW this triggers high fragmentation on btrfs, and autodefrag should 
help control that.  Note that autodefrag effectively increases the 
minimum extent size from 4 KiB to, IIRC, 16 MB, tho it may be less, and 
doesn't operate at whole-file size, so larger repeatedly-modified files 
will still have some fragmentation, just not as much.  Obviously, you 
wouldn't see the read-time effects of this until the filesystem has aged 
somewhat, so it may not show up on your benchmarks.

(Another option for such files is setting them nocow or using the 
nodatacow mount option, but this turns off checksumming and, if enabled, 
compression for those files, and has a few other non-obvious caveats as 
well, so isn't something I recommend.  Instead of using nocow, I'd 
suggest putting such files on a dedicated traditional non-cow filesystem 
such as ext4, and I consider nocow at best a workaround option for those 
who prefer to use btrfs as a single big storage pool and thus don't want 
to do the dedicated non-cow filesystem for some subset of their files.)

* Not really for reads but for btrfs and any cow-based filesystem, you 
almost certainly want the (not btrfs specific) noatime mount option.

* While it has serious filesystem integrity implications and thus can't 
be responsibly recommended, there is the nobarrier mount option.  But if 
you're already running raid0 on a large number of devices you're already 
gambling with device stability, and this /might/ be an additional risk 
you're willing to take, as it should increase performance.  But for 
normal users it's simply not worth the risk, and if you do choose to use 
it, it's at your own risk.

* If you're enabling the discard mount option, consider trying with it 
off, as it can affect performance if your devices don't support queued-
trim.  The alternative is fstrim, presumably scheduled to run once a week 
or so.  (The util-linux package includes an fstrim systemd timer and 
service set to run once a week.  You can activate that, or set up an 
equivalent cron job if you're not on systemd.)
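Concretely, assuming your distro ships the util-linux units, that looks
something like:

# systemd: enable the weekly timer and check when it will next fire
systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer

# non-systemd alternative: run this from a weekly cron job
fstrim -av    # trim all mounted filesystems that support it, verbosely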

* For filesystem features you may look at no_holes and skinny_metadata.  
These are both quite stable and at least skinny-metadata is now the 
default.  These are normally set at mkfs.btrfs time, but can be modified 
later.  Setting at mkfs time should be more efficient.
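Roughly, and double-check the flags against your btrfs-progs version,
that means something like (device is a placeholder; btrfstune needs the
filesystem unmounted):

mkfs.btrfs -O list-all                            # list available features and defaults
mkfs.btrfs -O no-holes,skinny-metadata /dev/md0   # set both at mkfs time
btrfstune -n /dev/md0     # or enable no-holes on an existing filesystem
btrfstune -x /dev/md0     # enable skinny metadata extent refs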

* At mkfs.btrfs time, you can set the metadata node size with --nodesize.  The newer default 
is 16 KiB, while the old default was the (minimum for amd64/x86) 4 KiB, 
and the maximum is 64 KiB.  See the mkfs.btrfs manpage for 

Re: Understanding BTRFS RAID0 Performance

2018-10-05 Thread Wilson, Ellis
On 10/05/2018 06:40 AM, Duncan wrote:
> Wilson, Ellis posted on Thu, 04 Oct 2018 21:33:29 +0000 as excerpted:
> 
>> Hi all,
>>
>> I'm attempting to understand a roughly 30% degradation in BTRFS RAID0
>> for large read I/Os across six disks compared with ext4 atop mdadm
>> RAID0.
>>
>> Specifically, I achieve performance parity with BTRFS in terms of
>> single-threaded write and read, and multi-threaded write, but poor
>> performance for multi-threaded read.  The relative discrepancy appears
>> to grow as one adds disks.
> 
> [...]
> 
>> Before I dive into the BTRFS source or try tracing in a different way, I
>> wanted to see if this was a well-known artifact of BTRFS RAID0 and, even
>> better, if there's any tunables available for RAID0 in BTRFS I could
>> play with.  The man page for mkfs.btrfs and btrfstune in the tuning
>> regard seemed...sparse.
> 
> This is indeed well known for btrfs at this point, as it hasn't been
> multi-read-thread optimized yet.  I'm personally more familiar with the
> raid1 case, where which one of the two copies gets the read is simply
> even/odd-PID-based, but AFAIK raid0 isn't particularly optimized either.
> 
> The recommended workaround is (as you might expect) btrfs on top of
> mdraid.  In fact, while it doesn't apply to your case, btrfs raid1 on top
> of mdraid0s is often recommended as an alternative to btrfs raid10, as
> that gives you the best of both worlds -- the data and metadata integrity
> protection of btrfs checksums and fallback (with writeback of the correct
> version) to the other copy if the first copy read fails checksum
> verification, with the much better optimized mdraid0 performance.  So it
> stands to reason that the same recommendation would apply to raid0 --
> just do single-mode btrfs on mdraid0, for better performance than the as
> yet unoptimized btrfs raid0.

Thank you very much, Duncan.  I failed to mention that I'd tried this 
before as well, but was hoping to avoid it since it felt like a kludge, 
and it didn't give me the big jump I expected, so I forgot about it.

I retested and btrfs on mdraid in a six-wide RAID0 does improve 
performance slightly -- I see typically 990MB/s, and up to around 
1.1GB/s in the best case.  Same options to fio as my original email. 
Still a ways away from ext4 (which admittedly may be cheating a bit 
since it seems to detect the md0 underneath it and adjust its stride 
length accordingly, though I may be over-representing its intelligence 
about this).

The I/O sizes improve greatly to parity with ext4 atop mdraid, but the 
queue depth is still fairly low -- even with many processes it rarely 
exceeds 5 or 6.  This is true whether I run fio with or without the aio ioengine.

Is there any tuning in BTRFS that limits the number of outstanding reads 
at a time to a small single-digit number, or something else that could 
be behind small queue depths?  I can't otherwise imagine what the 
difference would be on the read path between ext4 vs btrfs when both are 
on mdraid.

Thanks again for your insights,

ellis


Re: Understanding BTRFS RAID0 Performance

2018-10-05 Thread Duncan
Wilson, Ellis posted on Thu, 04 Oct 2018 21:33:29 +0000 as excerpted:

> Hi all,
> 
> I'm attempting to understand a roughly 30% degradation in BTRFS RAID0
> for large read I/Os across six disks compared with ext4 atop mdadm
> RAID0.
> 
> Specifically, I achieve performance parity with BTRFS in terms of
> single-threaded write and read, and multi-threaded write, but poor
> performance for multi-threaded read.  The relative discrepancy appears
> to grow as one adds disks.

[...]

> Before I dive into the BTRFS source or try tracing in a different way, I
> wanted to see if this was a well-known artifact of BTRFS RAID0 and, even
> better, if there's any tunables available for RAID0 in BTRFS I could
> play with.  The man page for mkfs.btrfs and btrfstune in the tuning
> regard seemed...sparse.

This is indeed well known for btrfs at this point, as it hasn't been 
multi-read-thread optimized yet.  I'm personally more familiar with the 
raid1 case, where which one of the two copies gets the read is simply 
even/odd-PID-based, but AFAIK raid0 isn't particularly optimized either.

The recommended workaround is (as you might expect) btrfs on top of 
mdraid.  In fact, while it doesn't apply to your case, btrfs raid1 on top 
of mdraid0s is often recommended as an alternative to btrfs raid10, as 
that gives you the best of both worlds -- the data and metadata integrity 
protection of btrfs checksums and fallback (with writeback of the correct 
version) to the other copy if the first copy read fails checksum 
verification, with the much better optimized mdraid0 performance.  So it 
stands to reason that the same recommendation would apply to raid0 -- 
just do single-mode btrfs on mdraid0, for better performance than the as 
yet unoptimized btrfs raid0.
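A rough sketch of that layout, with six placeholder member devices
(chunk size left at mdadm's default, metadata profile chosen purely for
illustration):

mdadm --create /dev/md0 --level=0 --raid-devices=6 /dev/sd[b-g]
mkfs.btrfs -d single -m dup /dev/md0   # data "single", duplicated metadata
mount /dev/md0 /mnt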

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Understanding BTRFS RAID0 Performance

2018-10-05 Thread Nikolay Borisov



On  5.10.2018 00:33, Wilson, Ellis wrote:
> Hi all,
> 
> I'm attempting to understand a roughly 30% degradation in BTRFS RAID0 
> for large read I/Os across six disks compared with ext4 atop mdadm RAID0.
> 
> Specifically, I achieve performance parity with BTRFS in terms of 
> single-threaded write and read, and multi-threaded write, but poor 
> performance for multi-threaded read.  The relative discrepancy appears 
> to grow as one adds disks.  At 6 disks in a RAID0 (yes, I know, and I do 
> not care about data persistence as I have this solved at a different 
> layer) I see approximately 1.3GB/s for ext4 atop mdadm, but only about 
> 950MB/s for BTRFS, both using four threads to read and write four 
> different large files.  Across a large number of my nodes this 
> aggregates to a sizable performance loss.
> 
> This has been a long and winding road for me, but to keep my question 
> somewhat succinct, I'm down to the level of block tracing, and one thing 
> that stands out between the two traces is that the number of rather small 
> read I/Os reaching one of the drives in the test is vastly different 
> for mdadm RAID0 vs BTRFS, which I think explains (in part at least) the 
> performance drop-off.  The read queue depth for BTRFS hovers in the 
> upper single digits while the ext4/mdadm queue depth is towards 20.  I'm 
> unsure right now if this is related or not.
> 
> Benchmark: FIO was used with the following command:
> fio --name=read --rw=read --bs=1M --direct=0 --size=16G --numjobs=4 
> --runtime=120 --group_reporting

Right, so you are doing sequential reads.  Btrfs uses
generic_file_read_iter for its read file operation, which ends up
calling btrfs_readpage, and that goes through:

btrfs_readpage
  extent_read_full_page
    __extent_read_full_page
      __do_readpage
        submit_extent_page <- here we have some code which is supposed
                              to detect contiguous bios and merge them

So my first guess would be to instrument the code around the merging
logic and see if it works as expected and is able to merge the majority
of the bios.
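Short of patching the kernel, one low-effort cross-check from userspace
is to watch the block-layer merge tracepoints while the fio job runs.
That observes bio merging at the block layer rather than inside
submit_extent_page itself, so treat it only as a proxy:

# count queued bios and front/back merges system-wide for 30 seconds
perf stat -e block:block_bio_queue \
          -e block:block_bio_backmerge \
          -e block:block_bio_frontmerge \
          -a -- sleep 30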

> 
> The block sizes and counts of I/Os at that size I'm seeing for both 
> cases come in like the following (my max_segment_kb_size is 4K, hence 
> the above typical upper-end):
> 
> BTRFS:
>   Count  Read I/O Size
>   21849  128
>      18  640
>       9  768
>       3  1280
>       9  1408
>       3  2048
>       3  2560
>    1011  2688
>     507  2816
> 
> ext4 on mdadm RAID0:
>   Count  Read I/O Size
>       9  8
>       3  16
>       5  256
>       5  768
>      19  1024
>     716  1536
>       5  1592
>       5  2504
>     695  2560
>      24  4096
>      21  6656
>     477  8192
> 
> Before I dive into the BTRFS source or try tracing in a different way, I 
> wanted to see if this was a well-known artifact of BTRFS RAID0 and, even 
> better, if there's any tunables available for RAID0 in BTRFS I could 
> play with.  The man page for mkfs.btrfs and btrfstune in the tuning 
> regard seemed...sparse.
> 
> Any help or pointers are greatly appreciated!
> 
> Thanks,
> 
> ellis
> 


Understanding BTRFS RAID0 Performance

2018-10-04 Thread Wilson, Ellis
Hi all,

I'm attempting to understand a roughly 30% degradation in BTRFS RAID0 
for large read I/Os across six disks compared with ext4 atop mdadm RAID0.

Specifically, I achieve performance parity with BTRFS in terms of 
single-threaded write and read, and multi-threaded write, but poor 
performance for multi-threaded read.  The relative discrepancy appears 
to grow as one adds disks.  At 6 disks in a RAID0 (yes, I know, and I do 
not care about data persistence as I have this solved at a different 
layer) I see approximately 1.3GB/s for ext4 atop mdadm, but only about 
950MB/s for BTRFS, both using four threads to read and write four 
different large files.  Across a large number of my nodes this 
aggregates to a sizable performance loss.

This has been a long and winding road for me, but to keep my question 
somewhat succinct, I'm down to the level of block tracing, and one thing 
that stands out between the two traces is that the number of rather small 
read I/Os reaching one of the drives in the test is vastly different 
for mdadm RAID0 vs BTRFS, which I think explains (in part at least) the 
performance drop-off.  The read queue depth for BTRFS hovers in the 
upper single digits while the ext4/mdadm queue depth is towards 20.  I'm 
unsure right now if this is related or not.
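For reference, this kind of per-device breakdown can be captured with
standard block-layer tooling; the invocation below is only illustrative,
and the device name is a placeholder:

# capture 30 seconds of block-layer events on one member disk
blktrace -d /dev/sdb -w 30 -o sdb
# human-readable per-I/O listing; sizes are reported in 512-byte sectors
blkparse -i sdb
# quick queue-depth/utilization view while the fio job runs
iostat -x 1 /dev/sdb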

Benchmark: FIO was used with the following command:
fio --name=read --rw=read --bs=1M --direct=0 --size=16G --numjobs=4 
--runtime=120 --group_reporting

The block sizes and counts of I/Os at that size I'm seeing for both 
cases come in like the following (my max_segment_kb_size is 4K, hence 
the above typical upper-end):

BTRFS:
  Count  Read I/O Size
  21849  128
     18  640
      9  768
      3  1280
      9  1408
      3  2048
      3  2560
   1011  2688
    507  2816

ext4 on mdadm RAID0:
  Count  Read I/O Size
      9  8
      3  16
      5  256
      5  768
     19  1024
    716  1536
      5  1592
      5  2504
    695  2560
     24  4096
     21  6656
    477  8192

Before I dive into the BTRFS source or try tracing in a different way, I 
wanted to see if this was a well-known artifact of BTRFS RAID0 and, even 
better, if there's any tunables available for RAID0 in BTRFS I could 
play with.  The man page for mkfs.btrfs and btrfstune in the tuning 
regard seemed...sparse.

Any help or pointers are greatly appreciated!

Thanks,

ellis