Re: btrfs raid10 performance

2018-06-26 Thread Austin S. Hemmelgarn

On 2018-06-25 21:05, Sterling Windmill wrote:

I am running a single btrfs RAID10 volume of eight LUKS devices, each
using a 2TB SATA hard drive as a backing store. The SATA drives are a
mixture of Seagate and Western Digital models with spindle speeds
ranging from 5400 to 7200 RPM. Each individually benchmarks about where
I would expect for drives of this caliber. They are all attached to an
LSI PCIe SAS controller and configured as JBOD.

I have a relatively beefy quad core Xeon CPU that supports AES-NI and
don't think LUKS is my bottleneck.

Here's some info from the resulting filesystem:

   btrfs fi df /storage
   Data, RAID10: total=6.30TiB, used=6.29TiB
   System, RAID10: total=8.00MiB, used=560.00KiB
   Metadata, RAID10: total=9.00GiB, used=7.64GiB
   GlobalReserve, single: total=512.00MiB, used=0.00B

In general I see good performance, especially read performance which
is enough to regularly saturate my gigabit network when copying files
from this host via samba. Reads are definitely taking advantage of the
multiple copies of data available and spreading the load among all
drives.

Writes aren't quite as rosy, however.

When writing files using dd like in this example:

   dd if=/dev/zero of=tempfile bs=1M count=10240 conv=fdatasync,notrunc status=progress

And running a command like:

   iostat -m 1

to monitor disk I/O, writes seem to only focus on one of the eight
disks at a time, moving from one drive to the next. This results in a
sustained 55-90 MB/sec throughput depending on which disk is being
written to (remember, some have faster spindle speed than others).

Am I wrong to expect btrfs' RAID10 mode to write to multiple disks
simultaneously and to break larger writes into smaller stripes across
my four pairs of disks?

I had trouble identifying whether btrfs RAID10 is writing (64K?)
stripes or (1GB?) blocks to disk in this mode. The latter might make
more sense based upon what I'm seeing?

Anything else I should be trying to narrow down the bottleneck?

First, your expectation that the disk access gets parallelized is 
probably where things break down.  Given that BTRFS still doesn't 
parallelize writes in raid1 mode, I very much doubt it does so in 
raid10 mode.  Parallelizing writes is a performance optimization that 
nobody has really tackled yet.  Realistically, BTRFS writes to exactly 
one disk at a time.  So, in a four-disk raid10 array, it first writes 
to disk 1, waits for that to finish, then writes to disk 2, waits for 
that to finish, then disk 3, waits, and finally disk 4.  Overall, this 
makes writes rather slow.
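
If you want to see that on your system, extended iostat output makes it 
pretty obvious (this is the same sysstat iostat you're already running, 
just with per-device extended statistics added):

   iostat -dmx 1

While the dd is running, the wMB/s and %util columns should show 
roughly one backing device busy at a time rather than all eight.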


As far as striping across multiple disks, yes, that does happen.  The 
specifics of this are a bit complicated though, and require explaining a 
bit about how BTRFS works in general.


BTRFS uses a two-stage allocator: it first allocates 'large' regions of 
disk space, called chunks, to be used for a specific type of data, and 
then allocates blocks out of those regions to actually store the data. 
There are three chunk types: data (used for storing actual file 
contents), metadata (used for storing things like filenames, access 
times, and directory structure), and system (used to store the 
allocation information for all the other chunks in the filesystem). 
Data chunks are typically 1 GB in size, metadata chunks are typically 
256 MB, and system chunks are highly variable but don't really matter 
for this explanation.  The chunk level is where the actual replication 
and striping happen, and the chunk size represents what is exposed to 
the block allocator (so every 1 GB data chunk exposes 1 GB of space to 
the block allocator).
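
If you want to look at the chunk-level allocation on your filesystem, 
reasonably recent btrfs-progs can show it per device (using your 
/storage mount point from above; exact output varies by version):

   btrfs filesystem usage -T /storage

The tabular output shows how much of each device is currently allocated 
to Data, Metadata, and System chunks.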


Now, replicated (raid1 or dup profiles) chunks work just like you would 
expect: each of the two allocations for the chunk is 1 GB, and each 
byte is stored as-is in both.  Striped chunks (raid0 or raid10 
profiles) are somewhat more complicated, and I actually don't know 
exactly how they end up allocated at the lower level.  However, I do 
know how the striping works.  In short, you can treat each striped set 
(either a full raid0 chunk, or half a raid10 chunk) as functionally 
identical in operation to a conventional RAID0 array.  Striping occurs 
at a fairly small granularity (a 64 KiB stripe element, so the 64K from 
your question rather than 1 GB), which unfortunately compounds the 
performance issues caused by BTRFS only writing to one disk at a time.
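
To make the striping concrete, here's a purely illustrative shell 
calculation (a toy model, assuming the 64 KiB stripe element mentioned 
above) showing how consecutive pieces of a 1 MiB write rotate across 
the four devices of one striped set, just like a conventional RAID0:

   # Toy model only, not btrfs code: map consecutive 64 KiB stripe
   # elements of a 1 MiB write onto the 4 devices of one striped set.
   stripe_kib=64
   devices=4
   for off in $(seq 0 "$stripe_kib" $((1024 - stripe_kib))); do
       echo "offset ${off}KiB -> device $(( (off / stripe_kib) % devices ))"
   done

The other striped set (the mirror half of the raid10 chunk) gets an 
identical copy of each element.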


As far as improving performance goes, I've got two suggestions for 
alternative storage arrangements:


* If you want to just stick with only BTRFS for storage, try just using 
raid1 mode.  It will give you the same theoretical total capacity as 
raid10 does and will slow down reads somewhat, but should speed up 
writes significantly (because you're only writing to two devices, not 
striping across two sets of four).
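
If you go that route, the conversion can be done in place with a 
balance (it will rewrite all ~6.3 TiB of data, so expect it to take a 
long time, and make sure your backups are current first):

   btrfs balance start -dconvert=raid1 -mconvert=raid1 /storage

You can check on it with 'btrfs balance status /storage' while it runs.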


* If you're willing to try something a bit different, convert your 
storage array to two LVM or MD RAID0 volumes composed of four devices 
each, and then run BTRFS in raid1 mode on top of those two RAID0 
volumes.
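
A rough sketch of that layout, with hypothetical names for the eight 
LUKS mappings (this means recreating the filesystem, so it's only an 
option if you can restore everything from backup):

   # Two 4-device RAID0 sets built from the LUKS-mapped drives
   mdadm --create /dev/md0 --level=0 --raid-devices=4 \
         /dev/mapper/crypt0 /dev/mapper/crypt1 /dev/mapper/crypt2 /dev/mapper/crypt3
   mdadm --create /dev/md1 --level=0 --raid-devices=4 \
         /dev/mapper/crypt4 /dev/mapper/crypt5 /dev/mapper/crypt6 /dev/mapper/crypt7
   # BTRFS raid1 across the two RAID0 volumes
   mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1

MD handles the striping (and will actually issue writes to its member 
devices in parallel), while BTRFS raid1 on top still gives you two 
checksummed copies of everything.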
