Re: btrfs performance - ssd array

2015-01-17 Thread Wang Shilong
Hello,

 
 Hello,
 
 
 Could you check how many extents there are with BTRFS and Ext4:
 # filefrag test1
 
 So my findings are odd:
 
 On BTRFS, when I run fio with a single worker thread (target file is
 12 GB large, and it's 100% random writes of 4 kB blocks), the number of
 extents reported by filefrag is around 3.
 However, when I do the same with 4 worker threads, I get some crazy
 number of extents - test1: 3141866 extents found. Also, when running
 with 4 threads, when I check CPU, sys% utilization takes 80% of the CPU
 (in the top output I see that it is all consumed by kworker processes).
 
 On EXT4 I get only 13 extents even when running with 4 worker
 threads (note that I created RAID10 using mdadm before setting up
 ext4 there, in order to get a storage setup comparable to what we
 test with BTRFS).
 
 Another odd thing is that it takes a very long time for the filefrag
 utility to return the result on BTRFS, and not only for the case
 where I got 3 million extents but also for the first case where I
 ran a single worker and the number of extents was only 3. Filefrag on
 EXT4 returns immediately.

So this looks like a Btrfs lock contention problem.

Take a look at btrfs_drop_extents(): even with nodatacow,
there will still be many item removals in the FS tree.

Unfortunately, Btrfs lock contention seems very sensitive
to item removal operations.

And your SSDs are very fast? Hmm, I am not sure what to make of
your IOPS numbers.

You could verify this by reducing the number of threads, for example
by comparing against the 1-thread results. Also, I guess Btrfs sequential
write performance should be less affected here…
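
For example (just a sketch, reusing the options from your original fio
command, with a single job and then a sequential write run for comparison):

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G \
    --numjobs=1 --readwrite=randwrite

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G \
    --numjobs=1 --readwrite=write

If the 1-job results stay reasonable while the multi-job results collapse,
that points at lock contention in the filesystem rather than at the devices.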
 
 
 This is to see whether bad fragmentation is the problem for BTRFS. I am still
 not sure how fio tests random writes here, so there are two possibilities:
 
 case1:
 if fio does not write to the same position several times, I think
 you could add --overwrite=0 and retest to see if it helps.
 
 Not sure which parameter you mean here.


I mean ‘--overwrite’, which is an option for fio.

 
 case2:
if fio randwrite does write to the same position several times, I think
you could use the ‘-o nodatacow’ mount option to verify whether this is
because BTRFS COW caused serious fragmentation.
 
 
 It seems that mounting with this option does have some effect, but
 it is not very significant and not very deterministic. The IOPs are
 slightly higher at the beginning (~25 000 IOPs) but the IOPs performance
 is very spiky and I can still see that CPU sys% is very high. As soon
 as the kworker threads start consuming CPU, the IOPs performance goes
 down again to some ~15 000 IOPs.

Best Regards,
Wang Shilong



Re: btrfs performance - ssd array

2015-01-15 Thread P. Remek
Hello,


 Could you check how many extents there are with BTRFS and Ext4:
 # filefrag test1

So my findings are odd:

On BTRFS, when I run fio with a single worker thread (target file is
12 GB large, and it's 100% random writes of 4 kB blocks), the number of
extents reported by filefrag is around 3.
However, when I do the same with 4 worker threads, I get some crazy
number of extents - test1: 3141866 extents found. Also, when running
with 4 threads, when I check CPU, sys% utilization takes 80% of the CPU
(in the top output I see that it is all consumed by kworker processes).

On EXT4 I get only 13 extents even when running with 4 worker
threads (note that I created RAID10 using mdadm before setting up
ext4 there, in order to get a storage setup comparable to what we
test with BTRFS).

Another odd thing is that it takes a very long time for the filefrag
utility to return the result on BTRFS, and not only for the case
where I got 3 million extents but also for the first case where I
ran a single worker and the number of extents was only 3. Filefrag on
EXT4 returns immediately.


 This is to see whether bad fragmentation is the problem for BTRFS. I am still
 not sure how fio tests random writes here, so there are two possibilities:

 case1:
  if fio does not write to the same position several times, I think
  you could add --overwrite=0 and retest to see if it helps.

Not sure which parameter you mean here.

 case2:
 if fio randwrite does write to the same position several times, I think
 you could use the ‘-o nodatacow’ mount option to verify whether this is
 because BTRFS COW caused serious fragmentation.


It seems that mounting with this option does have some effect, but
it is not very significant and not very deterministic. The IOPs are
slightly higher at the beginning (~25 000 IOPs) but the IOPs performance
is very spiky and I can still see that CPU sys% is very high. As soon
as the kworker threads start consuming CPU, the IOPs performance goes
down again to some ~15 000 IOPs.


btrfs performance - ssd array

2015-01-12 Thread P. Remek
Hello,

we are currently investigating the possibilities and performance limits of
the Btrfs filesystem. It seems we are getting pretty poor write
performance, and I would like to ask whether our results make sense
and whether they are the result of some well-known performance
bottleneck.

Our setup:

Server:
   CPU: dual socket: E5-2630 v2
   RAM: 32 GB ram
   OS: Ubuntu server 14.10
   Kernel: 3.19.0-031900rc2-generic
   btrfs tools: Btrfs v3.14.1
   2x LSI 9300 HBAs - SAS3 12 Gb/s
   8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s

Both HBAs see all 8 disks, and we have set up multipathing using the
multipath command and device mapper. We then use this command to
create the filesystem:

mkfs.btrfs -f -d raid10 /dev/mapper/prm-0 /dev/mapper/prm-1
/dev/mapper/prm-2 /dev/mapper/prm-3 /dev/mapper/prm-4
/dev/mapper/prm-5 /dev/mapper/prm-6 /dev/mapper/prm-7


We run the performance test using the following command:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G
--numjobs=24 --readwrite=randwrite


The results for random read are more or less comparable with the
performance of the EXT4 filesystem; we get approximately 300 000 IOPs for
random read.

For random write, however, we are getting only about 15 000 IOPs, which
is much lower than for EXT4 (~200 000 IOPs for RAID10).


Regards,
Premek


Re: btrfs performance - ssd array

2015-01-12 Thread Austin S Hemmelgarn
On 2015-01-12 08:51, P. Remek wrote:
 Hello,
 
 we are currently investigating the possibilities and performance limits of
 the Btrfs filesystem. It seems we are getting pretty poor write
 performance, and I would like to ask whether our results make sense
 and whether they are the result of some well-known performance
 bottleneck.
 
 Our setup:
 
 Server:
 CPU: dual socket: E5-2630 v2
 RAM: 32 GB ram
 OS: Ubuntu server 14.10
 Kernel: 3.19.0-031900rc2-generic
 btrfs tools: Btrfs v3.14.1
 2x LSI 9300 HBAs - SAS3 12 Gb/s
 8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s
 
 Both HBAs see all 8 disks, and we have set up multipathing using the
 multipath command and device mapper. We then use this command to
 create the filesystem:
 
 mkfs.btrfs -f -d raid10 /dev/mapper/prm-0 /dev/mapper/prm-1
 /dev/mapper/prm-2 /dev/mapper/prm-3 /dev/mapper/prm-4
 /dev/mapper/prm-5 /dev/mapper/prm-6 /dev/mapper/prm-7
You almost certainly DO NOT want to use BTRFS raid10 unless you have known good 
backups and are willing to deal with the downtime associated with restoring 
them.  The current incarnation of raid10 in BTRFS is much worse than LVM/MD 
based soft-raid with respect to data recoverability.  I would suggest using 
BTRFS raid1 in this case (which behaves like MD-RAID10 when used with more than 
2 devices), possibly on top of LVM/MD RAID0 if you really need the performance.
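
As a rough sketch of that last suggestion (assuming the same eight
/dev/mapper/prm-* multipath devices from your mkfs command; adjust names as
needed), two MD RAID0 stripes mirrored by BTRFS raid1 could look like:

mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/mapper/prm-0 /dev/mapper/prm-1 /dev/mapper/prm-2 /dev/mapper/prm-3
mdadm --create /dev/md1 --level=0 --raid-devices=4 \
    /dev/mapper/prm-4 /dev/mapper/prm-5 /dev/mapper/prm-6 /dev/mapper/prm-7
mkfs.btrfs -f -d raid1 -m raid1 /dev/md0 /dev/md1

That keeps BTRFS checksumming and self-repair at the mirror level while
leaving the striping to MD.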
 
 
 We run the performance test using the following command:
 
 fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
 --name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G
 --numjobs=24 --readwrite=randwrite
 
 
 The results for random read are more or less comparable with the
 performance of the EXT4 filesystem; we get approximately 300 000 IOPs for
 random read.
 
 For random write, however, we are getting only about 15 000 IOPs, which
 is much lower than for EXT4 (~200 000 IOPs for RAID10).


While I don't have any conclusive numbers, I have noticed myself that random
write based AIO on BTRFS does tend to be slower than on other filesystems.  Also,
LVM/MD based RAID10 does outperform BTRFS' raid10 implementation, and probably 
will for quite a while; however, I've also noticed that faster RAM does provide 
a bigger benefit for BTRFS than it does for LVM (~2.5% greater performance for 
BTRFS than for LVM when switching from DDR3-1333 to DDR3-1600 on otherwise 
identical hardware), so you might consider looking into that.

Another thing to consider is that the kernel's default I/O scheduler and the 
default parameters for that I/O scheduler are almost always suboptimal for 
SSD's, and this tends to show far more with BTRFS than anything else.  
Personally I've found that using the CFQ I/O scheduler with the following 
parameters works best for a majority of SSD's:
1. slice_idle=0
2. back_seek_penalty=1
3. back_seek_max set equal to the size in sectors of the device
4. nr_requests and quantum set to the hardware command queue depth

You can easily set these persistently for a given device with a udev rule like 
this:
  KERNEL=='sda', SUBSYSTEM=='block', ACTION=='add', 
ATTR{queue/scheduler}='cfq', ATTR{queue/iosched/back_seek_penalty}='1', 
ATTR{queue/iosched/back_seek_max}='device_size', 
ATTR{queue/iosched/quantum}='128', ATTR{queue/iosched/slice_idle}='0', 
ATTR{queue/nr_requests}='128'

Make sure to replace '128' in the rule with whatever the command queue depth is 
for the device in question (It's usually 128 or 256, occasionally more), and 
device_size with the size of the device in kibibytes.
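
If you just want to experiment before committing to a udev rule, the same
knobs can also be set at runtime through sysfs (a sketch for sda, using the
same device_size placeholder and queue depth value as in the rule above;
these settings do not survive a reboot):

echo cfq > /sys/block/sda/queue/scheduler
echo 0 > /sys/block/sda/queue/iosched/slice_idle
echo 1 > /sys/block/sda/queue/iosched/back_seek_penalty
echo device_size > /sys/block/sda/queue/iosched/back_seek_max
echo 128 > /sys/block/sda/queue/iosched/quantum
echo 128 > /sys/block/sda/queue/nr_requests

Set the scheduler first, since switching schedulers re-creates the iosched/
tunables with their default values.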






Re: btrfs performance - ssd array

2015-01-12 Thread Patrik Lundquist
On 12 January 2015 at 15:54, Austin S Hemmelgarn ahferro...@gmail.com wrote:

 Another thing to consider is that the kernel's default I/O scheduler and the 
 default parameters for that I/O scheduler are almost always suboptimal for 
 SSD's, and this tends to show far more with BTRFS than anything else.  
 Personally I've found that using the CFQ I/O scheduler with the following 
 parameters works best for a majority of SSD's:
 1. slice_idle=0
 2. back_seek_penalty=1
 3. back_seek_max set equal to the size in sectors of the device
 4. nr_requests and quantum set to the hardware command queue depth

 You can easily set these persistently for a given device with a udev rule 
 like this:
   KERNEL=='sda', SUBSYSTEM=='block', ACTION=='add', 
 ATTR{queue/scheduler}='cfq', ATTR{queue/iosched/back_seek_penalty}='1', 
 ATTR{queue/iosched/back_seek_max}='device_size', 
 ATTR{queue/iosched/quantum}='128', ATTR{queue/iosched/slice_idle}='0', 
 ATTR{queue/nr_requests}='128'

 Make sure to replace '128' in the rule with whatever the command queue depth 
 is for the device in question (It's usually 128 or 256, occasionally more), 
 and device_size with the size of the device in kibibytes.


So is it size in sectors of the device or size of the device in
kibibytes for back_seek_max? :-)


Re: btrfs performance - ssd array

2015-01-12 Thread P. Remek
Another thing to consider is that the kernel's default I/O scheduler and the 
default parameters for that I/O scheduler are almost always suboptimal for 
SSD's, and this tends to show far more with BTRFS than anything else.  
Personally I've found that using the CFQ I/O scheduler with the following 
parameters works best for a majority of SSD's:
1. slice_idle=0
2. back_seek_penalty=1
3. back_seek_max set equal to the size in sectors of the device
4. nr_requests and quantum set to the hardware command queue depth

I will give these suggestions a try, but I don't expect any big gain.
Notice that the difference between EXT4 and BTRFS random write is
massive - it's 200 000 IOPs vs. 15 000 IOPs, and the device and kernel
parameters are exactly the same (it is the same machine) for both test
scenarios. This suggests that something is dragging down write performance
in the Btrfs implementation.

Notice also that we did some performance tuning (queue scheduling set
to noop, IRQ affinity distribution and pinning to specific NUMA nodes
and cores, etc.).

Regards,
Premek


On Mon, Jan 12, 2015 at 3:54 PM, Austin S Hemmelgarn
ahferro...@gmail.com wrote:
 On 2015-01-12 08:51, P. Remek wrote:
 Hello,

 we are currently investigating the possibilities and performance limits of
 the Btrfs filesystem. It seems we are getting pretty poor write
 performance, and I would like to ask whether our results make sense
 and whether they are the result of some well-known performance
 bottleneck.

 Our setup:

 Server:
 CPU: dual socket: E5-2630 v2
 RAM: 32 GB ram
 OS: Ubuntu server 14.10
 Kernel: 3.19.0-031900rc2-generic
 btrfs tools: Btrfs v3.14.1
 2x LSI 9300 HBAs - SAS3 12 Gb/s
 8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s

 Both HBAs see all 8 disks, and we have set up multipathing using the
 multipath command and device mapper. We then use this command to
 create the filesystem:

 mkfs.btrfs -f -d raid10 /dev/mapper/prm-0 /dev/mapper/prm-1
 /dev/mapper/prm-2 /dev/mapper/prm-3 /dev/mapper/prm-4
 /dev/mapper/prm-5 /dev/mapper/prm-6 /dev/mapper/prm-7
 You almost certainly DO NOT want to use BTRFS raid10 unless you have known 
 good backups and are willing to deal with the downtime associated with 
 restoring them.  The current incarnation of raid10 in BTRFS is much worse 
 than LVM/MD based soft-raid with respect to data recoverability.  I would 
 suggest using BTRFS raid1 in this case (which behaves like MD-RAID10 when 
 used with more than 2 devices), possibly on top of LVM/MD RAID0 if you really 
 need the performance.


 We run the performance test using the following command:

 fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
 --name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G
 --numjobs=24 --readwrite=randwrite


 The results for random read are more or less comparable with the
 performance of the EXT4 filesystem; we get approximately 300 000 IOPs for
 random read.

 For random write, however, we are getting only about 15 000 IOPs, which
 is much lower than for EXT4 (~200 000 IOPs for RAID10).


 While I don't have any conclusive numbers, I have noticed myself that random
 write based AIO on BTRFS does tend to be slower than on other filesystems.  Also,
 LVM/MD based RAID10 does outperform BTRFS' raid10 implementation, and 
 probably will for quite a while; however, I've also noticed that faster RAM 
 does provide a bigger benefit for BTRFS than it does for LVM (~2.5% greater 
 performance for BTRFS than for LVM when switching from DDR3-1333 to DDR3-1600 
 on otherwise identical hardware), so you might consider looking into that.

 Another thing to consider is that the kernel's default I/O scheduler and the 
 default parameters for that I/O scheduler are almost always suboptimal for 
 SSD's, and this tends to show far more with BTRFS than anything else.  
 Personally I've found that using the CFQ I/O scheduler with the following 
 parameters works best for a majority of SSD's:
 1. slice_idle=0
 2. back_seek_penalty=1
 3. back_seek_max set equal to the size in sectors of the device
 4. nr_requests and quantum set to the hardware command queue depth

 You can easily set these persistently for a given device with a udev rule 
 like this:
   KERNEL=='sda', SUBSYSTEM=='block', ACTION=='add', 
 ATTR{queue/scheduler}='cfq', ATTR{queue/iosched/back_seek_penalty}='1', 
 ATTR{queue/iosched/back_seek_max}='device_size', 
 ATTR{queue/iosched/quantum}='128', ATTR{queue/iosched/slice_idle}='0', 
 ATTR{queue/nr_requests}='128'

 Make sure to replace '128' in the rule with whatever the command queue depth 
 is for the device in question (It's usually 128 or 256, occasionally more), 
 and device_size with the size of the device in kibibytes.




Re: btrfs performance - ssd array

2015-01-12 Thread Austin S Hemmelgarn

On 2015-01-12 10:35, P. Remek wrote:

Another thing to consider is that the kernel's default I/O scheduler and the 
default parameters for that I/O scheduler are almost always suboptimal for SSD's, 
and this tends to show far more with BTRFS than anything else.  Personally 
I've found that using the CFQ I/O scheduler with the following parameters 
works best for a majority of SSD's:
1. slice_idle=0
2. back_seek_penalty=1
3. back_seek_max set equal to the size in sectors of the device
4. nr_requests and quantum set to the hardware command queue depth


I will give these suggestions a try, but I don't expect any big gain.
Notice that the difference between EXT4 and BTRFS random write is
massive - it's 200 000 IOPs vs. 15 000 IOPs, and the device and kernel
parameters are exactly the same (it is the same machine) for both test
scenarios. This suggests that something is dragging down write performance
in the Btrfs implementation.

Notice also that we did some performance tuning (queue scheduling set
to noop, IRQ affinity distribution and pinning to specific NUMA nodes
and cores, etc.).

The stuff about the I/O scheduler is more general advice for dealing 
with SSD's than anything BTRFS specific.  I've found though that on SATA 
(I don't have anywhere near the kind of budget needed for SAS disks, and 
even less so for SAS SSD's) connected SSD's at least, using the no-op 
I/O scheduler gets better small burst performance, but it causes
horrible latency spikes whenever trying to do something that requires 
bulk throughput with random writes (rsync being an excellent example of 
this).


Something else I thought of after my initial reply: due to the COW
nature of BTRFS, you will generally get better performance of metadata 
operations with shallower directory structures (largely because mtime 
updates propagate up the directory tree to the root of the filesystem).






Re: btrfs performance - ssd array

2015-01-12 Thread Wang Shilong
Hello,

 Hello,
 
 we are currently investigating the possibilities and performance limits of
 the Btrfs filesystem. It seems we are getting pretty poor write
 performance, and I would like to ask whether our results make sense
 and whether they are the result of some well-known performance
 bottleneck.
 
 Our setup:
 
 Server:
   CPU: dual socket: E5-2630 v2
   RAM: 32 GB ram
   OS: Ubuntu server 14.10
   Kernel: 3.19.0-031900rc2-generic
   btrfs tools: Btrfs v3.14.1
   2x LSI 9300 HBAs - SAS3 12 Gb/s
   8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s
 
 Both HBAs see all 8 disks, and we have set up multipathing using the
 multipath command and device mapper. We then use this command to
 create the filesystem:
 
 mkfs.btrfs -f -d raid10 /dev/mapper/prm-0 /dev/mapper/prm-1
 /dev/mapper/prm-2 /dev/mapper/prm-3 /dev/mapper/prm-4
 /dev/mapper/prm-5 /dev/mapper/prm-6 /dev/mapper/prm-7
 
 
 We run the performance test using the following command:
 
 fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
 --name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G
 --numjobs=24 --readwrite=randwrite

Could you check how many extents there are with BTRFS and Ext4:
# filefrag test1

This is to see whether bad fragmentation is the problem for BTRFS. I am still
not sure how fio tests random writes here, so there are two possibilities:

case1:
 if fio does not write to the same position several times, I think
 you could add --overwrite=0 and retest to see if it helps.

case2:
if fio randwrite does write to the same position several times, I think
you could use the ‘-o nodatacow’ mount option to verify whether this is
because BTRFS COW caused serious fragmentation.
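
To be concrete (just a sketch; the mount point is only an example), case1
would be your original fio command with --overwrite=0 appended, and case2
would be mounting the filesystem with nodatacow before rerunning the test:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G \
    --numjobs=24 --readwrite=randwrite --overwrite=0

mount -o nodatacow /dev/mapper/prm-0 /mnt/btrfs

Alternatively, for case2 you could set chattr +C on the (still empty) test
file or its directory before fio writes to it, which disables COW for just
that file.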

 
 
 The results for random read are more or less comparable with the
 performance of the EXT4 filesystem; we get approximately 300 000 IOPs for
 random read.
 
 For random write, however, we are getting only about 15 000 IOPs, which
 is much lower than for EXT4 (~200 000 IOPs for RAID10).
 
 
 Regards,
 Premek

Best Regards,
Wang Shilong



Re: btrfs performance - ssd array

2015-01-12 Thread Austin S Hemmelgarn

On 2015-01-12 10:11, Patrik Lundquist wrote:

On 12 January 2015 at 15:54, Austin S Hemmelgarn ahferro...@gmail.com wrote:


Another thing to consider is that the kernel's default I/O scheduler and the 
default parameters for that I/O scheduler are almost always suboptimal for 
SSD's, and this tends to show far more with BTRFS than anything else.  
Personally I've found that using the CFQ I/O scheduler with the following 
parameters works best for a majority of SSD's:
1. slice_idle=0
2. back_seek_penalty=1
3. back_seek_max set equal to the size in sectors of the device
4. nr_requests and quantum set to the hardware command queue depth

You can easily set these persistently for a given device with a udev rule like 
this:
   KERNEL=='sda', SUBSYSTEM=='block', ACTION=='add', ATTR{queue/scheduler}='cfq', 
ATTR{queue/iosched/back_seek_penalty}='1', 
ATTR{queue/iosched/back_seek_max}='device_size', 
ATTR{queue/iosched/quantum}='128', ATTR{queue/iosched/slice_idle}='0', 
ATTR{queue/nr_requests}='128'

Make sure to replace '128' in the rule with whatever the command queue depth is for 
the device in question (It's usually 128 or 256, occasionally more), and 
device_size with the size of the device in kibibytes.



So is it size in sectors of the device or size of the device in
kibibytes for back_seek_max? :-)

Size in kibibytes; sorry about the confusion. I forgot to correct every
instance of saying it was the size in sectors after I reread the documentation.



