Re: btrfs performance - ssd array
Hello,

> Hello,
>
>> Could you check how many extents with BTRFS and Ext4:
>>
>> # filefrag test1
>
> So my findings are odd: On BTRFS, when I run fio with a single worker
> thread (the target file is 12 GB large, and it is 100% random writes of
> 4 KB blocks), the number of extents reported by filefrag is around 3.
> However, when I do the same with 4 worker threads, I get a crazy number
> of extents - "test1: 3141866 extents found". Also, when running with 4
> threads, the sys% utilization takes 80% of the CPU (in the top output I
> see that it is all consumed by kworker processes). On EXT4 I get only
> 13 extents even when running with 4 worker threads. (Note that I
> created RAID10 using mdadm before setting up ext4 there, in order to
> get a storage solution comparable to what we test with BTRFS.) Another
> odd thing is that it takes a very long time for the filefrag utility to
> return the result on BTRFS - not only in the case where I got 3 million
> extents, but also in the first case, where I ran a single worker and
> the number of extents was only 3. Filefrag on EXT4 returns immediately.

So this looks like a btrfs lock contention problem. Take a look at
btrfs_drop_extents(): even with nodatacow there will still be many item
removals in the FS tree, and unfortunately the btrfs lock contention
problems seem very sensitive to item removal operations. And here, your
SSDs are very fast? Hmm, I am not sure about your IOPS numbers. You
could verify this problem by reducing the number of threads, for example
by comparing against the 1-thread results. I also guess the problem
should be less serious for btrfs sequential writes.

>> To see if this is because of bad fragmentation on BTRFS - I am still
>> not sure how fio will test randwrite here, so here are the
>> possibilities:
>>
>> case 1: if fio doesn't write the same position several times, I think
>> you could add --overwrite=0 and retest to see if it helps.
>
> Not sure what parameter you mean here.

I mean '--overwrite'; it is an option for fio.
>> case 2: if fio randwrite did write the same position several times, I
>> think you could use the '-o nodatacow' mount option to verify whether
>> BTRFS COW caused the serious fragmentation.
>
> It seems that mounting with this option does have some effect, but it
> is not very significant and not very deterministic. The IOPS are
> slightly higher at the beginning (~25 000 IOPS), but the performance is
> very spiky and I can still see that CPU sys% is very high. As soon as
> the kworker threads start consuming CPU, the IOPS go down again to some
> ~15 000 IOPS.

Best Regards,
Wang Shilong
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
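For reference, the extent count Premek reads off can be extracted from
filefrag's one-line summary. This is a small sketch that parses a
captured sample line (the filename and count are the ones from the
report above), so it does not need a real filesystem:

```shell
# filefrag prints a summary like "test1: 3141866 extents found";
# split on colons/spaces and take the second field, the extent count.
LINE="test1: 3141866 extents found"
EXTENTS=$(printf '%s\n' "$LINE" | awk -F'[: ]+' '{print $2}')
echo "$EXTENTS"   # prints 3141866
```

On a live system the same pipeline would read `filefrag test1` directly
instead of the captured sample line.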
Re: btrfs performance - ssd array
Hello,

> Could you check how many extents with BTRFS and Ext4:
>
> # filefrag test1

So my findings are odd: On BTRFS, when I run fio with a single worker
thread (the target file is 12 GB large, and it is 100% random writes of
4 KB blocks), the number of extents reported by filefrag is around 3.
However, when I do the same with 4 worker threads, I get a crazy number
of extents - "test1: 3141866 extents found". Also, when running with 4
threads, the sys% utilization takes 80% of the CPU (in the top output I
see that it is all consumed by kworker processes). On EXT4 I get only 13
extents even when running with 4 worker threads. (Note that I created
RAID10 using mdadm before setting up ext4 there, in order to get a
storage solution comparable to what we test with BTRFS.) Another odd
thing is that it takes a very long time for the filefrag utility to
return the result on BTRFS - not only in the case where I got 3 million
extents, but also in the first case, where I ran a single worker and the
number of extents was only 3. Filefrag on EXT4 returns immediately.

> To see if this is because of bad fragmentation on BTRFS - I am still
> not sure how fio will test randwrite here, so here are the
> possibilities:
>
> case 1: if fio doesn't write the same position several times, I think
> you could add --overwrite=0 and retest to see if it helps.

Not sure what parameter you mean here.

> case 2: if fio randwrite did write the same position several times, I
> think you could use the '-o nodatacow' mount option to verify whether
> BTRFS COW caused the serious fragmentation.

It seems that mounting with this option does have some effect, but it is
not very significant and not very deterministic. The IOPS are slightly
higher at the beginning (~25 000 IOPS), but the performance is very
spiky and I can still see that CPU sys% is very high. As soon as the
kworker threads start consuming CPU, the IOPS go down again to some
~15 000 IOPS.
btrfs performance - ssd array
Hello,

we are currently investigating the possibilities and performance limits
of the Btrfs filesystem. It seems we are getting pretty poor performance
for writes, and I would like to ask whether our results make sense and
whether they are the result of some well-known performance bottleneck.

Our setup:

Server CPU: dual socket E5-2630 v2
RAM: 32 GB
OS: Ubuntu Server 14.10
Kernel: 3.19.0-031900rc2-generic
btrfs tools: Btrfs v3.14.1
2x LSI 9300 HBAs - SAS3 12 Gb/s
8x SSD Ultrastar SSD1600MM 400 GB SAS3 12 Gb/s

Both HBAs see all 8 disks, and we have set up multipathing using the
multipath command and device mapper. We then use this command to create
the filesystem:

mkfs.btrfs -f -d raid10 /dev/mapper/prm-0 /dev/mapper/prm-1
/dev/mapper/prm-2 /dev/mapper/prm-3 /dev/mapper/prm-4 /dev/mapper/prm-5
/dev/mapper/prm-6 /dev/mapper/prm-7

We run the performance test using the following command:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G
--numjobs=24 --readwrite=randwrite

The results for random read are more or less comparable with the
performance of the EXT4 filesystem; we get approximately 300 000 IOPS
for random read. For random write, however, we are getting only about
15 000 IOPS, which is much lower than for EXT4 (~200 000 IOPS for
RAID10).

Regards,
Premek
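The reported setup can be collected into one sketch. The commands below
are only assembled and echoed for review, not executed, since they are
destructive and assume the /dev/mapper/prm-* multipath devices from
this report exist:

```shell
# Build the device list and the two commands from the report above.
DEVS=""
for i in 0 1 2 3 4 5 6 7; do
  DEVS="$DEVS /dev/mapper/prm-$i"
done
MKFS_CMD="mkfs.btrfs -f -d raid10$DEVS"
FIO_CMD="fio --randrepeat=1 --ioengine=libaio --direct=1 \
--gtod_reduce=1 --name=test1 --filename=test1 --bs=4k --iodepth=32 \
--size=12G --numjobs=24 --readwrite=randwrite"
# Echo for review; on real hardware, run them after double-checking
# the device paths (mkfs.btrfs -f will destroy existing data).
echo "$MKFS_CMD"
echo "$FIO_CMD"
```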
Re: btrfs performance - ssd array
On 2015-01-12 08:51, P. Remek wrote:
> Hello,
>
> we are currently investigating the possibilities and performance
> limits of the Btrfs filesystem. It seems we are getting pretty poor
> performance for writes, and I would like to ask whether our results
> make sense and whether they are the result of some well-known
> performance bottleneck.
>
> Our setup:
>
> Server CPU: dual socket E5-2630 v2
> RAM: 32 GB
> OS: Ubuntu Server 14.10
> Kernel: 3.19.0-031900rc2-generic
> btrfs tools: Btrfs v3.14.1
> 2x LSI 9300 HBAs - SAS3 12 Gb/s
> 8x SSD Ultrastar SSD1600MM 400 GB SAS3 12 Gb/s
>
> Both HBAs see all 8 disks, and we have set up multipathing using the
> multipath command and device mapper. We then use this command to
> create the filesystem:
>
> mkfs.btrfs -f -d raid10 /dev/mapper/prm-0 /dev/mapper/prm-1
> /dev/mapper/prm-2 /dev/mapper/prm-3 /dev/mapper/prm-4
> /dev/mapper/prm-5 /dev/mapper/prm-6 /dev/mapper/prm-7

You almost certainly DO NOT want to use BTRFS raid10 unless you have
known good backups and are willing to deal with the downtime associated
with restoring them. The current incarnation of raid10 in BTRFS is much
worse than LVM/MD-based soft-raid with respect to data recoverability.
I would suggest using BTRFS raid1 in this case (which behaves like
MD-RAID10 when used with more than 2 devices), possibly on top of
LVM/MD RAID0 if you really need the performance.

> We run the performance test using the following command:
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G
> --numjobs=24 --readwrite=randwrite
>
> The results for random read are more or less comparable with the
> performance of the EXT4 filesystem; we get approximately 300 000 IOPS
> for random read. For random write, however, we are getting only about
> 15 000 IOPS, which is much lower than for EXT4 (~200 000 IOPS for
> RAID10).

While I don't have any conclusive numbers, I have noticed myself that
random-write-based AIO on BTRFS does tend to be slower than on other
filesystems.
Also, LVM/MD-based RAID10 does outperform BTRFS's raid10 implementation,
and probably will for quite a while; however, I've also noticed that
faster RAM provides a bigger benefit for BTRFS than it does for LVM
(~2.5% greater performance for BTRFS than for LVM when switching from
DDR3-1333 to DDR3-1600 on otherwise identical hardware), so you might
consider looking into that.

Another thing to consider is that the kernel's default I/O scheduler,
and the default parameters for that I/O scheduler, are almost always
suboptimal for SSDs, and this tends to show far more with BTRFS than
anything else. Personally, I've found that using the CFQ I/O scheduler
with the following parameters works best for a majority of SSDs:

1. slice_idle=0
2. back_seek_penalty=1
3. back_seek_max set equal to the size in sectors of the device
4. nr_requests and quantum set to the hardware command queue depth

You can easily set these persistently for a given device with a udev
rule like this:

KERNEL=="sda", SUBSYSTEM=="block", ACTION=="add",
ATTR{queue/scheduler}="cfq", ATTR{queue/iosched/back_seek_penalty}="1",
ATTR{queue/iosched/back_seek_max}="device_size",
ATTR{queue/iosched/quantum}="128", ATTR{queue/iosched/slice_idle}="0",
ATTR{queue/nr_requests}="128"

Make sure to replace '128' in the rule with whatever the command queue
depth is for the device in question (it's usually 128 or 256,
occasionally more), and device_size with the size of the device in
kibibytes.
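As a sketch of the substitution step Austin describes, the rule can be
generated with the two device-specific values filled in. QUEUE_DEPTH=256
and SIZE_KIB=390711384 are sample numbers here (a hypothetical 256-deep
queue and a nominal 400 GB device), not values read from real hardware;
on a live system they would come from the drive's reported queue depth
and from `blockdev --getsize64` divided by 1024:

```shell
# Sample device-specific values (assumptions, see lead-in).
QUEUE_DEPTH=256
SIZE_KIB=390711384
# Assemble the udev rule text with the values substituted in.
RULE="KERNEL==\"sda\", SUBSYSTEM==\"block\", ACTION==\"add\", \
ATTR{queue/scheduler}=\"cfq\", \
ATTR{queue/iosched/back_seek_penalty}=\"1\", \
ATTR{queue/iosched/back_seek_max}=\"$SIZE_KIB\", \
ATTR{queue/iosched/quantum}=\"$QUEUE_DEPTH\", \
ATTR{queue/iosched/slice_idle}=\"0\", \
ATTR{queue/nr_requests}=\"$QUEUE_DEPTH\""
echo "$RULE"
```

The echoed line would then be saved under /etc/udev/rules.d/ to take
effect on the next device add event.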
Re: btrfs performance - ssd array
On 12 January 2015 at 15:54, Austin S Hemmelgarn <ahferro...@gmail.com>
wrote:
> Another thing to consider is that the kernel's default I/O scheduler,
> and the default parameters for that I/O scheduler, are almost always
> suboptimal for SSDs, and this tends to show far more with BTRFS than
> anything else. Personally, I've found that using the CFQ I/O scheduler
> with the following parameters works best for a majority of SSDs:
>
> 1. slice_idle=0
> 2. back_seek_penalty=1
> 3. back_seek_max set equal to the size in sectors of the device
> 4. nr_requests and quantum set to the hardware command queue depth
>
> You can easily set these persistently for a given device with a udev
> rule like this:
>
> KERNEL=="sda", SUBSYSTEM=="block", ACTION=="add",
> ATTR{queue/scheduler}="cfq",
> ATTR{queue/iosched/back_seek_penalty}="1",
> ATTR{queue/iosched/back_seek_max}="device_size",
> ATTR{queue/iosched/quantum}="128",
> ATTR{queue/iosched/slice_idle}="0", ATTR{queue/nr_requests}="128"
>
> Make sure to replace '128' in the rule with whatever the command queue
> depth is for the device in question (it's usually 128 or 256,
> occasionally more), and device_size with the size of the device in
> kibibytes.

So is it size in sectors of the device, or size of the device in
kibibytes, for back_seek_max? :-)
Re: btrfs performance - ssd array
> Another thing to consider is that the kernel's default I/O scheduler,
> and the default parameters for that I/O scheduler, are almost always
> suboptimal for SSDs, and this tends to show far more with BTRFS than
> anything else. Personally, I've found that using the CFQ I/O scheduler
> with the following parameters works best for a majority of SSDs:
>
> 1. slice_idle=0
> 2. back_seek_penalty=1
> 3. back_seek_max set equal to the size in sectors of the device
> 4. nr_requests and quantum set to the hardware command queue depth

I will give these suggestions a try, but I don't expect any big gain.
Notice that the difference between EXT4 and BTRFS random write is
massive - it's 200 000 IOPS vs. 15 000 IOPS, and the device and kernel
parameters are exactly the same (it is the same machine) for both test
scenarios. This suggests that something is dragging down write
performance in the Btrfs implementation. Notice also that we did some
performance tuning (queue scheduling set to noop, IRQ affinity
distribution and pinning to specific NUMA nodes and cores, etc.).

Regards,
Premek
Re: btrfs performance - ssd array
On 2015-01-12 10:35, P. Remek wrote:
>> Another thing to consider is that the kernel's default I/O scheduler,
>> and the default parameters for that I/O scheduler, are almost always
>> suboptimal for SSDs, and this tends to show far more with BTRFS than
>> anything else. Personally, I've found that using the CFQ I/O
>> scheduler with the following parameters works best for a majority of
>> SSDs:
>>
>> 1. slice_idle=0
>> 2. back_seek_penalty=1
>> 3. back_seek_max set equal to the size in sectors of the device
>> 4. nr_requests and quantum set to the hardware command queue depth
>
> I will give these suggestions a try, but I don't expect any big gain.
> Notice that the difference between EXT4 and BTRFS random write is
> massive - it's 200 000 IOPS vs. 15 000 IOPS, and the device and kernel
> parameters are exactly the same (it is the same machine) for both test
> scenarios. This suggests that something is dragging down write
> performance in the Btrfs implementation. Notice also that we did some
> performance tuning (queue scheduling set to noop, IRQ affinity
> distribution and pinning to specific NUMA nodes and cores, etc.).

The stuff about the I/O scheduler is more general advice for dealing
with SSDs than anything BTRFS-specific. I've found, though, that on
SATA-connected SSDs at least (I don't have anywhere near the kind of
budget needed for SAS disks, and even less so for SAS SSDs), using the
no-op I/O scheduler gets better small-burst performance, but it causes
horrible latency spikes whenever trying to do something that requires
bulk throughput with random writes (rsync being an excellent example of
this).

Something else I thought of after my initial reply: due to the COW
nature of BTRFS, you will generally get better performance for metadata
operations with shallower directory structures (largely because mtime
updates propagate up the directory tree to the root of the filesystem).
Re: btrfs performance - ssd array
Hello,

> Hello,
>
> we are currently investigating the possibilities and performance
> limits of the Btrfs filesystem. It seems we are getting pretty poor
> performance for writes, and I would like to ask whether our results
> make sense and whether they are the result of some well-known
> performance bottleneck.
>
> Our setup:
>
> Server CPU: dual socket E5-2630 v2
> RAM: 32 GB
> OS: Ubuntu Server 14.10
> Kernel: 3.19.0-031900rc2-generic
> btrfs tools: Btrfs v3.14.1
> 2x LSI 9300 HBAs - SAS3 12 Gb/s
> 8x SSD Ultrastar SSD1600MM 400 GB SAS3 12 Gb/s
>
> Both HBAs see all 8 disks, and we have set up multipathing using the
> multipath command and device mapper. We then use this command to
> create the filesystem:
>
> mkfs.btrfs -f -d raid10 /dev/mapper/prm-0 /dev/mapper/prm-1
> /dev/mapper/prm-2 /dev/mapper/prm-3 /dev/mapper/prm-4
> /dev/mapper/prm-5 /dev/mapper/prm-6 /dev/mapper/prm-7
>
> We run the performance test using the following command:
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test1 --filename=test1 --bs=4k --iodepth=32 --size=12G
> --numjobs=24 --readwrite=randwrite

Could you check how many extents with BTRFS and Ext4:

# filefrag test1

To see if this is because of bad fragmentation on BTRFS. I am still not
sure how fio will test randwrite here, so here are the possibilities:

case 1: if fio doesn't write the same position several times, I think
you could add --overwrite=0 and retest to see if it helps.

case 2: if fio randwrite did write the same position several times, I
think you could use the '-o nodatacow' mount option to verify whether
BTRFS COW caused the serious fragmentation.

> The results for random read are more or less comparable with the
> performance of the EXT4 filesystem; we get approximately 300 000 IOPS
> for random read. For random write, however, we are getting only about
> 15 000 IOPS, which is much lower than for EXT4 (~200 000 IOPS for
> RAID10).
>
> Regards,
> Premek

Best Regards,
Wang Shilong
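For trying case 2, here is a minimal sketch of two ways to disable COW,
assuming a hypothetical mount point /mnt/btrfs. The commands are only
assembled and echoed, since they require a real btrfs mount; note that
chattr +C only takes effect on a still-empty file, and that applying
nodatacow via remount may require a fresh mount instead on some
kernels:

```shell
# Filesystem-wide: remount with the nodatacow option (affects newly
# written data).
MOUNT_CMD="mount -o remount,nodatacow /mnt/btrfs"
# Per-file alternative: set the NOCOW attribute on the empty target
# file before fio writes to it.
CHATTR_CMD="touch /mnt/btrfs/test1 && chattr +C /mnt/btrfs/test1"
echo "$MOUNT_CMD"
echo "$CHATTR_CMD"
```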
Re: btrfs performance - ssd array
On 2015-01-12 10:11, Patrik Lundquist wrote:
> On 12 January 2015 at 15:54, Austin S Hemmelgarn
> <ahferro...@gmail.com> wrote:
>> Another thing to consider is that the kernel's default I/O scheduler,
>> and the default parameters for that I/O scheduler, are almost always
>> suboptimal for SSDs, and this tends to show far more with BTRFS than
>> anything else. Personally, I've found that using the CFQ I/O
>> scheduler with the following parameters works best for a majority of
>> SSDs:
>>
>> 1. slice_idle=0
>> 2. back_seek_penalty=1
>> 3. back_seek_max set equal to the size in sectors of the device
>> 4. nr_requests and quantum set to the hardware command queue depth
>>
>> You can easily set these persistently for a given device with a udev
>> rule like this:
>>
>> KERNEL=="sda", SUBSYSTEM=="block", ACTION=="add",
>> ATTR{queue/scheduler}="cfq",
>> ATTR{queue/iosched/back_seek_penalty}="1",
>> ATTR{queue/iosched/back_seek_max}="device_size",
>> ATTR{queue/iosched/quantum}="128",
>> ATTR{queue/iosched/slice_idle}="0", ATTR{queue/nr_requests}="128"
>>
>> Make sure to replace '128' in the rule with whatever the command
>> queue depth is for the device in question (it's usually 128 or 256,
>> occasionally more), and device_size with the size of the device in
>> kibibytes.
>
> So is it size in sectors of the device, or size of the device in
> kibibytes, for back_seek_max? :-)

Size in kibibytes; sorry about the confusion. I forgot to correct every
instance of saying it was size in sectors after I reread the
documentation.
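A worked example of the kibibyte conversion for back_seek_max, assuming
a device that reports 400088457216 bytes (a sample value for a nominal
400 GB drive; on real hardware the byte count would come from
`blockdev --getsize64 /dev/sda`):

```shell
# Convert the device size from bytes to kibibytes for back_seek_max.
SIZE_BYTES=400088457216
SIZE_KIB=$((SIZE_BYTES / 1024))
echo "$SIZE_KIB"   # prints 390711384
```

The result would then be written to
/sys/block/sda/queue/iosched/back_seek_max, or substituted into the
udev rule above in place of device_size.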